Article

YOLO-WAS: A Lightweight Apple Target Detection Method Based on Improved YOLO11

1 College of Agricultural Equipment Engineering, Henan University of Science and Technology, Luoyang 471003, China
2 Longmen Laboratory, Luoyang 471000, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(14), 1521; https://doi.org/10.3390/agriculture15141521
Submission received: 23 June 2025 / Revised: 11 July 2025 / Accepted: 12 July 2025 / Published: 14 July 2025
(This article belongs to the Section Digital Agriculture)

Abstract

Target detection is a key technology for apple-picking robots. To overcome the limitations of existing apple detection methods, including low recognition accuracy for multiple apple varieties in complex orchard environments and complex network architectures with large memory footprints, a lightweight apple recognition model based on an improved YOLO11, named YOLO-WAS, is proposed. The model aims to achieve efficient and accurate multi-variety apple identification while reducing computational resource consumption and facilitating real-time applications on low-power devices. First, a high-quality multi-variety apple dataset was constructed, and its complexity and diversity were increased through various data augmentation techniques. In the YOLO-WAS model, the ordinary convolution modules of YOLO11 are replaced with the ADown module proposed in YOLOv9, the backbone C3K2 module is combined with Wavelet Transform Convolution (WTConv), and the Spatial and Channel Synergistic Attention (SCSA) module is combined with the C2PSA attention mechanism to form the C2PSA_SCSA module. Through these improvements, the model remains lightweight while achieving significantly improved performance. Experimental results show that the proposed YOLO-WAS model achieves a precision (P) of 0.958, a recall (R) of 0.921, a mean average precision at an IoU threshold of 0.5 (mAP@50) of 0.970, and a mean average precision averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05 (mAP@50:95) of 0.835. Compared with the baseline model, YOLO-WAS reduces the number of parameters and floating-point operations by 22.8% and 20.6%, respectively. These results demonstrate that the model performs competitively in apple detection tasks and has the potential to meet real-time detection requirements in resource-constrained environments, thereby contributing to the advancement of automated orchard management.

1. Introduction

China has a large and diverse fruit industry, with orchard area and fruit production ranking among the highest in the world for many years [1]. China has long maintained the largest apple cultivation area and production volume in the world. In 2022, the country’s apple planting area reached approximately 3.02 million hectares, with a total output of 41.83 million tons, accounting for nearly half of the global apple production [2,3]. However, fruit harvesting remains a highly labor-intensive task. With the advancement of agricultural automation, intelligent harvesting is gradually gaining importance, especially in the area of automated apple picking and visual recognition, which has become a prominent focus of current research [4,5,6,7]. In complex orchard environments, fruit detection is often hindered by challenges such as occlusion and varying lighting conditions, which significantly impact the accuracy and stability of recognition systems used in harvesting robots [8,9,10,11]. Therefore, improving the precision and speed of fruit detection, while enhancing the robustness of visual systems under complex environmental conditions, is essential for achieving efficient and reliable automated harvesting.
In recent years, fruit-picking robot technology has advanced rapidly, particularly in areas such as environmental adaptability, target detection accuracy, and harvesting efficiency. From a technological evolution perspective, fruit recognition methods have progressed from traditional digital image processing techniques to machine learning-based image segmentation and classification methods, and more recently to the mainstream use of deep learning approaches based on convolutional neural networks (CNNs) [12,13,14,15]. Early traditional image processing methods relied heavily on explicit fruit feature information. For example, Wei et al. [16] employed an improved Otsu adaptive thresholding algorithm in combination with features from the OHTA color space for fruit detection. However, this approach depended solely on color features, making it highly sensitive to environmental variations, resulting in poor robustness and fluctuating detection accuracy. Subsequently, machine learning techniques were introduced into object detection and recognition tasks. Moallem et al. [17] utilized K-means clustering to detect the calyx region based on the Cb component in the YCbCr color space, followed by multilayer perceptron neural networks for defect segmentation. Statistical, texture, and geometric features were then extracted from the segmented regions to perform recognition. Nevertheless, such methods are highly dependent on parameter settings; for instance, the K value must be predefined, yet determining an optimal value is challenging. This limits the generalization ability of the models and reduces their adaptability to varying environments [18]. In summary, although traditional image processing and machine learning methods have achieved some success in fruit detection tasks, they suffer from strong dependence on hand-crafted features, high sensitivity to environmental interference, difficulty in parameter tuning, and limited generalization capacity. These limitations hinder their ability to meet the demands of efficient and robust fruit recognition in complex orchard environments.
Due to its remarkable capability in extracting high-dimensional features, deep learning has been widely applied to target detection and recognition tasks in fruit-picking robots [19,20,21,22,23]. Wu [24] developed DNE-YOLO based on YOLOv8, which demonstrated robust detection performance across various lighting conditions, confirming its adaptability to complex weather scenarios. Liu et al. [25] introduced a low-computation partial depthwise convolution (PDWConv) structure and an efficient EIoU loss function, advancing the deployment of object detection models on edge computing devices. Shi et al. [26] enhanced YOLOv7 for apple fruitlet detection by incorporating the SE module, Atrous Spatial Pyramid Pooling (ASPP), the Convolutional Block Attention Module (CBAM), and an additional P2 layer, resulting in improved detection accuracy. Wang et al. [27] proposed an optimized apple picking path planning approach based on YOLOv5. By replacing the standard convolutional modules in the YOLOv5 backbone with inverted residual blocks from MobileNetV2, they reduced the model size by 57% and improved data processing speed by 26.8%, resulting in a lightweight model of only 6.01 MB—suitable for deployment on resource-constrained devices. Bedi et al. [28] introduced PlantGhostNet, a lightweight CNN that combines Ghost modules with Squeeze-and-Excitation modules for the identification of bacterial spot disease in peach trees. The model achieved high accuracy (99.75% training and 99.51% validation) with a low parameter count, making it ideal for low-resource environments. These studies collectively demonstrate the significant progress of deep learning in orchard fruit detection, showcasing a trend toward model diversification and high-performance solutions. However, most current approaches focus on detecting a single apple variety and rarely consider inter-varietal differences in color, size, and shape, which limits the general applicability of the models. Moreover, in complex orchard environments—characterized by varying lighting conditions, fruit occlusion, and overlapping targets—existing models still face challenges in achieving high detection accuracy and generalization. While some models excel in accuracy, this often comes at the cost of increased parameter size and computational complexity. Therefore, there is an urgent need for apple detection models that not only maintain high accuracy but also feature lightweight and efficient designs to meet the demands of real-world applications.
In summary, this paper proposes a lightweight apple detection algorithm based on an improved YOLO11. Images of three apple varieties were collected in complex scenes to build a dataset, and an efficient lightweight model, YOLO-WAS, is proposed. The main contributions of this study are as follows:
(1)
To meet the requirements of real-time recognition of the three apple varieties, the model takes YOLO11 as its basic architecture. It combines the backbone C3K2 module with wavelet convolution (WTConv) and uses the wavelet transform to address the over-parameterization problem encountered by convolutional neural networks (CNNs) when realizing large receptive fields, providing a more efficient, robust, and easy-to-integrate convolutional layer solution.
(2)
The ADown module, originally proposed in YOLOv9, is adopted and adapted in this work to improve the downsampling efficiency of the convolutional components of YOLO11.
(3)
The spatial and channel synergistic attention (SCSA) module is incorporated to extend the existing C2PSA attention mechanism in YOLO11, resulting in the C2PSA_SCSA module. This integration is intended to combine spatial and channel attention more effectively and to better exploit multi-scale semantic features.

2. Materials and Methods

2.1. Dataset Construction

2.1.1. Dataset Acquisition

To comprehensively evaluate the detection performance of the YOLO-WAS model proposed in this paper on multiple apple varieties, we built a multi-variety apple image dataset covering a range of scenes. The images were captured in October 2024 with a tripod-mounted mobile phone (Apple iPhone 15) at a resolution of 4284 × 4284 pixels. Three apple varieties were selected as targets in a standard picking orchard in Luoyang City, Henan Province. To ensure dataset diversity, the acquisition was carried out at different times, under different lighting conditions, and from different viewpoints. The different apple varieties were sampled under a variety of conditions, such as near and far distances and front-lit, backlit, and low-light conditions, to cover more scene variations and environmental factors. In the end, a total of 1653 images were collected. Figure 1 shows the shooting process and sample images of the different apple varieties taken at different shooting distances and under different lighting conditions.

2.1.2. Dataset Creation

To improve the accuracy and robustness of the model and avoid overfitting, comprehensive pre-processing was carried out on the acquired image data. The dataset included 692 images of Yantai Fuji apples, 510 of Cream Fuji apples, and 451 of Gala apples. First, all images were resized to 640 × 640 pixels. To further enrich the dataset, data augmentation techniques based on random strategies, such as scaling, cropping, flipping, noise addition, and brightness adjustment, were applied so that the model could better cope with the variations and complex situations encountered in practical applications. In addition, LabelImg annotation software (version 1.8.6) was used for labeling, and detailed bounding box annotations were applied uniformly to ensure the consistency and accuracy of the annotations. The final dataset contained the three varieties at an approximate ratio of 1:1:1, with a total of 4267 images. While preserving class balance, the dataset was divided into training, testing, and validation sets at a ratio of 8:1:1. Figure 2 shows sample images from the dataset, displaying images before data augmentation and after processing such as rotation, scaling, and noise addition.
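The augmentation operations described above can be illustrated with a short script. The following is a minimal sketch using the Albumentations library as one possible implementation; the probabilities, magnitudes, crop size, and file names are illustrative assumptions, not the authors' actual settings.

```python
import albumentations as A
import cv2

# Illustrative augmentation pipeline approximating the random scaling, cropping,
# flipping, noise addition, and brightness adjustment described above.
# All probabilities and magnitudes are assumptions, not the authors' settings.
augment = A.Compose(
    [
        A.RandomScale(scale_limit=0.2, p=0.5),                     # random scaling
        A.RandomCrop(height=3000, width=3000, p=0.5),              # random cropping (source images are 4284 x 4284)
        A.HorizontalFlip(p=0.5),                                   # flipping
        A.GaussNoise(p=0.3),                                       # noise addition
        A.RandomBrightnessContrast(brightness_limit=0.3, p=0.5),   # brightness adjustment
        A.Resize(height=640, width=640),                           # final training resolution
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

image = cv2.imread("apple_sample.jpg")                  # hypothetical file name
boxes = [[0.5, 0.5, 0.2, 0.3]]                          # one YOLO-format box (cx, cy, w, h)
augmented = augment(image=image, bboxes=boxes, class_labels=[0])
```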

2.2. Model Improvement

2.2.1. YOLO-WAS Model

Our work uses the official version of YOLO11 released by Ultralytics (Frederick, MD, USA) in October 2024 as the fundamental detection framework. Compared with previous versions, YOLO11 has made significant improvements in architecture and training methods. Compared to the original YOLOv8 model, YOLO11 replaces the original C2f module with an improved C3K2 module and adds a C2PSA module, an extension of C2f, after the SPPF module. This module introduces Position-Sensitive Attention (PSA), combining multi-head attention and feedforward neural networks to enhance feature extraction capability. YOLO11 also adopts an improved backbone and neck architecture, which greatly enhances feature extraction and enables more accurate target detection in complex tasks [29].
However, YOLO11 still has shortcomings in real-world applications in complex environments: the model's large number of parameters and weights makes it difficult to deploy on resource-constrained platforms (e.g., mobile devices), and there is still considerable room for improving the accuracy of recognizing targets of different varieties. Therefore, to further improve accuracy while reducing computational complexity and the number of parameters, and to make the model more lightweight and efficient, this study proposes improvements to the YOLO11 model. The structure of the improved model is shown in Figure 3. First, the C3K2 module in the original backbone is combined with wavelet convolution (WTConv) to form the C3K2_WT module; the wavelet transform is used to address the over-parameterization problem encountered by convolutional neural networks (CNNs) when realizing large receptive fields, improving accuracy to a certain degree while slightly reducing the number of parameters. Second, the ADown module proposed in YOLOv9 is introduced to replace the ordinary convolution modules of YOLO11 for downsampling, which not only improves accuracy but also reduces model complexity by reducing the number of parameters. Finally, the spatial and channel synergistic attention module SCSA is introduced to improve the C2PSA attention mechanism in YOLO11, forming the C2PSA_SCSA module, which effectively combines the advantages of spatial and channel attention and further improves the detection accuracy of the model for different apple varieties.

2.2.2. C3K2_WT Module

In the YOLO11 model, the C3K2 module is an important feature extraction component. It usually divides the input features into two parts: one part passes through an ordinary convolution operation, while the other part passes through multiple C3K blocks (when the c3k parameter is set to True) or Bottleneck structures for deep feature extraction, and the two parts are finally concatenated. The C3K2 module introduces a variable convolutional kernel C3K, where K is a tunable convolutional kernel size (3 × 3, 5 × 5, etc.), combined with a channel separation strategy. This extends the receptive field so that a wider range of contextual information can be captured and improves the efficiency of the model's feature extraction for real-time target detection in complex scenarios.
However, while expanding the receptive field, this also increases model complexity, causes over-parameterization, and limits the platforms on which the model can be deployed. The introduction of wavelet convolution effectively addresses this problem. Traditional CNNs are limited by the size of the convolution kernel, which makes it difficult to effectively capture global contextual information [30]. WTConv utilizes the wavelet transform to expand the convolutional receptive field through multi-frequency responses and performs small-kernel convolution operations in different frequency ranges. With wavelet decomposition, the model can capture low-frequency information over a wider range while avoiding over-parameterization [31].
Wavelet transform is a powerful mathematical technique that allows an image to be decomposed into multiple frequency components across different scales. Unlike the Fourier transform, which only provides frequency information, the wavelet transform offers simultaneous localization in both spatial and frequency domains [32]. This property makes it especially suitable for detecting and analyzing localized features in images. The two-dimensional continuous wavelet transform (2D-CWT) of an image I (x, y) can be expressed as follows:
$$W(a, b) = \iint I(x, y)\, \psi_{a,b}(x, y)\, dx\, dy$$
where $\psi_{a,b}(x, y)$ represents the wavelet function, and a and b are the scale and translation parameters, respectively.
In the WTConv module, the conventional standard convolution operation is replaced by a wavelet-based filtering mechanism, enabling the network to perform localized frequency decomposition of the input feature maps. Mathematically, this operation is analogous to traditional convolution; however, the convolution kernels are constituted by wavelet functions, allowing effective capture of local features across multiple scales within the image. Consequently, the WTConv operation fuses spatial and frequency information, enhancing the representational power for feature extraction. The output of the WTConv operation can be expressed as:
$$I_{\mathrm{out}}(x, y) = \sum_{i,j} I(i, j)\, \psi_{a,b}(x - i,\, y - j)$$
where $\psi_{a,b}(x - i, y - j)$ refers to a wavelet filter parameterized by scale a and translation b, which enables localized feature extraction across different spatial resolutions. The double summation over i and j defines a discrete two-dimensional convolution in which the input image is convolved with the wavelet filter. This formulation facilitates multi-scale representation of image features, offering improved spatial-frequency localization compared to standard convolution.
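To make the wavelet-convolution idea above concrete, the following PyTorch sketch performs one level of Haar wavelet decomposition as a strided depthwise convolution, applies a small convolution inside each frequency band, and reconstructs the feature map. It is a simplified illustration of the WTConv principle under the stated assumptions, not the authors' implementation or the original WTConv code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleWTConv2d(nn.Module):
    """Illustrative wavelet convolution: one level of Haar decomposition implemented as a
    strided depthwise convolution, a small convolution on each frequency band, and
    reconstruction via the transposed filters. A simplified sketch, not the original WTConv."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # 2x2 orthonormal Haar analysis filters: LL, LH, HL, HH
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        filt = torch.stack([ll, lh, hl, hh]).unsqueeze(1)               # (4, 1, 2, 2)
        self.register_buffer("haar", filt.repeat(channels, 1, 1, 1))    # (4*C, 1, 2, 2), depthwise
        # small-kernel depthwise convolution applied within each frequency band
        self.band_conv = nn.Conv2d(4 * channels, 4 * channels, kernel_size,
                                   padding=kernel_size // 2, groups=4 * channels, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape                                            # h and w assumed even
        bands = F.conv2d(x, self.haar, stride=2, groups=c)              # wavelet decomposition -> (B, 4C, H/2, W/2)
        bands = self.band_conv(bands)                                   # filter each band with a small kernel
        out = F.conv_transpose2d(bands, self.haar, stride=2, groups=c)  # inverse transform back to (B, C, H, W)
        return out
```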
This study proposes a lightweight improvement method for the C3K2 feature extraction module. By introducing wavelet transform (WT), this method effectively addresses the issue of overparameterization faced by convolutional neural networks (CNNs) when pursuing large receptive fields. Specifically, we combine the wavelet transform with the Bottleneck structure in the C3K2 module to form a new WTBottleneck structure. Under specific conditions, replacing the traditional convolutional layer with a WTConv2d layer not only preserves the basic architecture and functionality of the original Bottleneck module but also successfully incorporates the advantages of wavelet convolution. This improvement significantly enhances the model’s ability to process multi-frequency information in images. Figure 4 illustrates the original architecture of C3K2 and the improved C3K2_WT module structure.
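Building on the sketch above, the WTBottleneck described in this section can be illustrated as a standard bottleneck whose second convolution is replaced by the wavelet convolution; the structure and names below are assumptions for illustration only.

```python
import torch.nn as nn

class WTBottleneck(nn.Module):
    """Illustrative WTBottleneck: a bottleneck block whose second convolution is
    replaced by a wavelet convolution (SimpleWTConv2d from the sketch above)."""

    def __init__(self, c_in: int, c_out: int, shortcut: bool = True):
        super().__init__()
        self.cv1 = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(),
        )
        self.cv2 = SimpleWTConv2d(c_out)   # wavelet convolution in place of the second 3x3 conv
        self.add = shortcut and c_in == c_out

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y    # residual connection when shapes match
```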

2.2.3. ADown Module

In deep learning models, downsampling is a common technique to reduce the spatial dimension of feature maps, which helps the model capture the features of images at a higher level while reducing the amount of computation [33]. The ADown module in YOLOv9 is a convolution block for downsampling operations in object detection tasks. It provides an efficient downsampling solution for real-time object detection through lightweight design and flexibility.
In this study, the ordinary convolution modules in YOLO11 are replaced with the ADown module. In the backbone, ADown can be used to downsample between different layers of the feature map, and in the neck it helps to further refine the resolution of the feature map for more accurate target detection. The ADown module uses convolutional layers to extract useful information from the feature map, reduces the spatial dimensions of the feature map by adjusting the stride of the convolutional layers, and optimizes the number of parameters in the convolutional layers to reduce model complexity. More importantly, its learning ability can be adjusted for different data scenarios. Specifically, ADown refines feature maps through an optimized structural design, maintaining high-resolution expression capability while downsampling and thereby mitigating the impact of feature loss [34]. After the introduction of this module, the number of parameters and the weight size of the model decrease significantly, and accuracy is improved. The structure of the ADown network is shown in Figure 5, where h denotes the height of the feature map, w the width, and c the number of channels. The ADown module is designed to reduce the spatial resolution of the feature map while retaining essential semantic information [35]. As illustrated in Figure 5, the input feature map with dimensions h × w × c first passes through an average pooling layer (AvgPool2d), resulting in a reduced size of (h − 1) × (w − 1) × c. It is then split into two branches, each with dimensions (h − 1) × (w − 1) × c/2. One branch undergoes a max pooling operation (MaxPool2d), further reducing its size to h/2 × w/2 × c/2, while the other branch is processed by a series of convolutions (3 × 3 followed by 1 × 1), also yielding a size of h/2 × w/2 × c/2. Finally, the outputs of the two branches are concatenated (Concat) to produce a unified feature map with dimensions h/2 × w/2 × c. This architecture effectively preserves critical semantic features essential for apple detection while significantly reducing computational complexity [35].
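A compact PyTorch sketch of this downsampling block is given below. It follows the branch layout commonly used for ADown in YOLOv9 (average pooling, channel split, a 3 × 3 stride-2 convolution branch and a max-pool plus 1 × 1 convolution branch, then concatenation); the exact layer hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ADown(nn.Module):
    """Sketch of the ADown downsampling block: average pooling, channel split,
    a 3x3 stride-2 convolution branch and a max-pool + 1x1 convolution branch,
    then concatenation. Hyperparameters follow the common YOLOv9 layout (assumed)."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        c_half = c_out // 2
        self.cv1 = nn.Sequential(
            nn.Conv2d(c_in // 2, c_half, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.cv2 = nn.Sequential(
            nn.Conv2d(c_in // 2, c_half, 1, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        x = F.avg_pool2d(x, kernel_size=2, stride=1)                   # (h-1) x (w-1), smooths features
        x1, x2 = x.chunk(2, dim=1)                                     # split channels into two branches
        x1 = self.cv1(x1)                                              # 3x3 stride-2 convolution branch
        x2 = F.max_pool2d(x2, kernel_size=3, stride=2, padding=1)      # max-pooling branch
        x2 = self.cv2(x2)                                              # 1x1 convolution after pooling
        return torch.cat((x1, x2), dim=1)                              # h/2 x w/2 x c_out feature map
```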

2.2.4. C2PSA_SCSA Module

The Synergistic Attention Module (SCSA) is designed to exploit the synergistic effect between spatial and channel attention modules to bring significant enhancement to a variety of downstream vision tasks [36]. SCSA consists of two main components: shared multisemantic spatial attention (SMSA) and progressive channel self-attention (PCSA). SMSA utilizes multi-scale depth-shared one-dimensional convolution to capture multisemantic spatial information to enhance local and global feature representations [37]. Specifically, the SCSA module first uses SMSA (multi-scale spatial attention) to process the original feature map, thereby enhancing the ability to express spatial information at different scales. Subsequently, the features are recalibrated between channels through PCSA (Channel Attention). This mechanism enables the model to focus more clearly on the marked area and reduce background noise [38]. In complex agricultural environments such as orchards, it can focus more clearly on the target area of apples, significantly suppressing noise caused by background interference from leaves and branches. In addition, SCSA can effectively enhance the backbone network’s perception of small targets (such as distant or partially occluded apples) without introducing additional parameters and computational overhead [39]. Especially during downsampling, SCSA assigns higher weights to small target features, thereby retaining more granular information. This design plays an important role in improving the detection accuracy of apple targets under complex natural conditions, verifying the adaptability and effectiveness of SCSA in agricultural target detection tasks. The overall architecture is shown in Figure 6.
In Figure 6, B represents the batch size, C the number of channels, and H and W the height and width of the feature map; MS-DWConv denotes the shared one-dimensional depthwise convolutions with multiple receptive fields. To address the limited receptive field caused by decomposing the features into the H and W dimensions and applying one-dimensional convolution to each, a lightweight shared convolution is used for alignment, implicitly modeling the dependencies between the two dimensions by learning consistent features on both. In PCSA, the self-attention mechanism is computed along the channel dimension, where $Q, K, V \in \mathbb{R}^{B \times C \times N}$.
The C2PSA module of YOLO11 originally combines the PSA module for enhanced feature extraction with attention mechanisms. However, the multisemantic information inherent in the spatial and channel dimensions is neglected, even though attention to the channel and spatial dimensions has brought significant improvements in extracting feature dependencies and spatial structural relationships for various visual tasks. Therefore, this study combines the PSA module with the collaborative attention module SCSA to form the C2PSA_SCSA module, which improves the ability of the attention mechanism to learn different channel features, mitigates semantic differences, and facilitates semantic interaction.
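The cooperation between the two attention stages can be illustrated with the simplified sketch below: pooled H- and W-direction descriptors pass through shared multi-scale 1D depthwise convolutions to form a spatial attention map (the SMSA stage), and a single-head self-attention over the channel dimension recalibrates the result (the PCSA stage). The kernel sizes, the single attention head, and the final gating are simplifying assumptions, not the original SCSA implementation.

```python
import torch
import torch.nn as nn

class SimpleSCSA(nn.Module):
    """Simplified sketch of spatial-and-channel synergistic attention:
    multi-scale 1D depthwise convolutions build a spatial attention map (SMSA stage),
    then channel self-attention recalibrates the features (PCSA stage)."""

    def __init__(self, channels: int, kernels=(3, 5, 7, 9)):
        super().__init__()
        assert channels % len(kernels) == 0
        c_sub = channels // len(kernels)
        # shared 1D depthwise convolutions with different receptive fields
        self.dw_convs = nn.ModuleList(
            nn.Conv1d(c_sub, c_sub, k, padding=k // 2, groups=c_sub) for k in kernels
        )
        self.norm = nn.GroupNorm(4, channels)
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def _smsa_1d(self, feat_1d):
        # feat_1d: (B, C, L) descriptor pooled along one spatial dimension
        chunks = feat_1d.chunk(len(self.dw_convs), dim=1)
        return torch.cat([conv(c) for conv, c in zip(self.dw_convs, chunks)], dim=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # --- SMSA: spatial attention from H- and W-direction descriptors ---
        attn_h = self._smsa_1d(x.mean(dim=3))                 # (B, C, H)
        attn_w = self._smsa_1d(x.mean(dim=2))                 # (B, C, W)
        spatial = torch.sigmoid(attn_h).unsqueeze(3) * torch.sigmoid(attn_w).unsqueeze(2)
        x = self.norm(x * spatial)
        # --- PCSA: single-head self-attention along the channel dimension ---
        q = self.q(x).flatten(2)                              # (B, C, N)
        k = self.k(x).flatten(2)
        v = self.v(x).flatten(2)
        attn = torch.softmax(q @ k.transpose(1, 2) / (q.shape[-1] ** 0.5), dim=-1)  # (B, C, C)
        out = (attn @ v).view(b, c, h, w)
        return x * torch.sigmoid(out.mean(dim=(2, 3), keepdim=True))  # channel-wise gating
```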

2.2.5. Model Evaluation Index

Evaluation metrics are important tools for quantitatively assessing model performance. To comprehensively evaluate the performance of the model in the target detection task for different apple varieties, precision (P), recall (R), mean average precision at an IoU threshold of 0.5 (mAP50), mean average precision over IoU thresholds from 0.5 to 0.95 (mAP50-95), frames per second (FPS), and the weight size of the model (MB) were adopted in this paper.
Precision (P) measures the accuracy of the model's recognition results; in this study, P denotes the proportion of samples predicted to be apples that are actually apples. High precision means that the model rarely incorrectly predicts the background as an object, as shown in Equation (3).
$$P = \frac{TP}{TP + FP}$$
where TP denotes the correctly detected objects and FP denotes the incorrect detection of background regions as objects.
Recall (R) is the proportion of all real apples that are successfully detected by the model. With a high recall, the model is able to detect as many objects as possible, but this may come at the cost of more false positives (FP). As shown in Equation (4), false negatives (FN) denote the missed objects.
$$R = \frac{TP}{TP + FN}$$
mAP (mean average precision) is the most commonly used comprehensive index in object detection, measuring the accuracy of the model over multiple categories and IoU thresholds. Especially when there are many target objects, it can comprehensively measure the model's performance in each category. AP is the average precision of a single category under a given IoU threshold, and mAP is the average of the APs of all categories. We usually report mAP50 (the mAP at an IoU threshold of 0.5, which focuses on how well the detected boxes match the ground-truth boxes) and mAP50-95 (the mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which reflects performance under different IoU thresholds). mAP50-95 more fully reflects the average performance of the model across IoU thresholds. These two indicators are calculated as shown in Equations (5) and (6).
$$\mathrm{mAP} = \frac{1}{n} \sum_{i=1}^{n} AP_i$$
$$AP = \int_{0}^{1} P(R)\, dR$$
where n represents the number of classes in the dataset, and APi represents the average precision of class i.
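As a quick illustration of Equations (3)-(6), the following Python snippet computes precision and recall from detection counts and approximates AP as the area under a precision-recall curve; the per-class AP values and detection counts at the end are hypothetical.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall as defined in Equations (3) and (4)."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP as the area under the precision-recall curve (Equation (6)),
    approximated by numerical integration over sorted recall values."""
    order = np.argsort(recall)
    r, p = recall[order], precision[order]
    r = np.concatenate(([0.0], r, [1.0]))
    p = np.concatenate(([p[0]], p, [0.0]))
    return float(np.trapz(p, r))

# mAP (Equation (5)) as the mean of per-class APs; the values below are hypothetical
ap_per_class = [0.97, 0.96, 0.98]           # e.g., one AP per apple variety
map50 = sum(ap_per_class) / len(ap_per_class)
print(precision_recall(tp=95, fp=5, fn=8))  # illustrative counts
print(map50)
```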
Frames per second (FPS) is an important indicator of the model's recognition and detection speed, reflecting the number of images the model processes per second; the larger the FPS, the faster the model. It is calculated as:
$$FPS = \frac{1}{\text{Inference Time}}$$
In addition, the efficiency evaluation metrics of the model include memory footprint, number of parameters, computational complexity (GFLOPs, giga floating-point operations), and inference time, which are used to comprehensively evaluate the model's demand for hardware resources and its inference speed.

3. Model Training Results

3.1. Experimental Environment

The improved YOLO11 model in this study was implemented and trained using the open-source deep learning framework PyTorch under Python 3.10. All experiments were conducted on a computer running the Windows 10 Professional operating system. The hardware configuration included an NVIDIA GeForce RTX 4060 Ti GPU, 64 GB of RAM, and an Intel Xeon E5-2686 v4 CPU running at 2.30 GHz. To accelerate data processing and model training, CUDA (version 12.1) and cuDNN (version 8.8.1) were configured to fully utilize GPU parallel computing, effectively supporting the multi-variety apple multi-target detection task. The input image size for all experiments was set to 640 × 640 pixels.
During model training, the initial learning rate was set to 0.01, and the batch size was set to 200. The IoU Loss function was employed to guide the optimization process. A stochastic gradient descent (SGD) optimizer with a momentum of 0.937 was used, and a weight decay coefficient of 0.0005 was applied to enhance the generalization ability of the model and mitigate overfitting.
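For reference, a training run with the hyperparameters listed above could be launched through the Ultralytics API roughly as follows; the model and dataset configuration file names are hypothetical placeholders.

```python
from ultralytics import YOLO

# Minimal training sketch using the hyperparameters reported above.
# "yolo-was.yaml" and "apples.yaml" are hypothetical placeholders for the modified
# model configuration and the multi-variety apple dataset description.
model = YOLO("yolo-was.yaml")
model.train(
    data="apples.yaml",        # dataset split into train/test/val at 8:1:1 (Section 2.1.2)
    imgsz=640,                 # 640 x 640 input size
    batch=200,                 # batch size used in this study
    optimizer="SGD",
    lr0=0.01,                  # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
    device=0,                  # single GPU
)
```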

3.2. Ablation Test

The ablation experiments were designed to investigate the impact of the proposed modules on target detection for multiple apple varieties. To verify the effectiveness of each improvement, a series of ablation experiments was conducted on the multi-variety apple dataset. In this study, the YOLO11 model was used as the basis, and the C3K2_WT, ADown, and C2PSA_SCSA modules were combined to enhance the performance of the apple target detection model, which was then evaluated with several performance metrics. The comparison between the improved models and the original model is shown in Table 1.
In Table 1, "-" indicates that a module is not used, and "√" indicates that it is used. After replacing the ordinary convolution in C3K2 with the wavelet convolution WTConv, all four metrics improve, with P, R, mAP50, and mAP50-95 increasing to 0.942, 0.894, 0.955, and 0.803, respectively, while the weights and the parameter count decrease slightly by 1.8% and 2.2%, respectively. Replacing the ordinary convolution in the YOLO11 model with the ADown module not only exploits its downsampling function to decrease the parameters, GFLOPs, and model weight size by 18.7%, 19.1%, and 18.2%, respectively, but also improves the model's P to 0.957, R to 0.896, and mAP50 and mAP50-95 to 0.96 and 0.805, respectively, indicating that the ADown module not only reduces the number of parameters and the computational complexity of the model but also enhances its performance. With the introduction of the collaborative attention module SCSA, the P of the model increases to 0.936, R increases to 0.884, mAP50 and mAP50-95 reach 0.953 and 0.801, respectively, and the number of parameters and the weight size are reduced by 1.9% and 1.8%, showing that this attention mechanism improves accuracy while reducing model complexity.
In addition, when the C3K2_WT and ADown modules were introduced together, the number of parameters and the weight size decreased compared with adding either module alone, making the model lighter, although the P value decreased slightly. However, R, mAP50, and mAP50-95 improved to 0.902, 0.964, and 0.810, respectively. The performance was further improved in the combined experiment with C3K2_WT, ADown, and C2PSA_SCSA: P, R, mAP50, and mAP50-95 rose to 0.958, 0.921, 0.97, and 0.835, respectively, and the number of parameters, GFLOPs, and weight size decreased by 22.8%, 20.6%, and 21.8%, respectively, outperforming any single or pairwise combination. Taken together, these results indicate that the combined improvement has a significant positive impact on model performance.

3.3. Comparative Experiments on Attention Mechanisms

To enhance the model’s ability to focus on critical features, we introduced the SCSA attention mechanism into the C2PSA structure. To validate its effectiveness, we conducted comparative experiments by embedding several mainstream attention mechanisms, Spatial Hierarchical Self-Attention (SHSA), Squeeze-and-Excitation Attention Module (SEAM), Multi-Level Coordinate Attention (MLCA), and Convolutional Block Attention Module (CBAM) into the same C2PSA module. These mechanisms aim to adaptively emphasize important information in either the spatial or channel domain, thereby improving the model’s ability to distinguish foreground objects from the background, particularly in challenging scenarios such as occlusions and small object detection.
As shown in Table 2, all modified models outperform the original YOLO11 across key evaluation metrics, confirming the effectiveness of combining attention mechanisms with the C2PSA structure. Among them, YOLO11-SCSA achieves the best overall performance, with a Precision of 0.936, Recall of 0.884, mAP@0.5 of 0.953, and mAP@0.5:0.95 of 0.801, representing improvements of 4.4%, 7.6%, 5.3%, and 8.1%, respectively, over the baseline YOLO11. While other attention mechanisms, such as CBAM and SEAM, also yield noticeable improvements, SCSA demonstrates superior ability in enhancing feature discrimination by jointly capturing spatial and channel dependencies.
In summary, the C2PSA structure exhibits strong generality and extensibility when integrated with various attention mechanisms. In particular, the integration of the SCSA module leads to a significant performance boost, offering a promising direction for further optimization of lightweight and accurate object detection networks.

3.4. Performance Comparison of Different Models

Currently, the two most mainstream frameworks in object detection research are YOLO and DETR. This study evaluated the performance of the YOLO-WAS model against baseline models from the YOLO series, MobileNetv2, and RTDETR-resnet18 in the identification and detection of three different apple varieties. Table 3 shows the detection accuracy, number of parameters, GFLOPs, and weight size of the different models, and Figure 7 shows the accuracy trends of the different models.
As shown in Table 3, the RTDETR-resnet18 model has a P value of 0.897, an R value of 0.808, an mAP50 of 0.9, and an mAP50-95 of 0.72; its accuracy indicators are the lowest among all models, and it also has the most parameters. MobileNetv2 and YOLOv7-Tiny achieve the same precision (0.927), but the FPS of both models is relatively low and their overall performance is relatively poor. Comparing the detection accuracy of the YOLO models, YOLOv8n and YOLO11n show relatively high accuracy: the P value of the YOLOv8n model is 0.936, R is 0.860, mAP50 is 0.945, and mAP50-95 is 0.785, and the accuracy of the YOLO11n model is similar to that of YOLOv8n, at 0.934, 0.870, 0.945, and 0.784, respectively. However, the parameter count of the YOLOv8n model is higher, reaching 3,066,233, and its GFLOPs and weight are also higher, while the parameter count, GFLOPs, and weight of YOLO11n are at a medium level. The significant advantage of the YOLOv9t model is its low parameter count and weight size, at 1,971,369 and 4.7 MB, with medium computational complexity; however, this lightweight design also leads to low accuracy, with P, R, mAP50, and mAP50-95 of 0.919, 0.847, 0.937, and 0.773, respectively, below the average level. The parameter count, GFLOPs, and weight of the YOLOv10n model are also at a medium level, and its accuracy is low (P value of 0.919). In contrast, the YOLO-WAS model performs best in the apple detection task, with P, R, mAP50, and mAP50-95 of 0.958, 0.921, 0.970, and 0.835, respectively, much higher than those of the other models. YOLO-WAS is also the lightest model overall: although its parameter count is 1.08% higher than that of the YOLOv9t model, its GFLOPs and weight are 34.2% and 8.5% lower than those of YOLOv9t. In the sample detection images, the red boxes mark cases of false detection and missed detection; it can be seen that the proposed YOLO-WAS model has the best comprehensive performance and the most prominent advantages (Figure 8). Various evaluation metrics can be derived from the confusion matrix. Figure 9 presents the confusion matrices of the different models, which visually compare the predicted labels with the ground truth to illustrate detailed classification results. Among these models, YOLO-WAS demonstrates the best performance, achieving the most accurate classification.
Figure 7 shows the change in accuracy of each model as the number of iterations increases. The accuracy of each model increases with the number of iterations, but the YOLO-WAS model improves smoothly while maintaining its lead. In conclusion, the analysis shows that appropriate adjustment and optimization of the model structure can greatly improve model performance: by integrating the C3K2_WT, ADown, and C2PSA_SCSA modules, the model combines their advantages and achieves high accuracy while remaining lightweight. This combination of lightweight design and high accuracy not only improves the adaptability and practicality of the model but also provides new insights for agricultural automation and smart agriculture.

4. Discussion

A lightweight multi-variety apple detection model, YOLO-WAS, based on an improved YOLO11, is proposed in this study and achieves good detection results on a self-built dataset. The model is based on YOLO11n: the C3K2 and C2PSA modules in the backbone are improved, and the ADown module is used to replace the ordinary convolution modules of YOLO11. These optimizations significantly improve the model's P, R, mAP50, and mAP50-95 and effectively reduce the number of parameters, the amount of floating-point computation, and the weight size of the model, achieving a balance between lightweight design and accuracy improvement.
First, the C3K2 module in the original backbone is combined with the wavelet convolution WTConv to form the C3K2_WT module, which uses the wavelet transform to address the over-parameterization problem encountered by convolutional neural networks (CNNs) when realizing a large receptive field, improving accuracy to a certain degree while slightly decreasing the number of parameters. The ADown module proposed in YOLOv9 is introduced to replace the ordinary convolution modules of YOLO11: in the backbone, it can be used to downsample between different layers of the feature map, while in the neck it helps to further refine the resolution of the feature map for more accurate target detection, which not only improves accuracy but also reduces model complexity by reducing the number of parameters. Attention to channel and spatial multisemantic information brings significant improvements in extracting feature dependencies and spatial structural relationships for various visual tasks; the introduction of the spatial and channel synergistic attention module SCSA improves the C2PSA attention mechanism of YOLO11 to form the C2PSA_SCSA module, which effectively combines the advantages of spatial and channel attention and further improves the model's detection accuracy for different apple varieties.
In the comparative model analysis, the YOLO-WAS model performs well in detecting multiple apple varieties, with accuracy significantly better than that of the other models, and achieves a certain degree of lightweighting, which lays the groundwork for subsequent machine deployment in orchards. Although the overall results of the model are outstanding, there are limitations: the selected scenes are two picking orchards with three varieties located in Luoyang City, so the dataset scenes and varieties are fixed, and the lack of images of many other apple varieties restricts experiments on a wider dataset. In practical applications, owing to the complex and changing orchard environment, the model may face situations not anticipated in this study, such as various weather conditions, and misdetection and missed detection may occur, which could affect the detection results. Despite these limitations, based on the general properties of convolutional neural networks, we remain confident in the algorithm for the apple varieties involved in this study.
In future research, we plan to integrate multi-variety apple detection technology closely with automated orchard management and to introduce an expanded dataset with samples from multiple regions, climatic conditions, and maturity stages. We will also consider more systematic protocols with multiple trials and statistical tests, test the applicability and generalization ability of the model on other apple varieties as well as other fruits and vegetables, and aim to optimize the model for a wider range of complex orchard scenarios.

5. Conclusions

A lightweight apple detection model based on YOLO11 was proposed in this study. The C3K2_WT, ADown, and C2PSA_SCSA modules are combined to form the YOLO-WAS model, which improves accuracy to a certain degree while reducing the number of parameters and model complexity. The YOLO-WAS model demonstrates favorable performance in multi-variety apple detection tasks, maintaining high accuracy while adopting a lightweight design. It achieves a precision of 0.958, a recall of 0.921, an mAP@50 of 0.970, and an mAP@50:95 of 0.835, all higher than those of the comparison models. Furthermore, the model exhibits reduced computational complexity and a smaller model size, showing potential for real-time deployment in resource-constrained environments and contributing to the advancement of automated orchard management.

Author Contributions

Conceptualization, X.Z.; methodology, X.Z. and T.L.; software, X.Z. and X.D.; validation, X.D., X.C. and H.W.; formal analysis, T.L. and X.D.; investigation, X.C. and T.L.; resources, X.D. and X.Y.; data curation, X.C. and H.W.; writing—original draft preparation, X.Z.; writing—review and editing, X.Y. and X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Longmen Laboratory Project (Grant No. LMFKCY2023001) and the National Natural Science Foundation of China (Grant No. 52075150).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Wang, C.L.; Pan, W.Y.; Zou, T.L.; Li, C.J.; Han, Q.Y.; Wang, H.M.; Yang, J.; Zou, X.J. A Review of Perception Technologies for Berry Fruit-Picking Robots: Advantages, Disadvantages, Challenges, and Prospects. Agriculture 2024, 14, 1346. [Google Scholar] [CrossRef]
  2. Hua, W.J.; Zhang, Z.; Zhang, W.Q.; Liu, X.H.; Hu, C.; He, Y.C.; Mhamed, M.; Li, X.L.; Dong, H.X.; Saha, C.K.; et al. Key technologies in apple harvesting robot for standardized orchards: A comprehensive review of innovations, challenges, and future directions. Comput. Electron. Agric. 2025, 235, 110343. [Google Scholar] [CrossRef]
  3. Wei, J.; Yi, D.; Bo, X.; Guangyu, C.Y.; Dean, Z. Adaptive Variable Parameter Impedance Control for Apple Harvesting Robot Compliant Picking. Complexity 2020, 2020, 4812657. [Google Scholar] [CrossRef]
  4. Li, J.; Karkee, M.; Zhang, Q.; Xiao, K.H.; Feng, T. Characterizing apple picking patterns for robotic harvesting. Comput. Electron. Agric. 2016, 127, 633–640. [Google Scholar] [CrossRef]
  5. Hu, G.R.; Zhou, J.G.; Chen, Q.Y.; Luo, T.Y.; Li, P.H.; Chen, Y.; Zhang, S.; Chen, J. Effects of different picking patterns and sequences on the vibration of apples on the same branch. Biosyst. Eng. 2024, 237, 26–37. [Google Scholar] [CrossRef]
  6. Xin, Q.; Luo, Q.; Zhu, H. Key Issues and Countermeasures of Machine Vision for Fruit and Vegetable Picking Robot. Adv. Transdiscipl. Eng. 2024, 46, 69–78. [Google Scholar] [CrossRef]
  7. Li, X.; Wang, W.H.; Wu, L.J.; Chen, S.; Hu, X.L.; Li, J.; Tang, J.H.; Yang, J. Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), Electr. Network, Virtual, 6–12 December 2020. [Google Scholar]
  8. Chen, Y.; Chen, B.B.; Li, H.T. Object Identification and Location Used by the Fruit and Vegetable Picking Robot Based on Human-decision Making. In Proceedings of the 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), E China Normal University, Shanghai, China, 14–16 October 2017. [Google Scholar]
  9. Wu, Y.; Wan, X.; Zhang, J.; Yang, Y. Research on fruit picking recognition based on deep learning. In Proceedings of the Optoelectronic Imaging and Multimedia Technology X 2023, Beijing, China, 15–16 October 2023; Chinese Optical Society (COS): Beijing, China; The Society of Photo-Optical Instrumentation Engineers (SPIE): Bellingham, WA, USA, 2023. [Google Scholar]
  10. Chu, P.Y.; Li, Z.J.; Lammers, K.; Lu, R.F.; Liu, X.M. Deep learning-based apple detection using a suppression mask R-CNN. Pattern Recognit. Lett. 2021, 147, 206–211. [Google Scholar] [CrossRef]
  11. Nan, Y.L.; Zhang, H.C.; Zeng, Y.; Zheng, J.Q.; Ge, Y.F. Intelligent detection of Multi-Class pitaya fruits in target picking row based on WGB-YOLO network. Comput. Electron. Agric. 2023, 208, 107780. [Google Scholar] [CrossRef]
  12. Zhang, J.; Kang, N.B.; Qu, Q.J.; Zhou, L.H.; Zhang, H.B. Automatic fruit picking technology: A comprehensive review of research advances. Artif. Intell. Rev. 2024, 57, 54. [Google Scholar] [CrossRef]
  13. Huang, J.; Lan, H. Multi-type fruit picking image recognition method based on deep learning. In Proceedings of the 2021 International Conference on Internet of Things and Machine Learning, IoTML 2021, Dalian, China, 17–19 December 2021; Academic Exchange Information Center (AEIC): Guangzhou, China, 2022. [Google Scholar]
  14. Li, Z.; Yuan, X.; Wang, C. A review on structural development and recognition–localization methods for end-effector of fruit–vegetable picking robots. Int. J. Adv. Robot. Syst. 2022, 19. [Google Scholar] [CrossRef]
  15. Rana, S.; Gerbino, S.; Sekehravani, E.A.; Russo, M.B.; Carillo, P. Crop Growth Analysis Using Automatic Annotations and Transfer Learning in Multi-Date Aerial Images and Ortho-Mosaics. Agronomy 2024, 14, 2052. [Google Scholar] [CrossRef]
  16. Wei, X.Q.; Jia, K.; Lan, J.H.; Li, Y.W.; Zeng, Y.L.; Wang, C.M. Automatic method of fruit object extraction under complex agricultural background for vision system of fruit picking robot. Optik 2014, 125, 5684–5689. [Google Scholar] [CrossRef]
  17. Moallem, P.; Serajoddin, A.; Pourghassem, H. Computer vision-based apple grading for golden delicious apples based on surface features. Inf. Process. Agric. 2017, 4, 33–40. [Google Scholar] [CrossRef]
  18. Liu, Q.; Cao, C.Y.; Zhang, X.D.; Li, K.; Xu, W.L. Design of Strawberry Picking Hybrid Robot Based on Kinect Sensor. In Proceedings of the International Conference on Sensing, Diagnostics, Prognostics and Control (SDPC), Xi’an, China, 15–17 August 2018; pp. 248–251. [Google Scholar]
  19. Tang, Y.C.; Qiu, J.J.; Zhang, Y.Q.; Wu, D.X.; Cao, Y.H.; Zhao, K.X.; Zhu, L.X. Optimization strategies of fruit detection to overcome the challenge of unstructured background in field orchard environment: A review. Precis. Agric. 2023, 24, 1183–1219. [Google Scholar] [CrossRef]
  20. Wang, Z.H.; Xun, Y.; Wang, Y.K.; Yang, Q.H. Review of smart robots for fruit and vegetable picking in agriculture. Int. J. Agric. Biol. Eng. 2022, 15, 33–54. [Google Scholar] [CrossRef]
  21. Liu, S.H.; Xue, J.L.; Zhang, T.Y.; Lv, P.F.; Qin, H.H.; Zhao, T.X. Research progress and prospect of key technologies of fruit target recognition for robotic fruit picking. Front. Plant Sci. 2024, 15, 1423338. [Google Scholar] [CrossRef]
  22. Bedi, P.; Gole, P.; Marwaha, S. PDSE-Lite: Lightweight framework for plant disease severity estimation based on Convolutional Autoencoder and Few-Shot Learning. Front. Plant Sci. 2024, 14, 1319894. [Google Scholar] [CrossRef] [PubMed]
  23. Nasiri, A.; Taheri-Garavand, A.; Zhang, Y.D. Image-based deep learning automated sorting of date fruit. Postharvest Biol. Technol. 2019, 153, 133–141. [Google Scholar] [CrossRef]
  24. Wu, H.T.; Mo, X.T.; Wen, S.J.; Wu, K.L.; Ye, Y.; Wang, Y.M.; Zhang, Y.H. DNE-YOLO: A method for apple fruit detection in Diverse Natural Environments. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102220. [Google Scholar] [CrossRef]
  25. Liu, Z.F.; Abeyrathna, R.; Sampurno, R.M.; Nakaguchi, V.M.; Ahamed, T. Faster-YOLO-AP: A lightweight apple detection algorithm based on improved YOLOv8 with a new efficient PDWConv in orchard. Comput. Electron. Agric. 2024, 223, 109118. [Google Scholar] [CrossRef]
  26. Shi, B.X.; Hou, C.K.; Xia, X.L.; Hu, Y.H.; Yang, H. Improved young fruiting apples target recognition method based on YOLOv7 model. Neurocomputing 2025, 623, 129186. [Google Scholar] [CrossRef]
  27. Wang, J.X.; Su, Y.H.; Yao, J.H.; Liu, M.; Du, Y.R.; Wu, X.; Huang, L.; Zhao, M.H. Apple rapid recognition and processing method based on an improved version of YOLOv5. Ecol. Inform. 2023, 77, 102196. [Google Scholar] [CrossRef]
  28. Bedi, P.; Gole, P. PlantGhostNet: An Efficient Novel Convolutional Neural Network Model to Identify Plant Diseases Automatically. In Proceedings of the 9th IEEE International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions), ICRITO 2021, Noida, India, 3–4 September 2021. [Google Scholar]
  29. Lin, Y.T.; Xia, Y.J.; Xia, P.C.; Liu, Z.Y.; Wang, H.D.; Qin, C.J.; Gong, L.; Liu, C.L. YOLO11-ARAF: An Accurate and Lightweight Method for Apple Detection in Real-World Complex Orchard Environments. Agriculture 2025, 15, 1104. [Google Scholar] [CrossRef]
  30. Luo, W.J.; Li, Y.J.; Urtasun, R.; Zemel, R. Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. In Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  31. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet Convolutions for Large Receptive Fields. In Proceedings of the 18th European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 363–380. [Google Scholar]
  32. Wu, W.Y.; Cheng, H.Y.; Pan, J.C.; Zhong, L.L.; Zhang, Q.C. Wavelet-Enhanced YOLO for Intelligent Detection of Welding Defects in X-Ray Films. Appl. Sci. 2025, 15, 4586. [Google Scholar] [CrossRef]
  33. Zhou, D.X. Theory of deep convolutional neural networks: Downsampling. Neural Netw. 2020, 124, 319–327. [Google Scholar] [CrossRef]
  34. Gu, W.J.; Gao, W.Q.; Zou, Y.; Ma, S.Y. ATW-YOLO: Reconstructing the downsampling process and attention mechanism of yolo network for rail foreign body detection. Signal Image Video Process. 2025, 19, 368. [Google Scholar] [CrossRef]
  35. Yang, L.S.; Zhang, T.; Zhou, S.H.; Guo, J.T. AAB-YOLO: An Improved YOLOv11 Network for Apple Detection in Natural Environments. Agriculture 2025, 15, 836. [Google Scholar] [CrossRef]
  36. Liu, J.X.; Zhou, R.G.; Li, Y.C.; Ren, P.J. Enhanced underwater object detection with YOLO-LDFE: A model for improved accuracy with balanced efficiency. J. Real-Time Image Process. 2025, 22, 58. [Google Scholar] [CrossRef]
  37. Si, Y.; Xu, H.; Zhu, X.; Zhang, W.; Dong, Y.; Chen, Y.; Li, H. SCSA: Exploring the Synergistic Effects Between Spatial and Channel Attention. arXiv 2024, arXiv:2407.05128. [Google Scholar] [CrossRef]
  38. Deng, Y.; Huang, L.D.; Gan, X.S.; Lu, Y.F.; Shi, S.X. A heterogeneous attention YOLO model for traffic sign detection. J. Supercomput. 2025, 81, 765. [Google Scholar] [CrossRef]
  39. Liu, C.; Yang, D.G.; Tang, L.; Zhou, X.; Deng, Y. A Lightweight Object Detector Based on Spatial-Coordinate Self-Attention for UAV Aerial Images. Remote Sens. 2023, 15, 83. [Google Scholar] [CrossRef]
Figure 1. The shooting process and apple images under different conditions.
Figure 2. Sample image of dataset after data enhancement.
Figure 3. YOLO-WAS model architecture.
Figure 4. Network structure of C3K2 architecture and C3K2_WT module.
Figure 5. ADown network structure diagram. Note: h is the height of the feature map, w is the width, and c is the number of channels.
Figure 6. Overall SCSA architecture.
Figure 7. Variation trend of accuracy of different models.
Figure 8. Different model detection effect diagrams.
Figure 9. Confusion matrices of different models.
Table 1. Comparison of ablation performance. "-" indicates the module is not used; "√" indicates it is used.

| C3K2_WT | ADown | C2PSA_SCSA | P | R | mAP50 | mAP50-95 | Parameters | GFLOPs | Speed (FPS) |
|---|---|---|---|---|---|---|---|---|---|
| - | - | - | 0.934 | 0.870 | 0.945 | 0.784 | 2,582,737 | 6.3 | 192.307 |
| √ | - | - | 0.942 | 0.894 | 0.955 | 0.803 | 2,523,385 | 6.3 | 185.185 |
| - | √ | - | 0.951 | 0.896 | 0.96 | 0.805 | 2,100,177 | 5.1 | 243.902 |
| - | - | √ | 0.936 | 0.884 | 0.953 | 0.801 | 2,534,865 | 6.3 | 192.307 |
| √ | √ | - | 0.949 | 0.902 | 0.964 | 0.810 | 2,040,825 | 5.1 | 232.558 |
| √ | - | √ | 0.936 | 0.89 | 0.957 | 0.802 | 2,475,513 | 6.2 | 153.846 |
| - | √ | √ | 0.941 | 0.908 | 0.96 | 0.813 | 2,052,305 | 5.1 | 227.273 |
| √ | √ | √ | 0.958 | 0.921 | 0.970 | 0.835 | 1,992,953 | 5.0 | 243.902 |
Table 2. Comparative experiments of different attention mechanisms.

| Model | P | R | mAP50 | mAP50-95 |
|---|---|---|---|---|
| YOLO11 | 0.934 | 0.870 | 0.945 | 0.784 |
| YOLO11-SHSA | 0.921 | 0.881 | 0.948 | 0.788 |
| YOLO11-SEAM | 0.926 | 0.876 | 0.945 | 0.793 |
| YOLO11-MLCA | 0.935 | 0.876 | 0.95 | 0.787 |
| YOLO11-CBAM | 0.936 | 0.880 | 0.952 | 0.799 |
| YOLO11-SCSA | 0.936 | 0.884 | 0.953 | 0.801 |
Table 3. Performance comparison results of different models.

| Model | P | R | mAP50 | mAP50-95 | Parameters | GFLOPs | Speed (FPS) |
|---|---|---|---|---|---|---|---|
| RTDETR-resnet18 | 0.897 | 0.808 | 0.9 | 0.72 | 21,799,409 | 52.3 | 100.692 |
| MobileNetv2 | 0.927 | 0.834 | 0.922 | 0.744 | 4,757,846 | 10.2 | 120.320 |
| YOLOv7-Tiny | 0.927 | 0.870 | 0.944 | 0.790 | 6,007,596 | 13.1 | 50.505 |
| YOLOv8n | 0.936 | 0.860 | 0.945 | 0.785 | 3,066,233 | 8.1 | 200.000 |
| YOLOv9t | 0.919 | 0.847 | 0.937 | 0.773 | 1,971,369 | 7.6 | 200.000 |
| YOLOv10n | 0.919 | 0.853 | 0.935 | 0.781 | 2,695,586 | 8.2 | 250.000 |
| YOLO11n | 0.934 | 0.870 | 0.945 | 0.784 | 2,582,737 | 6.3 | 192.307 |
| YOLO11-WAS | 0.958 | 0.921 | 0.970 | 0.835 | 1,992,953 | 5.0 | 243.902 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
