Walnut Recognition Method for UAV Remote Sensing Images

Abstract: During walnut identification and counting using UAVs in hilly areas, the complex lighting conditions on the surface of walnuts degrade the detection performance of deep learning models. To address this issue, we propose a lightweight small-object recognition method for walnuts called w-YOLO. We reconstructed the feature extraction network and feature fusion network of the model to reduce its size and complexity. Additionally, to improve the recognition accuracy of walnut objects under complex lighting conditions, we adopted an attention mechanism detection layer and redesigned a set of detection heads more suitable for small walnut objects. A series of experiments showed that when identifying walnut objects in UAV remote sensing images, w-YOLO outperforms other mainstream object detection models, achieving a mean Average Precision (mAP0.5) of 97% and an F1-score of 92%, with parameters reduced by 52.3% compared to the YOLOv8s model. The method effectively addresses the identification of walnut targets in Yunnan, China, under complex lighting conditions.


Introduction
The walnut, scientifically known as Juglans regia, is a nut tree belonging to the family Juglandaceae. Walnuts are rich in protein, unsaturated fatty acids, vitamins, and minerals beneficial to human health [1]. Currently, there are 21 species of walnuts distributed across the West Indies, Southern Europe, Asia, Central America, North America, and western South America [2]. China is the world's largest walnut producer, accounting for over half of global walnut production. The main walnut-producing regions in China include Yunnan Province (880,000 tons), Xinjiang Uygur Autonomous Region (440,000 tons), Sichuan Province (300,000 tons), and Shaanxi Province (200,000 tons), among others [3]. In Yunnan Province, the largest walnut-producing region, the primary variety is the Deep-ridged walnut, a variety unique to southwestern China. When Deep-ridged walnuts mature in autumn, their shells harden and change color, and the kernels complete their development. Farmers harvest them during the ripe period to ensure optimal taste and flavor. However, the ripe period of walnuts is short, and ripe walnuts are prone to oxidation and spoilage. Additionally, overripe walnut kernels tend to stick tightly to the shell, which increases the difficulty of post-harvest processing. Currently, there are two main challenges in walnut production management. First, walnuts are harvested mainly by hand, which is very inefficient and leaves many walnuts to become overripe and rot on the trees. Second, most walnut trees in Yunnan are planted in hilly areas with complex terrain and uneven distribution, making manual counting of walnut fruits extremely difficult.
In recent years, agricultural digitization has continuously improved, promoting the rational use of modern production technologies alongside traditional agricultural production elements, which plays a crucial role in adjusting agricultural production methods and achieving precision agriculture [4,5]. Agricultural digitization refers to the use of advanced technologies such as big data [6], machine learning [7,8], the Internet of Things [9], and deep learning [10][11][12] in the agricultural production process. Shantam Shorewala et al. [13] proposed a semi-supervised decision method that estimates the density and distribution of weeds from color images in order to locate weeds in fields. Validation results demonstrate that the method generalizes well to different plant species, achieving a maximum recall of 0.99 and a maximum accuracy of 82.13%. Cheng et al. used a deep residual network to detect pests in fields with complex backgrounds. Experimental results showed that the accuracy of this method was higher than that of support vector machines, backpropagation neural networks, and traditional convolutional neural networks (CNNs). However, the network structure of ResNet is relatively complex and requires more computation [14]. Behroozi-Khazaei et al. combined artificial neural networks (ANN) with genetic algorithms (GA) to segment grape clusters similar in color to the background and leaves. Although the improved algorithm can automatically detect grape clusters in images and effectively predict yields, detection remains challenging when there is little color difference between grape clusters and leaves [15]. Juan Ignacio Arribas et al. segmented RGB images to separate sunflower leaves from the background and then used a Generalized Softmax Perceptron (GSP) neural network architecture combined with a Posterior Probability Model Selection (PPMS) algorithm to classify sunflower leaves and weeds. However, classification accuracy may be affected when lighting conditions are complex [16]. In summary, existing algorithms still face challenges such as high computational complexity. Additionally, when background and object features are too similar, models may struggle to meet expectations for crop detection.
With the development of deep learning, object detection algorithms have been widely applied in various fields, including remote sensing [17,18], urban data analysis [19], agricultural production [20], embedded development [21], and multispectral image detection [22]. Object detection algorithms are mainly divided into two-stage and one-stage algorithms. Two-stage algorithms include R-CNN [23], Fast R-CNN [24], Faster R-CNN [25], Mask R-CNN [26], etc. These algorithms classify objects based on pre-generated candidate regions, and their detection accuracy is usually higher. However, because of the multi-stage processing they require, their complexity is relatively high, their real-time performance is poor, and they demand more capable hardware. To simplify the cumbersome detection process of two-stage algorithms, one-stage detection algorithms have been proposed. Joseph Redmon et al. proposed a one-stage object detection algorithm called YOLO (You Only Look Once), which promoted the development of real-time object detection [27]. Subsequently, many researchers proposed improved one-stage detection algorithms, such as SSD [28], CenterNet [29], YOLOv3 [30], YOLOv7 [31], etc. Chen et al. proposed an improved YOLOv4 model for detecting and counting bayberry trees in images captured by UAVs. Experimental results show that the improved model achieves higher recall while maintaining accuracy [32]. Hao et al. improved the YOLOv3 algorithm for detecting green walnuts. This algorithm utilizes Mixup data augmentation and introduces the lightweight convolutional network MobileNet-v3. In the green walnut detection experiment, the model size is 88.6 MB, and the accuracy reaches 93.3% [33]. Zhong et al. conducted research on walnut recognition in natural environments. They improved the YOLOX algorithm using a Swin Transformer multi-feature fusion module. The improved model achieved an AP50 of 96.72% in natural environments, with 20.55 M parameters [34]. In Li et al.'s study, improving the feature fusion structure of the YOLOX model enhanced the model's ability to exchange local information in UAV remote sensing images, yielding stronger small object detection capabilities [35].
Considering the significant challenge of manually counting walnut fruits in hilly areas and recognizing the superiority of the YOLOv8s algorithm in object detection, we proposed the w-YOLO algorithm to address walnut fruit object detection in hilly terrain and under complex lighting conditions.
The contributions of our work can be summarized as follows:
1. We utilized UAVs to collect remote sensing images of walnut trees and established a representative dataset of small walnut targets. The dataset consists of 2490 images with a resolution of 640 × 640, containing a total of 12,138 walnut targets. This work fills the gap in walnut datasets and provides valuable data for walnut target detection and recognition under complex lighting conditions.
2. We improved the YOLOv8s model and designed the w-YOLO model, which includes a lightweight feature extraction network, a better-performing feature fusion network, and a new detection layer. These improvements aim to reduce the model's parameter count and the size of its weight file for deployment on edge computing devices. At the same time, they enhance the model's ability to capture walnut object features, making the model more suitable for walnut detection and recognition under different lighting conditions.
3. The w-YOLO model we designed recognizes small walnut targets under complex lighting conditions. It improves walnut detection accuracy, with a mAP0.5 of 97% and an F1-score of 92%. The parameter count decreased by 52.3%, and the model's weight file size was reduced by 50.7%. Its detection performance surpasses the baseline YOLOv8s and other mainstream object detection models, providing a valuable reference for walnut detection and management under complex lighting conditions.
The remainder of the paper is structured as follows. In Section 2, we provide an overview of our dataset and introduce the design details of the w-YOLO model. In Section 3, we conduct a series of experiments and analyze the results. Section 4 discusses in detail some factors influencing the w-YOLO model. Finally, in Section 5, we present our conclusions.

Materials and Methods
In this study, we first collected a large amount of walnut image data and preprocessed the remote sensing images. Then, we improved YOLOv8s and iteratively tuned the training parameters to obtain the final walnut detection model, w-YOLO. The basic process of the detection model is illustrated in Figure 1. The input image undergoes data augmentation to increase data diversity, and then passes through the backbone network for feature extraction. w-YOLO incorporates BiFPN [36] as the feature fusion network, followed by prediction on the fused features. We compared w-YOLO with other mainstream models and evaluated walnut recognition performance under facing light, side light, and backlight. The w-YOLO model achieves both reliable accuracy and the best recall rate. Therefore, this model can provide effective guidance and high-quality technical support for the management of walnut orchards.

Research Process
The process of our research work is illustrated in Figure 2. First, we used a DJI Matrice-300-RTK (DJI, Shenzhen, China) equipped with a Zenmuse P1 lens and collected walnut tree images using the "following terrain" flight mode. Then, we processed and cropped these images to create a dataset suitable for deep learning models. The dataset was divided into training (64%), validation (16%), and testing (20%) sets, and we trained a preliminary walnut detection model. Subsequently, we selected better parameter combinations to improve the detection performance of the model, resulting in the w-YOLO model. Finally, we evaluated the performance of the w-YOLO model through qualitative and quantitative analysis.

Figure 2. The workflow of the research in this paper. Walnut tree images are captured using UAVs in the nadir view, and after preprocessing, they form the dataset and labels. These images are then used to train the walnut detection model, and the best parameter combination is utilized to achieve optimal detection performance.

Study Area
The research site is located in Changning County (WGS 84: 25.024486° N, 99.773675° E), Baoshan City, Yunnan Province, China (Figure 3). Walnuts are a specialty of Changning County and are a Geographical Indication product of China. The area has an average altitude of 1875 m and belongs to a subtropical monsoon climate zone with abundant rainfall (annual precipitation ranging from 700 to 2100 mm), mild temperatures (annual average temperature of 14.8 to 21.3 °C), and long sunshine hours (annual average sunshine of 2335 h). This favorable climate is very suitable for the growth of walnut trees.

Dataset Production
We first used a sliding window to crop the 180 aerial images into tiles of 640 × 640 pixels. Then, we discarded the tiles that did not contain walnut trees, resulting in a dataset consisting of 2490 walnut images. Subsequently, we used LabelImg to annotate the walnut fruit objects. Detailed information on the walnut dataset we established is shown in Table 1.
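The sliding-window cropping step can be sketched as follows. This is a minimal illustration, not the authors' preprocessing script: the function names are hypothetical, the stride of 640 (no overlap) is an assumption because the text does not state one, and partial tiles at the image border are simply dropped.

```python
from typing import List, Tuple


def tile_origins(width: int, height: int, tile: int = 640,
                 stride: int = 640) -> List[Tuple[int, int]]:
    """Return the top-left (x, y) corner of every full tile inside the image.

    Assumption: tiles that would extend past the border are discarded.
    """
    if width < tile or height < tile:
        return []
    xs = range(0, width - tile + 1, stride)
    ys = range(0, height - tile + 1, stride)
    return [(x, y) for y in ys for x in xs]


def crop_boxes(width: int, height: int, tile: int = 640,
               stride: int = 640) -> List[Tuple[int, int, int, int]]:
    """Return (left, upper, right, lower) boxes, the box convention of PIL's Image.crop."""
    return [(x, y, x + tile, y + tile)
            for x, y in tile_origins(width, height, tile, stride)]


# A hypothetical 1280 x 1280 image yields four non-overlapping tiles.
print(crop_boxes(1280, 1280))
```

Each returned box could then be passed to an image library's crop routine, and tiles without walnut trees filtered out afterwards.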

A Detection Algorithm for Small Walnut Objects: w-YOLO
In this study, our aim was to design a lightweight walnut detection model without sacrificing detection accuracy, as measured by metrics such as P, R, F1-score, and mAP0.5. Building upon the advantages of YOLOv8s, we made improvements to it, resulting in the w-YOLO model. It is more suitable for detecting small walnut objects in images and can provide technical support for future real-time walnut fruit detection tasks.

YOLOv8s Model
YOLOv8s is a model from YOLOv8, the latest version of YOLO open-sourced by Ultralytics at the time of this study [37]. YOLOv8s mainly consists of three parts: Backbone, Neck, and Head. The Backbone serves as the feature extraction network and, like YOLOv5's Backbone, is based on CSPDarknet. Here, the input image undergoes initial feature extraction to form a set of image features. The Neck is a feature fusion network, which combines Feature Pyramid Network (FPN) [38] and Path Aggregation Network (PAN) [39] structures to fuse feature maps from different layers of the Backbone, enhancing detection accuracy and robustness. The Head uses a decoupled structure that separates the classification and regression branches, and it employs an anchor-free approach, unlike YOLOv5.

w-YOLO
To obtain the w-YOLO model (Figure 5), we first replaced the feature extraction network and the C2f structure in the Neck of YOLOv8s to make the model more lightweight. Second, we adopted a Weighted Bi-directional Feature Pyramid Network (BiFPN) in the feature fusion part to enhance the feature fusion capability of YOLOv8s. Third, because repeated downsampling of feature maps makes it difficult to capture important feature information, we introduced a self-attention dynamic detection head, DyHead [40]. Finally, considering the characteristics of small walnut objects, we added a detection head of size 160 × 160 and removed the detection head of size 20 × 20.

Lightweight Feature Extraction Backbone-FasterNet
Currently, several convolutional networks make deep learning models lightweight, such as MobileNet [41], ShuffleNet [42], and GhostNet [43]. They utilize depthwise convolution and group convolution to extract features, aiming to reduce computational complexity. However, operations like concatenation, shuffling, and pooling in these networks still contribute significantly to the runtime, which remains a problem for smaller models. Other lightweight networks, such as MobileViT [44] and MobileFormer [45], combine depthwise convolution (DWConv) with attention mechanisms to reduce computational complexity. Nevertheless, DWConv remains an obstacle to further lightweighting in such networks.
FasterNet (Figure 6) achieves a lightweight design by reducing memory access and computational redundancy in convolutions [46]. FasterNet consists mainly of Embedding layers, Merging layers, and FasterNet Blocks. The FasterNet Block relies on PConv (Partial Convolution) and PWConv (Point-Wise Convolution). PConv is an improvement over DWConv. DWConv utilizes multiple filters $w \in \mathbb{R}^{k \times k}$ to compute the output $O \in \mathbb{R}^{c \times h \times w}$, and its computational complexity is shown in Equation (1). In contrast, PConv performs convolution only on a subset of $c_p$ input channels while keeping the rest unchanged. The computational complexity of PConv is shown in Equation (2).
When the typical ratio $r = c_p / c = 1/4$, the computational complexity of PConv is only 1/16 of that of a regular convolution applied to all $c$ channels. To fully utilize information from all channels, PWConv is attached to PConv in a separable manner, as depicted in Figure 7. The resulting computational complexity is given in Equation (3).
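To make the complexity comparison concrete, the following sketch counts multiply-accumulate operations using the expressions given in the FasterNet paper. The function names and the example layer (an 80 × 80 feature map with 64 channels and a 3 × 3 kernel) are assumptions chosen for illustration and are not taken from w-YOLO itself.

```python
def conv_flops(h: int, w: int, k: int, c: int) -> int:
    """FLOPs of a regular k x k convolution with c input and c output channels."""
    return h * w * k * k * c * c


def dwconv_flops(h: int, w: int, k: int, c: int) -> int:
    """FLOPs of a depthwise k x k convolution (one filter per channel)."""
    return h * w * k * k * c


def pconv_flops(h: int, w: int, k: int, c_p: int) -> int:
    """FLOPs of PConv: a regular convolution applied to only c_p channels."""
    return h * w * k * k * c_p * c_p


def pconv_pwconv_flops(h: int, w: int, k: int, c: int, c_p: int) -> int:
    """FLOPs of PConv followed by a point-wise convolution (FasterNet paper form)."""
    return h * w * (k * k * c_p * c_p + c * c_p)


# Hypothetical layer: 80 x 80 feature map, 3 x 3 kernel, 64 channels, ratio 1/4.
h, w, k, c = 80, 80, 3, 64
c_p = c // 4
ratio_vs_regular = pconv_flops(h, w, k, c_p) / conv_flops(h, w, k, c)
print(ratio_vs_regular)  # 0.0625, i.e. 1/16
```

Under these expressions the 1/16 factor follows directly from $(c_p / c)^2$ with $c_p / c = 1/4$, measured against a regular convolution over all channels.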

Multi-Scale Feature Fusion
In the feature fusion network of YOLOv8s, a combination of FPN and PAN is used, which adds a bottom-up aggregation pathway to the top-down pathway of FPN (Figure 8a). However, this structure introduces a significant number of parameters and computations. We replaced it with the BiFPN structure (Figure 8b) as the feature fusion network. This structure learns the importance of different input features and fuses them adaptively. Additionally, the skip connections in BiFPN at the same scale can fuse more features without adding much computational overhead. The expression of BiFPN can be represented by Equations (4) and (5).
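The adaptive weighting in BiFPN can be illustrated with the fast normalized fusion rule from the original BiFPN (EfficientDet) paper. The sketch below is a NumPy illustration under the assumption that the inputs have already been resized to a common shape; the function name is hypothetical, and in a real network the weights would be learnable parameters rather than fixed numbers.

```python
import numpy as np


def fast_normalized_fusion(features, weights, eps: float = 1e-4):
    """Fuse same-shape feature maps with non-negative, normalized weights.

    out = sum_i (w_i / (eps + sum_j w_j)) * I_i, with w_i clipped at zero (ReLU).
    """
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)
    w = w / (eps + w.sum())
    return sum(wi * f for wi, f in zip(w, features))


# Two hypothetical 2 x 2 feature maps fused with equal importance.
a = np.ones((2, 2))
b = 3.0 * np.ones((2, 2))
fused = fast_normalized_fusion([a, b], [1.0, 1.0])
print(fused)
```

Compared with a softmax over the weights, this normalization avoids the exponential and is cheaper to evaluate, which fits the lightweight goal of the model.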
We improved the Neck network's C2f using the FasterNet Block from FasterNet, making it more lightweight. We refer to the improved C2f as C2f-Faster, and its structure is depicted in Figure 9.

Improved Detection Head
Because walnuts occupy only a small number of pixels in the images, little feature information is available, making it difficult to locate small objects on the 80 × 80 feature map of YOLOv8s. Larger feature maps help the model capture detailed information about small objects. Therefore, we added a small object detection head with a size of 160 × 160. We also found that the 20 × 20 detection head in YOLOv8s did not perform well in detecting walnut objects, so we removed it to simplify the model.
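The relationship between detection-head size and downsampling stride can be checked with a short calculation. In this sketch the stride sets are inferred from the head sizes stated in the text for a 640 × 640 input, and the 16-pixel object width is a hypothetical value used only for illustration.

```python
def head_grid_sizes(input_size: int, strides):
    """Map each downsampling stride to the side length of its feature map."""
    return {s: input_size // s for s in strides}


def cells_covered(object_px: float, stride: int) -> float:
    """Number of grid cells an object of the given width spans at one stride."""
    return object_px / stride


YOLOV8S_STRIDES = (8, 16, 32)   # heads of 80 x 80, 40 x 40, 20 x 20
W_YOLO_STRIDES = (4, 8, 16)     # assumption: 160 x 160 added, 20 x 20 removed

print(head_grid_sizes(640, YOLOV8S_STRIDES))
print(head_grid_sizes(640, W_YOLO_STRIDES))
# A hypothetical 16-pixel-wide walnut spans 2 cells at stride 8 but 4 at stride 4.
print(cells_covered(16, 8), cells_covered(16, 4))
```

The finer grid gives each small object more cells, which is the motivation stated above for the additional 160 × 160 head.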
After multiple downsampling operations, there may be information loss in the feature maps, making it difficult for the detection head to distinguish walnut objects from the background. Therefore, we replaced the detection head of YOLOv8s with DyHead, which incorporates a self-attention mechanism. The feature information passed from the Neck network to DyHead is represented as a three-dimensional feature tensor $F \in \mathbb{R}^{L \times S \times C}$, to which scale-aware attention ($\pi_L$), spatial-aware attention ($\pi_S$), and task-aware attention ($\pi_C$) are applied. DyHead integrates these three types of attention, and its expression can be represented as:

$$W(F) = \pi_C\left(\pi_S\left(\pi_L(F) \cdot F\right) \cdot F\right) \cdot F \tag{6}$$

Experimental Setup
All experiments were conducted on a server with a Quadro RTX 6000 GPU, using the PyTorch 2.0.1 framework and CUDA 11.7. The dataset images were all 640 × 640 pixels in size, with the training and validation sets divided in an 8:2 ratio, comprising 1594 and 398 images, respectively. During training, we used the stochastic gradient descent (SGD) optimizer to update the model parameters, with lr0 set to 0.01, lrf set to 0.07, weight decay set to 0.0005, momentum set to 0.917, batch size set to 32, and epochs set to 300. These settings were consistent across all experiments.
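The split sizes and learning-rate settings above can be verified with simple arithmetic. The sketch below assumes that the key names follow the Ultralytics training arguments, where lrf is the fraction of lr0 reached at the end of training; the dictionary is only an illustration and is not the authors' configuration file.

```python
TOTAL_IMAGES = 2490


def split_counts(total: int, fractions):
    """Round each fraction of the dataset to a whole number of images."""
    return [round(total * f) for f in fractions]


# 64% training, 16% validation, 20% testing, as stated in the text.
train_n, val_n, test_n = split_counts(TOTAL_IMAGES, (0.64, 0.16, 0.20))

# Hyperparameters reported in the text; key names are an assumption.
TRAIN_ARGS = {
    "optimizer": "SGD",
    "lr0": 0.01,
    "lrf": 0.07,
    "weight_decay": 0.0005,
    "momentum": 0.917,
    "batch": 32,
    "epochs": 300,
    "imgsz": 640,
}

# Under the Ultralytics convention the final learning rate is lr0 * lrf.
final_lr = TRAIN_ARGS["lr0"] * TRAIN_ARGS["lrf"]
print(train_n, val_n, test_n, final_lr)
```

The training and validation counts reproduce the 1594 and 398 images reported above, and the three subsets sum to the full 2490 images.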

Evaluation Indicators
In this experiment, we use the metrics P (Precision), R (Recall), mAP (mean Average Precision), Parameters, and GFLOPs to evaluate the performance of w-YOLO. The calculation formulas for P, R, and mAP are shown in Equations (7)-(9).
Here, TP represents true positives, FP represents false positives, and FN represents false negatives. mAP denotes the average precision across multiple classes and depends on both precision and recall. mAP0.5 represents the average precision at an IoU threshold of 0.5 for all classes, while mAP0.5:0.95 represents the average precision for all classes at IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. In object detection tasks, a higher mAP value indicates better detection performance, and mAP is a commonly used and authoritative evaluation metric. The F1-score (0 ≤ F1 ≤ 1) measures the balance between precision and recall, as shown in Equation (10). It is the harmonic mean of precision and recall, and a higher F1-score indicates better results.
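The standard definitions behind these metrics can be written out in a few lines. This is a minimal sketch with hypothetical function names and example counts; it uses the usual formulas P = TP / (TP + FP), R = TP / (TP + FN), and F1 = 2PR / (P + R), and boxes are assumed to be given as (x1, y1, x2, y2).

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted objects that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of ground-truth objects that are detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def iou(a, b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds used for mAP0.5:0.95 (0.5 to 0.95 in steps of 0.05).
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Hypothetical counts: 90 correct detections, 10 false alarms, 30 misses.
p, r = precision(90, 10), recall(90, 30)
print(p, r, f1_score(p, r))
```

A detection counts as a true positive when its IoU with a ground-truth box reaches the chosen threshold, which is why mAP0.5:0.95 is stricter than mAP0.5.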
The parameter count serves as a metric for evaluating the complexity and resource consumption of a model. Generally, a higher parameter count indicates a more complex model, requiring more computational resources and memory for training and execution. GFLOPs denotes the number of floating-point operations, in billions, required for one forward pass of the model, and can be used to assess the model's computational complexity.

Experimental Results of w-YOLO on the Walnut Dataset
We trained YOLOv8s and the w-YOLO model, and the training curves are shown in Figure 10. The training results of both YOLOv8s and w-YOLO are the best results obtained when training converged. The localization loss curve of w-YOLO converges faster, indicating that it learns the walnut localization task better and suggesting that w-YOLO predicts walnut locations more accurately. Additionally, in terms of the detection accuracy metric mAP0.5, w-YOLO achieved 0.970, which is 0.004 higher than YOLOv8s (0.966). This indicates that the improvements to the feature extraction network, feature fusion network, and detection layer of YOLOv8s enhanced the detection of walnut objects. Although the improvement in mAP0.5 is slight, it shows that overall detection performance and reliability are preserved while the model size is reduced. For small objects like walnuts, even a small improvement can have a noticeable impact. The ability of the walnut detection system to perform detection tasks accurately and reliably is crucial for practical deployment and application.
We tested YOLOv8s and w-YOLO using 165 walnut test images. Detailed information about these walnut test images is presented in Table 2. We subdivided the complex lighting conditions into facing light, side light, and backlight; examples of walnuts in the test dataset under these three conditions are shown in Figure 11a-c, respectively. Figure 12 illustrates the visual comparison of the detection results between YOLOv8s and w-YOLO. In the first row, YOLOv8s missed detections when walnuts were backlit, whereas w-YOLO maintained stable accuracy even in such extreme lighting environments. In the second row, the circled walnuts are partially occluded and illuminated from the front. Owing to the enhanced feature-capturing capability of w-YOLO, its detection performance is notably better than that of YOLOv8s. When the lighting on the walnut surface is uneven (third row), YOLOv8s struggles to distinguish between the features of leaves and walnuts, while w-YOLO still detects the walnuts accurately under such complex lighting conditions.
Taking into account both lightweight design and detection performance, we applied lightweight optimizations to w-YOLO, significantly reducing the model's parameter count while improving detection accuracy. The comparison between YOLOv8s and w-YOLO on P, R, F1-score, weight file size, and parameters is presented in Table 3. Thanks to the lightweight backbone network and C2f-Faster in the neck, the weight file size of w-YOLO (11.1 MB) was reduced by 50.7%, and its parameters (5.31 M) decreased by 52.3%. w-YOLO achieves a good balance between model size and detection performance, which is valuable for future research on edge computing.
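The relationships between the metrics in this comparison can be sketched in a few lines. The block below is an illustration only: the F1-score is computed with its standard definition (harmonic mean of P and R), the baseline figures are back-computed from the reported percentages rather than quoted from Table 3, and the precision/recall pair used to exercise `f1_score` is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_reduction(baseline: float, reduced: float) -> float:
    """Fraction by which `reduced` is smaller than `baseline`."""
    return (baseline - reduced) / baseline


def implied_baseline(reduced: float, reduction: float) -> float:
    """Baseline value implied by a reduced value and its relative reduction."""
    return reduced / (1.0 - reduction)


# Values reported in the text for w-YOLO.
params_m = 5.31     # parameters, in millions
weights_mb = 11.1   # weight file size, in MB

# Derived (not quoted) YOLOv8s figures implied by the 52.3% and 50.7% reductions.
base_params_m = implied_baseline(params_m, 0.523)
base_weights_mb = implied_baseline(weights_mb, 0.507)

print(f"implied YOLOv8s parameters:  {base_params_m:.2f} M")
print(f"implied YOLOv8s weight file: {base_weights_mb:.1f} MB")

# Hypothetical precision/recall pair, used only to exercise f1_score.
print(f"F1 for P=0.90, R=0.94: {f1_score(0.90, 0.94):.3f}")
```

The back-computed baselines are only as precise as the rounded percentages in the text, so they should be read as approximate.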

Comparison with Other Popular Models
To demonstrate the superiority of w-YOLO in walnut detection under complex lighting conditions, we compared it with other mainstream object detection models, including YOLOv3, YOLOv3-spp, YOLOv5s, YOLOv5m, YOLOv7, YOLOv7-Tiny, YOLOv8s, and YOLOv8m. The results of the comparative experiments are shown in Table 4. Compared to the other models, w-YOLO exhibits the best detection performance and also has significant advantages in terms of model size. In terms of F1-score, w-YOLO belongs to the top tier with a score of 0.92, similar to larger models such as YOLOv3, YOLOv3-spp, and YOLOv5m, which suggests that w-YOLO has stronger generalization ability. Taken together with mAP0.5, this shows that w-YOLO effectively balances recall and precision. Compared to the baseline model YOLOv8s and to models with larger parameter counts, w-YOLO demonstrates higher detection accuracy and recall on the walnut detection task. In terms of model size, w-YOLO has only 5.31 M parameters and a weight file of only 11.1 MB. It is therefore considerably lighter than YOLOv3, YOLOv3-spp, YOLOv5s, YOLOv5m, YOLOv7, YOLOv7-Tiny, YOLOv8s, and YOLOv8m, making it more suitable for deployment on edge computing devices.
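For readers unfamiliar with the 0.5 threshold in mAP0.5, the sketch below shows how predictions are typically matched to ground-truth boxes at an IoU of 0.5 to obtain the true-positive, false-positive, and false-negative counts from which precision and recall follow. It is a simplified illustration, not the evaluation code used in this study: the greedy matching, the box format, and the toy boxes are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(preds: List[Box], truths: List[Box], thr: float = 0.5):
    """Greedy one-to-one matching; `preds` is assumed sorted by descending confidence.

    Returns (true positives, false positives, false negatives).
    """
    unmatched = list(range(len(truths)))
    tp = 0
    for p in preds:
        best_j, best_iou = -1, thr
        for j in unmatched:
            v = iou(p, truths[j])
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            unmatched.remove(best_j)  # each ground-truth box is matched at most once
            tp += 1
    return tp, len(preds) - tp, len(unmatched)


# Toy example: one good detection, one spurious detection, one missed walnut.
truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10), (50, 50, 60, 60)]
tp, fp, fn = match_detections(preds, truths)
print(f"TP={tp} FP={fp} FN={fn} precision={tp / (tp + fp):.2f} recall={tp / (tp + fn):.2f}")
```

Average precision additionally sweeps the confidence threshold and integrates the resulting precision-recall curve; that step is omitted here for brevity.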

Comparison of Detection Visualization with Other Models
In Section 3.3.2, we conducted a quantitative analysis. In this section, we perform a qualitative analysis to visually demonstrate the detection capability of w-YOLO (Figure 13). The test image contains a total of 10 walnut targets: 5 in backlight, 3 in facing light, and 2 receiving only partial illumination.
From the perspective of walnut target detection under different lighting conditions, all models except YOLOv5m identified three walnut fruits illuminated by facing light and two under partial illumination.When detecting walnut targets in backlight conditions, YOLOv5s, YOLOv7, and YOLOv7-Tiny missed three targets, YOLOv3, YOLOv3-spp, YOLOv5m, and YOLOv8s missed two targets, while YOLOv8m and w-YOLO only missed one target.
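The missed-detection counts above can be turned into a per-model backlight recall for this single test image. The sketch below uses only the numbers stated in the text (5 backlit targets and the number each model missed); it describes one image and is not a substitute for the dataset-level metrics in Table 4.

```python
# Number of backlit walnut targets in the test image (from the text).
BACKLIGHT_TARGETS = 5

# Backlit targets missed by each model, as reported in the text.
missed_in_backlight = {
    "YOLOv3": 2,
    "YOLOv3-spp": 2,
    "YOLOv5s": 3,
    "YOLOv5m": 2,
    "YOLOv7": 3,
    "YOLOv7-Tiny": 3,
    "YOLOv8s": 2,
    "YOLOv8m": 1,
    "w-YOLO": 1,
}

# Recall on the backlit subset of this one image.
backlight_recall = {
    model: (BACKLIGHT_TARGETS - missed) / BACKLIGHT_TARGETS
    for model, missed in missed_in_backlight.items()
}

for model, recall in sorted(backlight_recall.items(), key=lambda kv: -kv[1]):
    print(f"{model:12s} {recall:.2f}")
```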

Ablation Experiment
In this section, we discuss in detail the role of each module introduced into the YOLOv8s model. A series of ablation experiments were conducted using YOLOv8s as the baseline model, and the results are shown in Table 5. During the experiments, we sequentially introduced C2f-Faster, BiFPN, FasterNet, DyHead, and S2 to the model. Here, S2 refers to the operation of adding a 160 × 160 detection head and removing the 20 × 20 detection head.

Group A represents the baseline model YOLOv8s. After improving the neck with C2f-Faster (Group B), the model's parameters and weight file size decreased slightly, without negatively affecting walnut detection accuracy. In Group C, the BiFPN feature fusion structure further reduced the size of the model. Although mAP0.5 decreased by 0.001 compared to Group B, the model still maintained its initial performance. Building on Group C, we replaced the model's backbone network with the FasterNet structure (Group D). At this point, the model became significantly lighter, with parameters and weight file size decreasing by 2.53 M and 5.1 MB, respectively. To mitigate the negative impact of lightweighting, we enhanced the model's detection head. In Group E, with DyHead and S2, the model's mAP0.5 increased by 0.007, with only a slight increase in model size.

The Impact of Data Augmentation Parameters on the Model
After the images are input into the w-YOLO model, they first pass through the data augmentation module. Geometry-based data augmentation is equivalent to introducing variations in viewpoint and spatial position within the dataset, thereby enhancing the model's robustness in these respects and improving testing accuracy [47]. To examine the impact of the data augmentation module on the detection performance of w-YOLO, we analyzed the following parameters: image rotation (Degree), image translation (Translate), image scale (Scale), image perspective (Perspective), image flip up-down (Flipud), image flip left-right (Fliplr), image mosaic (Mosaic), image mixup (Mixup), and segment copy-paste (Copy_paste). The corresponding experimental results are shown in Figure 14a-i. The figures show that setting Degree, Translate, Scale, Flipud, Fliplr, and Mosaic to −5, 0.45, 0.7, 0.7, 0.5, and 0.7, respectively, has a beneficial effect on the model. However, using Perspective, Mixup, and Copy_paste may have a negative impact on the model.
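The settings reported above can be collected into a single configuration, as in the sketch below. The lower-case key names follow the hyperparameter naming commonly used in Ultralytics-style YOLO training configs, which is an assumption about the training setup; setting Perspective, Mixup, and Copy_paste to zero is likewise an assumption drawn from the observation that they may hurt the model. The `check_probabilities` helper is hypothetical.

```python
# Augmentation values the text reports as beneficial for w-YOLO.
augmentation = {
    "degrees": -5,      # image rotation (Degree), value as reported
    "translate": 0.45,  # image translation (Translate)
    "scale": 0.7,       # image scale (Scale)
    "flipud": 0.7,      # probability of up-down flip (Flipud)
    "fliplr": 0.5,      # probability of left-right flip (Fliplr)
    "mosaic": 0.7,      # probability of mosaic (Mosaic)
    # Assumption: disabled, because the text reports a possible negative impact.
    "perspective": 0.0,
    "mixup": 0.0,
    "copy_paste": 0.0,
}

# Keys that are interpreted as probabilities and must therefore lie in [0, 1].
PROBABILITY_KEYS = ("flipud", "fliplr", "mosaic", "mixup", "copy_paste")


def check_probabilities(cfg: dict) -> None:
    """Hypothetical sanity check: raise if a probability-type value is out of range."""
    for key in PROBABILITY_KEYS:
        if not 0.0 <= cfg[key] <= 1.0:
            raise ValueError(f"{key}={cfg[key]} is not a valid probability")


check_probabilities(augmentation)
for name, value in augmentation.items():
    print(f"{name:12s} {value}")
```

In an Ultralytics-style workflow, a dictionary like this would typically be passed as keyword arguments to the training call; the exact invocation depends on the framework version and is not specified in the text.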

Detection Layer Analysis
One of the drawbacks of the detection head in the YOLO algorithm is that, since the detection head typically operates at the final layer of the network, it may miss some low-level detailed information.This can result in lower detection accuracy, particularly for small objects or in complex scenes [48,49].Therefore, we redesigned the detection head of YOLO and thoroughly analyzed the performance of the detection head in w-YOLO.

Effect of the Number of DyHead Blocks on Model Performance
To explore the impact of the number of DyHead blocks (Block_num) on the detection performance of w-YOLO, experiments were conducted with 1, 2, 3, 4, and 5 DyHead blocks. Table 6 shows that the model's complexity increases with the number of DyHead blocks. With a single block, the model achieves the best mAP0.5 (0.970) together with the lowest values for Layers, Parameters, GFLOPs, and weight file size. This indicates that adding more DyHead blocks does not necessarily strengthen the model's feature-capturing capability, and that increasing the depth of the model could even have a negative impact on walnut detection.

Walnut detection in UAV remote sensing images is a small object detection task. Therefore, in the design of w-YOLO, an additional 160 × 160 detection layer tailored to small objects was added to handle smaller walnut targets. To further compress w-YOLO, the 20 × 20 detection layer, which is suited to larger targets, was removed.
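The relationship between detection layer size and object scale can be made concrete with a short calculation. The sketch below assumes a 640 × 640 network input, which is inferred from the grid sizes (160 × 4 = 20 × 32 = 640) and is not stated in this section; the stride lists are likewise inferred from those grid sizes.

```python
INPUT_SIZE = 640  # assumption: inferred from the 160 x 160 and 20 x 20 grid sizes


def grid_size(input_size: int, stride: int) -> int:
    """Side length of the feature map produced at a given downsampling stride."""
    return input_size // stride


# Strides inferred from the grid sizes named in the text.
baseline_strides = [8, 16, 32]  # 80 x 80, 40 x 40, 20 x 20 heads
s2_strides = [4, 8, 16]         # 160 x 160 head added, 20 x 20 head removed

for name, strides in (("baseline", baseline_strides), ("S2", s2_strides)):
    grids = [grid_size(INPUT_SIZE, s) for s in strides]
    cells = sum(g * g for g in grids)
    print(f"{name:9s} grids={grids} total cells={cells}")

# Each cell of the 160 x 160 head covers a 4 x 4 pixel patch of the input,
# versus 32 x 32 pixels for the removed 20 x 20 head.
print(f"finest cell footprint: {s2_strides[0]} px, removed head footprint: {baseline_strides[-1]} px")
```

The larger number of grid cells in the S2 configuration is consistent with the increase in GFLOPs reported for this design, although the exact cost depends on the channel widths of each layer.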
In this section, we compare in detail the experimental results for different combinations of detection head sizes to investigate their impact on the model's detection performance and parameter count. Using YOLOv8s as the baseline, the experimental results are presented in Table 7. The table shows that P (0.928) and mAP0.5 (0.968) in Group C outperform those in Groups A and B, while the number of layers and the GFLOPs increased by only 13 and 5.7, respectively. Overall, the size configuration of Group C effectively balances the model's detection performance and complexity.

Optimizers and learning rates play crucial roles in the model training process. Different optimizers can have varied effects on the model's performance, and an appropriate optimizer facilitates faster and more stable convergence during training. Similarly, setting the learning rate appropriately (Table 8) enables the model to converge to the optimal solution more quickly, accelerating the training process. Figure 15 depicts the training loss curves obtained with different optimizers. The zoomed-in plots show that SGD converges notably faster, followed by Adamax. Towards the end of the curves, w-YOLO trained with the SGD optimizer stops training early at around 200 epochs, while the curve for Adamax continues to descend. The training loss for w-YOLO using Adamax is lower, which is advantageous for detecting walnut objects.
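To illustrate what a "final learning rate" setting controls, the sketch below implements a linear decay from an initial rate to a final fraction of it, a schedule commonly used in YOLO-style training. Whether the experiments in Table 8 used exactly this schedule is an assumption, and the values of `lr0`, `lrf`, and `epochs` below are hypothetical placeholders rather than settings from this study.

```python
def linear_lr(epoch: int, epochs: int, lr0: float, lrf: float) -> float:
    """Learning rate at `epoch`, decaying linearly from lr0 to lr0 * lrf.

    lr0 is the initial learning rate and lrf is the final rate expressed as a
    fraction of lr0 (both names are assumptions borrowed from YOLO-style configs).
    """
    factor = max(1 - epoch / epochs, 0) * (1.0 - lrf) + lrf
    return lr0 * factor


# Hypothetical settings, for illustration only.
epochs, lr0, lrf = 300, 0.01, 0.01

for epoch in (0, 100, 200, 300):
    print(f"epoch {epoch:3d}: lr = {linear_lr(epoch, epochs, lr0, lrf):.6f}")
```

A smaller `lrf` makes the late-training steps finer, which can lower the final loss at the cost of slower progress; this is one reason the final learning rate is worth tuning alongside the optimizer.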

Model Performance Advantages and Limitations
Although w-YOLO achieves a certain degree of lightweight design without sacrificing detection accuracy, our research still has the following limitations: (1) With the original detection head sizes, the model requires only 28.4 GFLOPs, whereas the S2 design requires 34.1 GFLOPs; in designing S2 we prioritized parameter count and mAP0.5, which increases the computational load of the model. (2) Although w-YOLO outperforms YOLOv3, YOLOv3-spp, YOLOv5s, YOLOv5m, YOLOv7, YOLOv7-Tiny, and YOLOv8s in detecting walnut targets under backlight conditions, there are still cases of missed detections. (3) Although we made some progress in walnut detection under facing light, side light, and backlight conditions, we did not further analyze the detection of occluded walnuts. In future research, we will investigate the impact of walnut occlusion on object detection more deeply and continue to optimize the detection performance of w-YOLO to make it applicable to a wider range of walnut detection tasks.

Conclusions
In walnut production, yield prediction is a crucial step, and traditional manual counting methods face significant challenges in hilly areas. Given the advantages of deep learning models and low-altitude remote sensing technology in agricultural production, we constructed a walnut small object dataset using high-resolution aerial images captured by UAVs, addressing the problem of data scarcity in this research field. The dataset consists of 2490 images containing a total of 12,138 walnut targets. In hilly areas, the complex lighting conditions experienced by walnut fruits during UAV data collection affect the accuracy of the model to some extent. Therefore, based on the YOLOv8s model, we made a series of improvements to obtain w-YOLO, including the use of FasterNet, C2f-Faster, and BiFPN to simplify the model's feature extraction and fusion networks, reducing parameters by 6.37 M and shrinking the weight file size to 9.8 MB. Additionally, we employed a DyHead detection layer with attention mechanisms and redesigned a detection head combination more suitable for walnut identification. In the walnut recognition task under the complex lighting conditions of UAV remote sensing images, w-YOLO achieved a mAP0.5 of 97%, an increase of 0.4 percentage points compared to YOLOv8s, with parameters and weight file size reduced by 52.3% and 50.7%, respectively. It is worth noting that our study focuses on making the model lightweight and enabling w-YOLO to adapt to walnut fruit detection under different lighting conditions. The detection performance of w-YOLO under backlighting improved significantly compared to the original model, but missed detections still occur, so walnut identification under backlighting remains challenging. Furthermore, w-YOLO has shown excellent detection results under facing and side lighting. We believe that the lightweight w-YOLO can provide valuable assistance for walnut production management and support the development of edge hardware devices for walnut detection.
However, we recognize that there is still significant room for improvement in the robustness of walnut recognition models.Therefore, our walnut dataset still needs to be further expanded, such as adding walnut data in different occlusion scenarios and multispectral walnut image data.In future research, we will also conduct radar-based three-dimensional modeling of walnut forests and calculate vegetation indices to provide more valuable resources for walnut agriculture production research.

Figure 1. The basic flowchart of the w-YOLO model. Unlike YOLOv8, this model utilizes FasterNet for feature extraction and BiFPN for feature fusion. The blue dashed lines represent the transmission of feature information from the backbone network to the neck network, while the blue solid lines indicate the direction of feature information transmission within the neck network.

Figure 2. The workflow of the research in this paper. Walnut tree images are captured using UAVs in the nadir view, and after preprocessing, they form the dataset and labels. These images are then used to train the walnut detection model, and the best parameter combination is utilized to achieve optimal detection performance.

Figure 3. The study area and its location in this paper.

Figure 4. Walnut tree images from the perspective of a UAV.

Figure 5. The structure of the w-YOLO model.

Figure 6. The structure of FasterNet and Partial Convolution.

Figure 7. The combination approach of Partial Convolution and Point-Wise Convolution.

Figure 8. The structure of the model's feature fusion network. (a) The PAN + FPN structure. (b) The BiFPN structure.

Figure 10. Comparison of the training curves of YOLOv8s and the w-YOLO model. (a) Loss curves of YOLOv8s and the w-YOLO model; (b) mAP0.5 curves of YOLOv8s and the w-YOLO model.

Figure 11. Walnuts under complex lighting conditions. The walnuts in the red circles represent examples of walnuts under different lighting conditions. (a) Facing light. (b) Side light. (c) Backlight.

Figure 12. Visualization comparison of detection results between YOLOv8s and w-YOLO. The walnuts circled in yellow represent the walnuts missed by YOLOv8s, while those circled in green represent the detection results of w-YOLO for the walnuts missed by YOLOv8s. (a) Original image; (b) Detection result of YOLOv8s; (c) Detection result of w-YOLO.

Figure 13. The visual detection results of w-YOLO compared to other mainstream models. The green boxes in the Ground Truth represent all walnut targets in this test image.

Figure 14. mAP0.5 values of w-YOLO under different parameters of various data augmentation methods. (a-i) represent the experimental results of image rotation, image translation, image scale, image perspective, image flip up-down, image flip left-right, image mosaic, image mixup, and segment copy-paste, respectively.

Figure 15. Training loss curves of different optimizers.

Table 1. The detailed information of the walnut dataset.

Table 2. The total number of targets in the test images and the number of targets under different lighting conditions.

Table 3. Comparison results between YOLOv8s and w-YOLO.

Table 4. The comparative experimental results between w-YOLO and other mainstream models.

Table 6. Comparison experiment of different numbers of DyHead blocks.

Table 7. Ablation experiment of different detection layers.

Table 8. Results of mAP0.5 under different final learning rate values.