3.1. Research Design and Positioning
This study focuses on dataset construction and benchmark evaluation for precision agriculture supervision applications. Specifically, it targets the data resources and reproducible experimental baselines required for agricultural plant protection UAV spraying operation state recognition, rather than the engineering development process of a specific platform system. In terms of research depth and scope, our work is positioned at the application and engineering support level within the research pyramid. The goal is to provide a publicly reusable dataset, a clear annotation paradigm, and reproducible performance baselines to facilitate subsequent algorithmic research and system deployment.
From the perspective of research type and methodology, this paper constitutes an application-oriented quantitative empirical study. We adopt a data-driven paradigm to build the dataset through standardized procedures for data collection, cleaning, and annotation, and we conduct objective evaluations using a unified metric system and multi-model comparative experiments. With respect to research objectives, this work pursues the following three main goals: first, to describe and formalize the key data attributes required by agricultural spraying supervision tasks; second, to establish performance baselines of mainstream object detection models for this task; and third, to provide diagnostic insights for subsequent robustness improvements through cross-background comparisons and quality assessment analyses. The primary reasoning approach in this study is inductive inference.
Regarding data sources and the temporal dimension, our data are derived from publicly available online video materials collected from multiple platforms. We retrieve raw videos by keyword-based searches and maximize diversity in scenarios and imaging conditions without restricting geographic regions or time periods. We then perform frame extraction, screening, and verification to form the final data resource. In terms of research design, we follow a sequential workflow comprising demand and gap analysis, dataset design rationale, data acquisition and quality control, annotation scheme design, data statistics and quality assessment, and benchmark evaluation. We explicitly document reproducible technical details at each stage.
3.1.1. Dataset Design Rationale
In agricultural spraying supervision scenarios, the key challenges for visual recognition mainly include substantial target scale variation, the frequent occurrence of motion blur during imaging, and complex agricultural backgrounds. These factors directly affect the visibility of cues for spraying state identification and the model’s ability to generalize across scenarios. Therefore, we treat these challenges as design constraints and sampling criteria for the dataset, and we summarize them in Figure 1.
1. Large target scale variation: During agricultural UAV operations, the shooting distance can range from a few meters to several hundred meters, resulting in highly uneven pixel occupancy of the target in images [13,14,15]. Small object detection has long been a challenging problem in computer vision, and it becomes even more pronounced in agricultural scenarios. Conventional feature extraction methods often fail when dealing with extremely small targets, which motivates the need for dedicated datasets to train and optimize algorithms;
2. Presence of motion blur: When performing spraying operations, UAVs typically maintain high flight speeds. Together with wind disturbances in farmland environments and camera shake, the collected images commonly exhibit motion blur [16,17,18,19]. This blur not only degrades bounding box localization accuracy but, more importantly, interferes with correct spraying state recognition, because the spray plume appearance itself is a key visual cue for determining the operational state;
3. Complex and diverse background environments: Agricultural production environments include green cropland, bare farmland, orchard, woodland, mountainous terrain, and other scenarios, with substantial differences in illumination conditions, texture characteristics, and color distributions across backgrounds. Complex backgrounds not only increase the difficulty of object detection but, more critically, affect model generalization ability [20,21,22]. A model trained in a single scenario often struggles to adapt to diverse agricultural environments.
Based on these considerations, we adopt a dataset design strategy that emphasizes multi-scale targets, multi-background coverage, and the preservation of real-world degradations. Meanwhile, the core supervision question addressed in this work is whether the UAV is performing effective spraying at a given moment. Therefore, we model the operational state as a binary variable with the following two classes: spraying and flying without spraying. This binary formulation offers two advantages. First, spray plumes and droplet trajectories constitute relatively stable and observable visual evidence in most public video viewpoints, which can substantially reduce annotation subjectivity and improve inter-annotator consistency. Second, finer-grained states, such as about to start spraying, just stopped spraying, loitering while waiting, or transitioning between fields, often lack reliable visible cues under long-distance imaging, small target sizes, and motion blur. Forcing such fine categorization would introduce high-noise labels and weaken the reproducibility of the dataset as a benchmark.
It should be noted that flying without spraying semantically covers all non-spraying flight phases, such as transiting, returning, repositioning, and waiting for refill. This definition aligns with the supervision task while leaving room for future versions to extend toward more fine-grained behavior labels within the binary framework.
3.1.2. Dataset Gap Analysis
To clarify the necessity of the proposed dataset for the task of agricultural spraying operation state recognition, we compare representative public UAV and aerial vision datasets with our dataset in terms of key attributes, as shown in Table 1. We further provide intuitive evidence using example images, as shown in Figure 2.
The comparison in Table 1 indicates that existing datasets differ in their emphasis on dataset scale, data sources, and multi-view capabilities. For instance, TartanAviation [10] and M3D [11] offer larger-scale data or cross-domain settings. Nevertheless, their annotation schemes still primarily target general detection and localization. Even when these datasets provide bounding boxes or multimodal information, they generally lack the operation state annotations required for agricultural spraying supervision. In other words, such datasets can effectively answer the questions of whether a UAV exists and where it is located, but they cannot directly address the supervision-oriented questions that matter in agriculture, such as whether the UAV is spraying and whether unauthorized or abnormal operations occur. This limitation at the data attribute level constrains both model training and reproducible benchmarking for operation state recognition.
Beyond annotation attributes, Table 1 also reveals differences in application scenario suitability. Existing datasets are mostly designed for security surveillance, airspace management, or aviation operations, and their background compositions systematically deviate from agricultural ecological backgrounds. As a result, the feature representations learned from these data do not necessarily transfer to farmland spraying scenarios.
Figure 2 further illustrates this mismatch from a visual perspective. For example, sample images in the AC dataset [8] tend to focus on backgrounds such as sky and airport runways, where textures are relatively simple and illumination is comparatively stable. Although the Real World dataset [9] includes a rural category, its rural backgrounds are still dominated by manmade structures such as buildings and roads. The examples from TartanAviation [10] and M3D [11] similarly reflect their emphasis on aviation operations or cross-domain micro aerial vehicle detection in terms of both backgrounds and task definitions. In contrast, agricultural spraying scenes often involve high-frequency crop canopy textures, occlusions by orchard branches and leaves, undulating mountainous terrain, and strong illumination variations. These factors can produce distractors that resemble local UAV appearances, substantially increasing the risk of false detections and imposing higher requirements on cross-scene generalization.
Based on the attribute comparisons in Table 1 and the intuitive evidence in Figure 2, the design necessity of our dataset can be summarized into three points, which directly motivate the subsequent sections. First, agricultural supervision requires elevating the recognition target from UAV presence to whether a spraying operational state occurs. Therefore, we must introduce two state semantic labels, spraying and flying without spraying, and annotate them jointly with bounding boxes, so that the dataset can directly support training and evaluation for operation state recognition. Second, to reduce model reliance on background-specific statistical patterns and to improve cross-scene generalization, data sampling must cover multiple representative agricultural ecological backgrounds. Accordingly, we conduct systematic sampling across green cropland, bare farmland, orchard, woodland, mountainous terrain, and sky. Third, because long-distance observation, small targets, and imaging degradations are common in agricultural operations, dataset construction should retain task-relevant degraded samples and report reproducible experimental baselines through unified benchmark evaluations. This approach provides a diagnostic baseline for subsequent robustness improvements.
3.2. Dataset Construction
To construct a dedicated dataset for agricultural UAV spraying behavior recognition, a systematic approach was adopted. As shown in Figure 3, the dataset construction process is divided into the following four sequential stages: multi-source data collection, quality-oriented preprocessing, data annotation, and data augmentation and evaluation.
3.2.1. Data Collection and Preliminary Processing
The dataset was obtained from multiple online platforms, and no restrictions were imposed on geographic regions or time periods during collection to maximize diversity. The primary search keywords during data collection included terms such as “agricultural UAV spraying”, “plant protection UAV operations”, and “pesticide spraying”.
As shown in Table 2, a total of 71 videos covering various agricultural scenarios were obtained, from which 240 sets of valid independent image sequences were extracted. The extracted frames retained their original resolution without standardization, deliberately preserving heterogeneity introduced by different devices and shooting conditions to ensure that the trained models can adapt to the equipment diversity of real-world agricultural monitoring systems. In the preprocessing stage, frames without UAV targets were first removed through manual screening. Frames with severe quality defects were subsequently excluded; however, blurred samples were deliberately retained, as shown in Figure 4b, because motion blur represents a critical feature of high-speed spraying operations rather than an artifact to be eliminated. Among the 65,838 frames extracted from the source videos, 9798 frames were retained after preliminary screening, and 9548 frames were ultimately annotated following verification and review.
3.2.2. Background Classification
As shown in Table 3, the dataset encompasses the following six types of ecological backgrounds: green cropland with 3197 images (33.48%), sky with 1931 images (20.22%), orchard with 1481 images (15.51%), mountainous terrain with 1321 images (13.84%), woodland with 907 images (9.50%), and bare farmland with 711 images (7.45%).
These backgrounds reflect the measured operational frequency in precision agriculture, as farmland scenes dominate spraying activities due to active pest–crop interactions during the growth period, as shown in Table 4. Green cropland backgrounds exhibit complex leaf vein textures and predominantly green tones, which differ from urban surveillance scenes and may introduce color-based interference. In contrast, sky backgrounds have relatively uniform colors with minimal interference, allowing for the separation of requirements between target detection and background suppression. In orchard scenarios, regular occlusion patterns occur; for example, periodically arranged tree canopies produce predictable shadow shapes. Woodland scenarios, however, involve irregular occlusion patterns, where randomly distributed branches and leaves are highly affected by illumination changes. Mountainous backgrounds feature steep slopes and terraced structures, which can induce perspective distortion and pose challenges for scale-invariant detection mechanisms.
This multi-background classification systematically samples the visual feature space of Chinese agricultural landscapes, encompassing both plain and hilly terrains. The low proportion of bare farmland (7.45%) aligns with the agricultural cycle, as exposed soil corresponds to fallow periods during which spraying operations are reduced. This ecological environment-based data imbalance brings training closer to real-world conditions, enabling the model to learn statistical weighting experience that allocates computational resources according to the frequency of different scene occurrences.
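The category proportions quoted from Table 3 can be reproduced directly from the raw counts. The short check below (plain Python; the dictionary keys are our own identifiers, not dataset directory names) verifies that the six categories sum to the 9548 annotated frames and that the percentages match:

```python
# Background category counts from Table 3 of the dataset.
counts = {
    "green_cropland": 3197,
    "sky": 1931,
    "orchard": 1481,
    "mountainous_terrain": 1321,
    "woodland": 907,
    "bare_farmland": 711,
}

total = sum(counts.values())  # total annotated frames
# Percentage share per category, rounded to two decimals as in Table 3.
shares = {k: round(100 * v / total, 2) for k, v in counts.items()}
```

Running this confirms the totals and the imbalance discussed in the text (green cropland at roughly one third, bare farmland under 8%).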
3.2.3. Data Annotation
The annotation strategy employs a binary classification framework distinguishing “spraying” and “flying without spraying” states, implemented according to visual standards. The “spraying” label is assigned only when visible motion traces of UAV nozzle droplets or atomized particles appear in the frame, while accommodating variations in spray visibility caused by illumination and viewing angles.
Figure 4 presents typical samples, as follows: (a) small target samples observed from a long-distance aerial perspective; (b) motion-blurred samples resulting from relative movement between the camera and UAV; and (c) complex environmental samples across six background categories. All samples are simultaneously annotated with rectangular bounding box localization and semantic state labels.
To ensure consistent decisions among different annotators under occlusion and near the start or end of a sequence, we define unified rules for spray evidence visibility and transition frames as follows:
1. Partial visibility of spray: If the spray is visible only in part of the image, or appears as a weak yet still recognizable spray texture and trajectory, the frame is still annotated as spraying;
2. Occluded nozzle: When the UAV body, including the nozzle area, is partially occluded by branches, leaves, or facilities, the annotation follows the visible spray evidence. If the spray plume remains visible, the frame is annotated as spraying. If neither spray evidence nor the nozzle area can be reliably recognized, the frame is annotated as flying without spraying;
3. Transition frames at start and stop: We do not introduce an additional transition state category. Instead, we assign frame-level binary labels. For the same image sequence, the first frame in which recognizable spray evidence appears is defined as the spraying start frame. The first frame in which spray evidence no longer appears is defined as the first frame after spraying stops. This rule avoids subjective inference based on invisible factors, such as valve opening or closing, and improves annotation reproducibility;
4. Handling uncertain frames: If severe occlusion, exposure issues, or extreme blur prevent reliable identification of spray evidence, we label the frame as an uncertain sample during the review stage and exclude it from the final annotated set. This strategy controls the impact of label noise on model learning and on the determination of sequence transitions.
As shown in Table 2, the final dataset contains 5687 bounding boxes labeled as “spraying” and 4027 labeled as “flying without spraying,” with a ratio of approximately 1.41:1. This nearly balanced distribution offers dual advantages. On one hand, it effectively mitigates the common class imbalance problem in supervised learning; on the other hand, it realistically reflects UAV operational conditions, where substantial flight time is consumed during non-spraying activities such as transit, repositioning, and pesticide refilling. Moreover, this dual-state annotation directly addresses the deficiencies in operational state recognition identified in the existing datasets of Table 1, elevating the detection task from mere spatial localization to the higher-level perception and recognition of UAV operational behaviors.
To ensure the authenticity and reliability of the annotated data, the preliminary annotations were subjected to random sampling review, and any inconsistent samples identified during the review were re-annotated across their entire sequences.
3.2.4. Data Augmentation and Dataset Structure
To enhance the generalization capability of the model, geometric augmentation techniques were applied after annotation. Specifically, rotation was used to simulate UAV tilted flight states. It is noteworthy that Gaussian blur or chromatic distortion augmentation methods were deliberately avoided. This is because their use could confound the inherent motion blur and illumination variations present in the real samples. During augmentation, coordinate transformation matrices were used to preserve annotation integrity, ensuring that bounding boxes remained accurately aligned with geometrically transformed targets.
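The coordinate-transformation step can be illustrated with a minimal NumPy sketch (the function name is ours, and the axis-aligned envelope of the rotated corners is one common convention; the actual pipeline may differ in details such as image-border clipping):

```python
import numpy as np

def rotate_bbox(box, angle_deg, center):
    """Rotate an axis-aligned box (x1, y1, x2, y2) about `center` by
    `angle_deg` and return the axis-aligned envelope of the rotated
    corners, so the label stays aligned with the transformed target."""
    x1, y1, x2, y2 = box
    corners = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]], dtype=float)
    t = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(t), -np.sin(t)],
                    [np.sin(t),  np.cos(t)]])
    # Rotate corners about the chosen center, then take the tight envelope.
    rotated = (corners - center) @ rot.T + np.asarray(center, dtype=float)
    xs, ys = rotated[:, 0], rotated[:, 1]
    return float(xs.min()), float(ys.min()), float(xs.max()), float(ys.max())
```

Rotating a square box by 90° about its own center leaves it unchanged, which is a convenient sanity check for the transformation matrix.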
Figure 5 illustrates the three-level directory structure of the dataset. The top-level directory is named “Drone_Spraying_Dataset”, which contains the following two subdirectories: “Raw_Data” and “Annotations”. The “Raw_Data” directory stores the original video frames organized by background category; for example, frames corresponding to the “Green_Cropland” background are placed in the respective folder. The “Annotations” directory contains label files in JSON and TXT formats that comply with the field specifications listed in Table 5.
Table 5 provides a detailed and comprehensive specification of the annotation format. It covers image metadata, including file paths, image dimensions, and Base64 pixel encoding; object attributes, including bounding box coordinates, UAV state labels, and shape descriptors; and global parameters such as the version of the annotation tool and the default color scheme. The annotation files are provided in JSON and TXT formats to ensure cross-platform compatibility and seamless integration with common frameworks such as PyTorch 2.9.0 and TensorFlow v2.16.1. A key innovation of this annotation scheme is the inclusion of Base64 image encoding within the annotation file, creating a self-contained label document. Consequently, there is no dependency on external image directories, which simplifies dataset distribution and version control.
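A self-contained label record of this kind can be assembled as sketched below. The field names here follow common LabelMe-style conventions and are illustrative only; the authoritative schema is the one specified in Table 5:

```python
import base64
import json

def build_annotation(image_path, image_bytes, width, height, boxes):
    """Assemble a self-contained annotation record. `boxes` is a list of
    (label, x1, y1, x2, y2) tuples. Embedding Base64-encoded pixels in
    `imageData` removes the dependency on an external image directory."""
    return {
        "version": "5.x",                 # annotation-tool version (placeholder)
        "imagePath": image_path,
        "imageWidth": width,
        "imageHeight": height,
        # Base64 encoding makes the JSON file a complete, portable document.
        "imageData": base64.b64encode(image_bytes).decode("ascii"),
        "shapes": [
            {"label": label,
             "shape_type": "rectangle",
             "points": [[x1, y1], [x2, y2]]}
            for label, x1, y1, x2, y2 in boxes
        ],
    }
```

Because the record is plain JSON, it round-trips through `json.dumps`/`json.loads` unchanged, which is what makes distribution and version control straightforward.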
In summary, the dataset construction methodology integrates ecological sampling strategies, rigorous quality control, and behavior-aware annotation to create a dedicated resource tailored for UAV monitoring in agriculture. The final 9548 annotated frames encompass six background types and dual operational states, accompanied by complete metadata encoding, providing a solid foundation for developing precision agriculture monitoring systems capable of real-time spraying state recognition and temporal compliance assessment.
3.3. Dataset Evaluation
In this section, the quality and diversity of the agricultural UAV spraying behavior recognition dataset are systematically evaluated through quantitative analysis of target scale distributions and image quality metrics across different background categories. This evaluation aims to verify whether the dataset adequately addresses the following three major challenges highlighted in the introduction: large variations in target scale, motion blur, and complex background environments.
As shown in Figure 6, agricultural UAV monitoring scenarios exhibit pronounced extreme scale variation characteristics. The figure compares visual differences under identical background conditions at varying observation distances. In the close-range case in Figure 6a, UAV targets occupy bounding boxes of 102 × 128 pixels, representing 5.67% of the total image area; in contrast, as shown in Figure 6b, the targets occupy only 37 × 38 pixels, corresponding to 0.61% of the total image area. Within the same background category, the area ratio between near and far targets differs by a factor of up to 9.3. This confirms that the dataset can simulate the perceptual requirements of fixed monitoring equipment in actual spraying operations, where continuous target detection is necessary throughout the operational trajectory from close-range approach to distant observation. However, such significant scale differences pose challenges to conventional feature extraction pipelines. Near-range UAV targets retain sufficient structural details. Conversely, distant targets degrade into approximately point-like objects, causing feature extraction to become ineffective and preventing accurate target recognition.
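The quoted area percentages and the 9.3× ratio follow from simple arithmetic. The sketch below reproduces them assuming, for illustration only, a 640 × 360 frame (the frame resolution is our assumption and is not stated in the text):

```python
def area_share(box_w, box_h, img_w, img_h):
    """Percentage of the frame area occupied by a box_w x box_h bounding box."""
    return 100.0 * (box_w * box_h) / (img_w * img_h)

# Illustrative frame size (assumed): 640 x 360 pixels.
near = area_share(102, 128, 640, 360)  # close-range target from Figure 6a
far = area_share(37, 38, 640, 360)     # distant target from Figure 6b
ratio = near / far                     # near-to-far area ratio
```

Under this assumed resolution, `near` rounds to 5.67%, `far` to 0.61%, and their ratio to 9.3, matching the figures above.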
Figure 6 presents only one set of examples. Building on this, Figure 7 combines the target area distribution statistics with the two-dimensional width and height distribution to reveal the small object dominance and long tail characteristics of the dataset from both one-dimensional and two-dimensional perspectives.

Figure 7a shows the frequency distribution of annotated bounding box areas across the entire dataset. The overall distribution exhibits a pronounced long tail. A large number of instances cluster in the small area range, and the frequency decays rapidly as the area increases, with only a few extremely large targets forming the right tail. Moreover, the distribution indicates that targets with an area smaller than 10,000 pixels² constitute the majority, whereas the proportion of targets larger than 100,000 pixels² is very low. This suggests that most detections occur under medium-to-long-range viewing conditions, where UAVs typically appear as small-scale targets in the image.

Figure 7b further characterizes the target size distribution in the two-dimensional space of width and height. The color encodes the logarithmic scale of target density to prevent high-frequency small targets from masking information from low-frequency large targets. The heatmap shows that the high-density region is mainly concentrated in the lower left corner, namely the range where both width and height are small, and gradually spreads toward the upper right along an approximately positive correlation between width and height. This pattern indicates continuous scale variation across different shooting distances rather than concentration at a single scale. Meanwhile, only a small number of sparse samples appear in the upper-right region, corresponding to larger targets observed under close-range or high-resolution conditions, which is consistent with the long tail phenomenon in Figure 7a.
These scale distribution characteristics directly affect algorithm evaluation. Because most instances are small targets with substantial cross-scale variation, overall model performance depends largely on small object feature extraction and multi-scale representation. If a detection framework lacks effective multi-scale feature fusion or is insensitive to small targets, performance degradation typically first emerges in the small target region, which accounts for the largest proportion of instances.
To investigate how the visual complexity of this dataset varies across different scenarios, we analyze full-frame images. Specifically, we first convert each original image to grayscale and then compute the following four sharpness metrics over the full resolution frame: Laplacian variance, Tenengrad, Brenner, and SMD2. This setup does not rely on cropping based on UAV bounding boxes. We compute these metrics at the full-frame level because this section focuses on differences in overall visual complexity and imaging quality across background categories, including the combined effects of texture complexity and potential motion blur. Consequently, the results better reflect the degree to which backgrounds interfere with detection in real monitoring footage. Therefore, the metric distributions shown in Figure 8 and Figure 9 can be interpreted as differences in overall image-level sharpness and edge response strength across background categories, where larger values typically indicate stronger edge or gradient responses. The specific formulas are given as follows:
Laplacian Variance computes the variance of the response values obtained by applying a discrete Laplacian operator to the image, reflecting the overall edge sharpness:

$$\mathrm{LAP} = \frac{1}{MN}\sum_{x=1}^{M}\sum_{y=1}^{N}\left[L(x,y)-\mu_L\right]^2$$

Here, $L(x,y)$ represents the result of the Laplacian filtering, $\mu_L$ denotes its mean, and $M \times N$ corresponds to the image dimensions;
The Tenengrad gradient function computes the gradient magnitude energy based on the Sobel operator, emphasizing local contrast:

$$\mathrm{TEN} = \sum_{x}\sum_{y}\left[G_x(x,y)^2 + G_y(x,y)^2\right]$$

Here, $G_x(x,y)$ and $G_y(x,y)$ represent the Sobel gradients in the horizontal and vertical directions, respectively;
The Brenner gradient function calculates the sum of squared gray-level differences between pixels separated by two positions, and is sensitive to blur in the horizontal direction:

$$\mathrm{BRE} = \sum_{x}\sum_{y}\left[I(x+2,y)-I(x,y)\right]^2$$
The SMD2 metric constructs enhanced Laplacian (second-difference) responses in the horizontal and vertical directions separately, and then sums the square of the sum of their absolute values:

$$\mathrm{SMD2} = \sum_{x}\sum_{y}\left[\left|L_x(x,y)\right| + \left|L_y(x,y)\right|\right]^2$$

where $L_x(x,y) = I(x+1,y) - 2I(x,y) + I(x-1,y)$ and $L_y(x,y) = I(x,y+1) - 2I(x,y) + I(x,y-1)$.
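For reproducibility, the four metrics can be sketched in NumPy alone. Kernel choices and border handling below are our assumptions and may differ from the exact computation used in the paper; in particular, SMD2 follows the second-difference reading of the verbal description above:

```python
import numpy as np

def laplacian_variance(img):
    """Variance of the 3x3 discrete Laplacian response (valid region only)."""
    lap = (img[:-2, 1:-1] + img[2:, 1:-1] + img[1:-1, :-2]
           + img[1:-1, 2:] - 4.0 * img[1:-1, 1:-1])
    return float(lap.var())

def tenengrad(img):
    """Sum of squared Sobel gradient magnitudes over the valid region."""
    gx = ((img[:-2, 2:] + 2.0 * img[1:-1, 2:] + img[2:, 2:])
          - (img[:-2, :-2] + 2.0 * img[1:-1, :-2] + img[2:, :-2]))
    gy = ((img[2:, :-2] + 2.0 * img[2:, 1:-1] + img[2:, 2:])
          - (img[:-2, :-2] + 2.0 * img[:-2, 1:-1] + img[:-2, 2:]))
    return float((gx ** 2 + gy ** 2).sum())

def brenner(img):
    """Sum of squared gray-level differences two pixels apart (horizontal)."""
    return float(((img[:, 2:] - img[:, :-2]) ** 2).sum())

def smd2(img):
    """Squared sum of absolute horizontal/vertical second differences."""
    dx = img[1:-1, 2:] - 2.0 * img[1:-1, 1:-1] + img[1:-1, :-2]
    dy = img[2:, 1:-1] - 2.0 * img[1:-1, 1:-1] + img[:-2, 1:-1]
    return float(((np.abs(dx) + np.abs(dy)) ** 2).sum())
```

A flat image yields zero on all four metrics, and a linear intensity ramp zeroes the second-difference metrics while exciting the first-difference ones, which is a quick consistency check before applying them to grayscale frames.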
As shown in the box plots in Figure 8, images with sky backgrounds exhibit the lowest median values across all four clarity metrics. This indicates that the texture complexity of sky backgrounds is generally low, and there are virtually no motion-induced artifacts.
In contrast, images with green cropland and woodland backgrounds show higher median values and a wider distribution range. This discrepancy is primarily caused by two factors. On one hand, dense vegetation introduces abundant high-frequency texture details, which elevates the clarity scores computed from gradient measures; on the other hand, to maintain operational efficiency, UAVs fly at high speed over croplands, which exacerbates motion blur, blurring edges and thus interfering with clarity assessment.
The scatter plot matrix in Figure 9 employs cross-metric correlation analysis to provide supplementary insights for data validation. In the figure, color coding is used to cluster different backgrounds, clearly revealing that the background categories are spatially separated. Specifically, points representing sky backgrounds (brown) cluster in the lower-left quadrant across all pairwise plots, indicating consistently low-complexity characteristics. In contrast, points corresponding to green cropland (orange) and woodland (dark blue) are widely dispersed in the upper-right region with larger variance. Notably, all pairs of metrics exhibit strong positive correlations, confirming that these indicators converge when evaluating image quality degradation. However, the persistent clustering among backgrounds also demonstrates that texture complexity introduces systematic bias beyond the effects of motion blur.
This finding is crucial for interpreting performance differences observed in subsequent experiments. Compared with the high signal-to-noise-ratio sky scenarios, algorithms lacking adaptive background suppression mechanisms are likely to exhibit higher false-positive rates in green cropland and woodland scenes. The heterogeneity in image quality documented herein can serve as a diagnostic baseline in benchmark evaluations. This enables the assessment of whether an algorithm possesses adaptive background suppression capabilities.
3.4. Experimental Results and Analysis
Building on the dataset construction and quality assessment described above, this section focuses on benchmarking four mainstream object detection algorithms to analyze how agriculture-specific challenges affect detection performance in practice. To ensure fair model comparisons and to mitigate overfitting risks on background categories with relatively fewer samples, we apply a unified training configuration to all benchmarked models. We split the dataset into training, validation, and test sets at a ratio of 7:2:1, while maintaining consistent distributions of background categories and operation state labels across the subsets. We train all models using the Adam optimizer with an initial learning rate of 0.00001. When the validation loss shows no improvement for 10 consecutive epochs, we trigger cosine annealing decay. We set the batch size to 16 or 8 depending on each model’s GPU memory consumption. The maximum number of training epochs is 150, and we enable early stopping to prevent overfitting. Weight decay is enabled by default for the YOLO series models, while for Faster R-CNN, we introduce Dropout in the fully connected layers.
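The 7:2:1 split with consistent background and state distributions amounts to stratified sampling over (background, state) strata. The sketch below is a minimal, library-free illustration of this step (record layout and function name are our own; a production pipeline would typically use an existing stratified-split utility):

```python
import random
from collections import defaultdict

def stratified_split(samples, ratios=(0.7, 0.2, 0.1), seed=42):
    """Split (frame_id, background, state) records 7:2:1 while keeping
    each (background, state) stratum's proportions consistent across
    the training, validation, and test subsets."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s[1], s[2])].append(s)       # stratify by (background, state)
    rng = random.Random(seed)                # fixed seed for reproducibility
    train, val, test = [], [], []
    for _, items in sorted(groups.items()):
        rng.shuffle(items)
        n = len(items)
        n_tr = round(n * ratios[0])
        n_va = round(n * ratios[1])
        train += items[:n_tr]
        val += items[n_tr:n_tr + n_va]
        test += items[n_tr + n_va:]
    return train, val, test
```

Because the split is performed within each stratum, a minority background such as bare farmland keeps roughly the same share in all three subsets, which is what makes per-background test metrics comparable across models.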
The test results in Figure 10 show that Faster R-CNN performs poorly under the green cropland and bare farmland backgrounds, where precision drops markedly to 0.61 and 0.55, respectively. In sharp contrast, YOLOv5n maintains high precision in the same scenarios, reaching 0.98 and 0.96. Further analysis indicates that this gap is not caused by differences in the number of model parameters. Instead, it stems from inherent limitations of the two-stage detection pipeline in Faster R-CNN. Specifically, when processing complex leaf vein textures, the region proposal network tends to misinterpret high-frequency edge responses as foreground anchors. This misjudgment generates a large number of spurious candidate boxes during the proposal selection stage. These erroneous proposals are then incorrectly activated by the downstream classifier because it lacks sufficient constraints from global contextual information, which ultimately leads to a systematic and substantial degradation in precision. This phenomenon directly corresponds to the image quality analysis in Figure 8 and Figure 9, where green cropland exhibits high Tenengrad gradient response values. The results confirm that dense vegetation textures pose a fundamental challenge to region-proposal-based detection methods.
Figure 10 clearly shows that the three YOLO models achieve precision above the 0.90 reference line in most backgrounds. Among them, YOLOv5n performs particularly well. It reaches a peak precision of 0.99 under the sky background and maintains a high precision of 0.98 under the orchard background. This advantage is further confirmed by the overall evaluation in Table 6. YOLOv5n achieves the highest overall precision of 97.86% among all tested models, and its mAP@50 reaches 98.30%. These results indicate that single-stage detectors offer structural advantages when handling small targets and complex backgrounds.
However, one phenomenon deserves attention. Although YOLOv8l has a deeper network and a larger parameter scale, its precision is only 95.63%, which is 2.23 percentage points lower than that of YOLOv5n. This result suggests that, in agricultural scenarios, excess model capacity may increase the risk of overfitting. When the background distribution of the training samples exhibits long tail characteristics, deeper networks may memorize texture details from dominant categories, which can lead to degraded generalization in minority scenarios such as bare farmland.
Turning to recall analysis by background category: in agricultural spraying monitoring applications, the cost of missing spraying targets is usually much higher than that of falsely reporting non-spraying targets. A missed detection means that illegal spraying behavior cannot be recorded or traced, potentially causing pesticide abuse, environmental pollution, and supervision failure. Therefore, Table 7 further reports the recall of each model by background category and summarizes the corresponding values to evaluate the completeness with which each model captures spraying behavior under different background conditions.
As shown in
Table 7, YOLOv5n maintains a recall above 0.95 in most backgrounds and is particularly stable under the sky and green cropland backgrounds, indicating strong capability in capturing spraying targets. In contrast, Faster R-CNN exhibits a notably low recall under the mountainous terrain background, suggesting that it is prone to missed detections in scenarios with complex terrain and severe occlusion. This phenomenon relates to a mechanism-level limitation: under complex textured backgrounds, it tends to generate many redundant proposals, which can cause true targets to be mistakenly removed by non-maximum suppression.
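The NMS failure mode described above can be illustrated with a minimal greedy-NMS sketch. The boxes and scores below are hypothetical: a true spraying target receives a moderate score, while an overlapping texture-driven spurious proposal scores higher, so standard NMS keeps the spurious box and suppresses the true one.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thr=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order):
        i = order[0]
        keep.append(int(i))
        mask = np.array([iou(boxes[i], boxes[j]) < thr for j in order[1:]],
                        dtype=bool)
        order = order[1:][mask]
    return keep

true_box = [10, 10, 50, 50]      # actual spraying target, score 0.6
spurious = [12, 12, 52, 52]      # texture-induced proposal, score 0.8
kept = nms([true_box, spurious], np.array([0.6, 0.8]))
assert kept == [1]               # the true target (index 0) is suppressed
```

This is the mechanism-level point: redundant high-confidence proposals do not merely waste computation, they can actively evict correct detections during suppression.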
It is noteworthy that all models exhibit recall degradation to varying degrees under the mountainous terrain and woodland backgrounds. For mountainous terrain, the recall of YOLOv8n decreases to 0.885, while that of Faster R-CNN drops to 0.644. Under the woodland background, the recall of YOLOv8n is 0.878, which is lower than that under relatively simple backgrounds such as sky and green cropland. This trend relates to irregular occlusions, low illumination, and perspective distortions induced by steep terrain, which further weaken small target features and increase the risk of missed detections. For practical deployment in supervision systems, we recommend moderately lowering the confidence threshold or introducing cost-sensitive learning strategies in high-risk backgrounds such as mountainous terrain and woodland, trading a small loss in precision for higher recall to reduce missed detection risk.
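The threshold-lowering recommendation above amounts to moving along the precision-recall trade-off curve. The sketch below uses invented confidence scores (true targets in occluded mountainous scenes are assumed to score low) to show how lowering the operating threshold recovers missed detections at a modest precision cost; it is illustrative, not drawn from the reported experiments.

```python
import numpy as np

def precision_recall(scores, labels, thr):
    """Precision/recall at a confidence threshold.
    labels: 1 = true spraying target, 0 = false alarm."""
    pred = scores >= thr
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return float(prec), float(rec)

# Hypothetical detections: occluded true targets carry low confidence.
scores = np.array([0.95, 0.90, 0.60, 0.40, 0.35, 0.30])
labels = np.array([1,    1,    1,    1,    0,    1])

for thr in (0.5, 0.3):
    p, r = precision_recall(scores, labels, thr)
    print(f"thr={thr}: precision={p:.2f}, recall={r:.2f}")
```

At the lower threshold, the two low-confidence true targets are recovered while a single false alarm is admitted, which is the intended trade for supervision scenarios where missed detections are the costlier error.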
Another noteworthy phenomenon is that YOLOv8l achieves a recall of 0.992 under the green cropland background, the highest among all models, but drops to 0.937 under the woodland background. This indicates that deeper networks can better exploit their capacity advantages in high-quality, large-sample backgrounds, but their generalization is limited in small-sample, highly occluded backgrounds. This observation further supports our earlier analysis that overparameterization may introduce overfitting risks under long-tailed background distributions.
From the perspective of background categories, the sky background yields the best performance for all models. Its clean color space and simple texture pattern create highly favorable conditions for separating targets from the background. This finding also aligns with, and corroborates, the earlier observation that the sky category exhibits the lowest Laplacian variance. However, even under the relatively simplified sky background, performance differences across models remain evident. Faster R-CNN achieves a precision of only 0.92, which is substantially lower than the 0.99 of YOLOv5n. This result indicates that the region proposal mechanism in two-stage architectures can introduce redundant computation and accumulate errors even in low-noise environments.
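The Laplacian-variance statistic invoked above is straightforward to compute; the sketch below uses a 4-neighbour Laplacian in plain numpy, with synthetic sky-like and foliage-like patches standing in for the dataset's background categories.

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 4-neighbour Laplacian response (texture/focus measure)."""
    lap = (-4.0 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())

rng = np.random.default_rng(0)
sky_like = np.full((64, 64), 200.0) + rng.normal(0, 1, (64, 64))  # smooth
foliage_like = rng.random((64, 64)) * 255.0                       # textured
assert laplacian_variance(sky_like) < laplacian_variance(foliage_like)
```

A low value, as reported for the sky category, signals little second-order intensity variation and hence an easier figure-ground separation, consistent with the performance ranking discussed here.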
In contrast, the bare farmland background poses a major challenge for all models. Faster R-CNN reaches its lowest performance, while YOLOv8n remains relatively robust, but its precision still drops to 0.85. Our analysis suggests a possible explanation: the yellow-brown tones of exposed soil partially overlap with the reflective spectrum of the UAV body, and bare farmland does not provide the strong color contrast advantage observed in green cropland. As a result, feature extractors based on red–green–blue (RGB) channels may struggle to learn sufficiently discriminative representations, thereby degrading detection performance under this background.
Under the woodland background, YOLOv5n and YOLOv8n show a clear divergence. The precision of YOLOv5n is 0.96, whereas YOLOv8n achieves only 0.88, a gap of 8 percentage points. This difference can be attributed to edge fragmentation caused by randomly distributed branch and leaf occlusions, which affects networks of different depths in different ways. Shallower networks tend to rely more on local strong gradient cues, whereas deeper networks rely more on global semantic consistency. When occlusions leave the target contour incomplete, long-range dependency mechanisms in deeper networks may instead introduce more spurious activations. This finding provides practical evidence for incorporating attention mechanisms or explicit partial occlusion modeling in future algorithms. It also confirms the necessity of retaining motion blur samples in the dataset: in real agricultural monitoring, image quality degradation is not simply removable noise, but an inherent characteristic that algorithms must learn to handle.
Overall, the experimental results confirm that our dataset effectively captures the core technical challenges encountered in agricultural UAV monitoring and provides a reliable basis for subsequent algorithm analysis and optimization. The systematic recall-based analysis further reveals how missed detection risks distribute across backgrounds, offering data support for adopting differentiated detection strategies for high-risk backgrounds in real-world supervision deployments.