3.1. Research Design and Positioning
This study focuses on dataset construction and benchmark evaluation for precision agriculture supervision applications. Specifically, it targets the data resources and reproducible experimental baselines required for agricultural plant protection UAV spraying operation state recognition, rather than the engineering development process of a specific platform system. In terms of research depth and scope, our work is positioned at the application and engineering support level within the research pyramid. The goal is to provide a publicly reusable dataset, a clear annotation paradigm, and reproducible performance baselines to facilitate subsequent algorithmic research and system deployment.
From the perspective of research type and methodology, this paper constitutes an application-oriented quantitative empirical study. We adopt a data-driven paradigm to build the dataset through standardized procedures for data collection, cleaning, and annotation, and we conduct objective evaluations using a unified metric system and multi-model comparative experiments. With respect to research objectives, this work pursues the following three main goals: first, to describe and formalize the key data attributes required by agricultural spraying supervision tasks; second, to establish performance baselines of mainstream object detection models for this task; and third, to provide diagnostic insights for subsequent robustness improvements through cross-background comparisons and quality assessment analyses. The primary reasoning approach in this study is inductive inference.
Regarding data sources and the temporal dimension, our data are derived from publicly available online video materials collected from multiple platforms. We retrieve raw videos by keyword-based searches and maximize diversity in scenarios and imaging conditions without restricting geographic regions or time periods. We then perform frame extraction, screening, and verification to form the final data resource. In terms of research design, we follow a sequential workflow comprising demand and gap analysis, dataset design rationale, data acquisition and quality control, annotation scheme design, data statistics and quality assessment, and benchmark evaluation. We explicitly document reproducible technical details at each stage.
3.1.1. Dataset Design Rationale
In agricultural spraying supervision scenarios, the key challenges for visual recognition mainly include substantial target scale variation, the frequent occurrence of motion blur during imaging, and complex agricultural backgrounds. These factors directly affect the visibility of cues for spraying state identification and the model’s ability to generalize across scenarios. Therefore, we treat these challenges as design constraints and sampling criteria for the dataset, and we summarize them in Figure 1.
1. Large target scale variation: During agricultural UAV operations, the shooting distance can range from a few meters to several hundred meters, resulting in highly uneven pixel occupancy of the target in images [13,14,15]. Small object detection has long been a challenging problem in computer vision, and it becomes even more pronounced in agricultural scenarios. Conventional feature extraction methods often fail when dealing with extremely small targets, which motivates the need for dedicated datasets to train and optimize algorithms;
2. Presence of motion blur: When performing spraying operations, UAVs typically maintain high flight speeds. Together with wind disturbances in farmland environments and camera shake, the collected images commonly exhibit motion blur [16,17,18,19]. This blur not only degrades bounding box localization accuracy but, more importantly, interferes with correct spraying state recognition, because the spray plume appearance itself is a key visual cue for determining the operational state;
3. Complex and diverse background environments: Agricultural production environments include green cropland, bare farmland, orchard, woodland, mountainous terrain, and other scenarios, with substantial differences in illumination conditions, texture characteristics, and color distributions across backgrounds. Complex backgrounds not only increase the difficulty of object detection but, more critically, affect model generalization ability [20,21,22]. A model trained in a single scenario often struggles to adapt to diverse agricultural environments.
Based on these considerations, we adopt a dataset design strategy that emphasizes multi-scale targets, multi-background coverage, and the preservation of real-world degradations. Meanwhile, the core supervision question addressed in this work is whether the UAV is performing effective spraying at a given moment. Therefore, we model the operational state as a binary variable with the following two classes: spraying and flying without spraying. This binary formulation offers two advantages. First, spray plumes and droplet trajectories constitute relatively stable and observable visual evidence in most public video viewpoints, which can substantially reduce annotation subjectivity and improve inter-annotator consistency. Second, finer-grained states, such as about to start spraying, just stopped spraying, loitering while waiting, or transitioning between fields, often lack reliable visible cues under long-distance imaging, small target sizes, and motion blur. Forcing such fine categorization would introduce high-noise labels and weaken the reproducibility of the dataset as a benchmark.
It should be noted that flying without spraying semantically covers all non-spraying flight phases, such as transiting, returning, repositioning, and waiting for refill. This definition aligns with the supervision task while leaving room for future versions to extend toward more fine-grained behavior labels within the binary framework.
3.1.2. Dataset Gap Analysis
To clarify the necessity of the proposed dataset for the task of agricultural spraying operation state recognition, we compare representative public UAV and aerial vision datasets with our dataset in terms of key attributes, as shown in Table 1. We further provide intuitive evidence using example images, as shown in Figure 2.
The comparison in Table 1 indicates that existing datasets differ in their emphasis on dataset scale, data sources, and multi-view capabilities. For instance, TartanAviation [10] and M3D [11] offer larger-scale data or cross-domain settings. Nevertheless, their annotation schemes still primarily target general detection and localization. Even when these datasets provide bounding boxes or multimodal information, they generally lack the operation state annotations required for agricultural spraying supervision. In other words, such datasets can effectively answer the questions of whether a UAV exists and where it is located, but they cannot directly address the supervision-oriented questions that matter in agriculture, such as whether the UAV is spraying and whether unauthorized or abnormal operations occur. This limitation at the data attribute level constrains both model training and reproducible benchmarking for operation state recognition.
Beyond annotation attributes, Table 1 also reveals differences in application scenario suitability. Existing datasets are mostly designed for security surveillance, airspace management, or aviation operations, and their background compositions systematically deviate from agricultural ecological backgrounds. As a result, the feature representations learned from these data do not necessarily transfer to farmland spraying scenarios.
Figure 2 further illustrates this mismatch from a visual perspective. For example, sample images in the AC dataset [8] tend to focus on backgrounds such as sky and airport runways, where textures are relatively simple and illumination is comparatively stable. Although the Real World dataset [9] includes a rural category, its rural backgrounds are still dominated by manmade structures such as buildings and roads. The examples from TartanAviation [10] and M3D [11] similarly reflect their emphasis on aviation operations or cross-domain micro aerial vehicle detection in terms of both backgrounds and task definitions. In contrast, agricultural spraying scenes often involve high-frequency crop canopy textures, occlusions by orchard branches and leaves, undulating mountainous terrain, and strong illumination variations. These factors can produce distractors that resemble local UAV appearances, substantially increasing the risk of false detections and imposing higher requirements on cross-scene generalization.
Based on the attribute comparisons in Table 1 and the intuitive evidence in Figure 2, the design necessity of our dataset can be summarized into three points, which directly motivate the subsequent sections. First, agricultural supervision requires elevating the recognition target from UAV presence to whether a spraying operational state occurs. Therefore, we must introduce two state semantic labels, spraying and flying without spraying, and annotate them jointly with bounding boxes, so that the dataset can directly support training and evaluation for operation state recognition. Second, to reduce model reliance on background-specific statistical patterns and to improve cross-scene generalization, data sampling must cover multiple representative agricultural ecological backgrounds. Accordingly, we conduct systematic sampling across green cropland, bare farmland, orchard, woodland, mountainous terrain, and sky. Third, because long-distance observation, small targets, and imaging degradations are common in agricultural operations, dataset construction should retain task-relevant degraded samples and report reproducible experimental baselines through unified benchmark evaluations. This approach provides a diagnostic baseline for subsequent robustness improvements.
3.2. Dataset Construction
To construct a dedicated dataset for agricultural UAV spraying behavior recognition, a systematic approach was adopted. As shown in Figure 3, the dataset construction process is divided into the following four sequential stages: multi-source data collection, quality-oriented preprocessing, data annotation, and data augmentation and evaluation.
3.2.1. Data Collection and Preliminary Processing
The dataset was obtained from multiple online platforms, and no restrictions were imposed on geographic regions or time periods during collection to maximize diversity. The primary search keywords during data collection included terms such as “agricultural UAV spraying”, “plant protection UAV operations”, and “pesticide spraying”.
As shown in Table 2, a total of 71 videos covering various agricultural scenarios were obtained, from which 240 sets of valid independent image sequences were extracted. The extracted frames retained their original resolution without standardization, deliberately preserving heterogeneity introduced by different devices and shooting conditions to ensure that the trained models can adapt to the equipment diversity of real-world agricultural monitoring systems. In the preprocessing stage, frames without UAV targets were first removed through manual screening. Frames with severe quality defects were subsequently excluded; however, blurred samples were deliberately retained, as shown in Figure 4b, because motion blur represents a critical feature of high-speed spraying operations rather than an artifact to be eliminated. Among the 65,838 frames extracted from the source videos, 9798 frames were retained after preliminary screening, and 9548 frames were ultimately annotated following verification and review.
3.2.2. Background Classification
As shown in Table 3, the dataset encompasses the following six types of ecological backgrounds: green cropland with 3197 images (33.48%), sky with 1931 images (20.22%), orchard with 1481 images (15.51%), mountainous terrain with 1321 images (13.84%), woodland with 907 images (9.50%), and bare farmland with 711 images (7.45%).
These backgrounds reflect the measured operational frequency in precision agriculture, as farmland scenes dominate spraying activities due to active pest–crop interactions during the growth period, as shown in Table 4. Green cropland backgrounds exhibit complex leaf vein textures and predominantly green tones, which differ from urban surveillance scenes and may introduce color-based interference. In contrast, sky backgrounds have relatively uniform colors with minimal interference, allowing for the separation of requirements between target detection and background suppression. In orchard scenarios, regular occlusion patterns occur; for example, periodically arranged tree canopies produce predictable shadow shapes. Woodland scenarios, however, involve irregular occlusion patterns, where randomly distributed branches and leaves are highly affected by illumination changes. Mountainous backgrounds feature steep slopes and terraced structures, which can induce perspective distortion and pose challenges for scale-invariant detection mechanisms.
This multi-background classification systematically samples the visual feature space of Chinese agricultural landscapes, encompassing both plain and hilly terrains. The low proportion of bare farmland (7.45%) aligns with the agricultural cycle, as exposed soil corresponds to fallow periods during which spraying operations are reduced. This ecological environment-based data imbalance brings training closer to real-world conditions, enabling the model to learn statistical weighting experience that allocates computational resources according to the frequency of different scene occurrences.
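The category proportions quoted from Table 3 can be reproduced directly from the raw counts. The short check below (plain Python; the dictionary keys are our own identifiers, not dataset directory names) verifies that the six categories sum to the 9548 annotated frames and that the percentages match:

```python
# Background category counts from Table 3 of the dataset.
counts = {
    "green_cropland": 3197,
    "sky": 1931,
    "orchard": 1481,
    "mountainous_terrain": 1321,
    "woodland": 907,
    "bare_farmland": 711,
}

total = sum(counts.values())  # total annotated frames
# Percentage share per category, rounded to two decimals as in Table 3.
shares = {k: round(100 * v / total, 2) for k, v in counts.items()}
```

Running this confirms the totals and the imbalance discussed in the text (green cropland at roughly one third, bare farmland under 8%).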
3.2.3. Data Annotation
The annotation strategy employs a binary classification framework distinguishing “spraying” and “flying without spraying” states, implemented according to visual standards. The “spraying” label is assigned only when visible motion traces of UAV nozzle droplets or atomized particles appear in the frame, while accommodating variations in spray visibility caused by illumination and viewing angles.
Figure 4 presents typical samples, as follows: (a) small target samples observed from a long-distance aerial perspective; (b) motion-blurred samples resulting from relative movement between the camera and UAV; and (c) complex environmental samples across six background categories. All samples are simultaneously annotated with rectangular bounding box localization and semantic state labels.
To ensure consistent decisions among different annotators under occlusion and near the start or end of a sequence, we define unified rules for spray evidence visibility and transition frames as follows:
1. Partial visibility of spray: If the spray is visible only in part of the image, or appears as a weak yet still recognizable spray texture and trajectory, the frame is still annotated as spraying;
2. Occluded nozzle: When the UAV body, including the nozzle area, is partially occluded by branches, leaves, or facilities, the annotation follows the visible spray evidence. If the spray plume remains visible, the frame is annotated as spraying. If neither spray evidence nor the nozzle area can be reliably recognized, the frame is annotated as flying without spraying;
3. Transition frames at start and stop: We do not introduce an additional transition state category. Instead, we assign frame-level binary labels. For the same image sequence, the first frame in which recognizable spray evidence appears is defined as the spraying start frame. The first frame in which spray evidence no longer appears is defined as the first frame after spraying stops. This rule avoids subjective inference based on invisible factors, such as valve opening or closing, and improves annotation reproducibility;
4. Handling uncertain frames: If severe occlusion, exposure issues, or extreme blur prevent reliable identification of spray evidence, we label the frame as an uncertain sample during the review stage and exclude it from the final annotated set. This strategy controls the impact of label noise on model learning and on the determination of sequence transitions.
As shown in Table 2, the final dataset contains 5687 bounding boxes labeled as “spraying” and 4027 labeled as “flying without spraying,” with a ratio of approximately 1.41:1. This nearly balanced distribution offers dual advantages. On one hand, it effectively mitigates the common class imbalance problem in supervised learning; on the other hand, it realistically reflects UAV operational conditions, where substantial flight time is consumed during non-spraying activities such as transit, repositioning, and pesticide refilling. Moreover, this dual-state annotation directly addresses the deficiencies in operational state recognition identified in the existing datasets of Table 1, elevating the detection task from mere spatial localization to the higher-level perception and recognition of UAV operational behaviors.
To ensure the authenticity and reliability of the annotated data, the preliminary annotations were subjected to random sampling review, and any inconsistent samples identified during the review were re-annotated across their entire sequences.
3.2.4. Data Augmentation and Dataset Structure
To enhance the generalization capability of the model, geometric augmentation techniques were applied after annotation. Specifically, rotation was used to simulate UAV tilted flight states. It is noteworthy that Gaussian blur or chromatic distortion augmentation methods were deliberately avoided. This is because their use could confound the inherent motion blur and illumination variations present in the real samples. During augmentation, coordinate transformation matrices were used to preserve annotation integrity, ensuring that bounding boxes remained accurately aligned with geometrically transformed targets.
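The coordinate-transformation step can be illustrated with a minimal NumPy sketch (the function name is ours, and the axis-aligned envelope of the rotated corners is one common convention; the actual pipeline may differ in details such as image-border clipping):

```python
import numpy as np

def rotate_bbox(box, angle_deg, center):
    """Rotate an axis-aligned box (x1, y1, x2, y2) about `center` by
    `angle_deg` and return the axis-aligned envelope of the rotated
    corners, so the label stays aligned with the transformed target."""
    x1, y1, x2, y2 = box
    corners = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]], dtype=float)
    t = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(t), -np.sin(t)],
                    [np.sin(t),  np.cos(t)]])
    # Rotate corners about the chosen center, then take the tight envelope.
    rotated = (corners - center) @ rot.T + np.asarray(center, dtype=float)
    xs, ys = rotated[:, 0], rotated[:, 1]
    return float(xs.min()), float(ys.min()), float(xs.max()), float(ys.max())
```

Rotating a square box by 90° about its own center leaves it unchanged, which is a convenient sanity check for the transformation matrix.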
Figure 5 illustrates the three-level directory structure of the dataset. The top-level directory is named “Drone_Spraying_Dataset”, which contains the following two subdirectories: “Raw_Data” and “Annotations”. The “Raw_Data” directory stores the original video frames organized by background category; for example, frames corresponding to the “Green_Cropland” background are placed in the respective folder. The “Annotations” directory contains label files in JSON and TXT formats that comply with the field specifications listed in Table 5.
Table 5 provides a detailed and comprehensive specification of the annotation format. It covers image metadata, including file paths, image dimensions, and Base64 pixel encoding; object attributes, including bounding box coordinates, UAV state labels, and shape descriptors; and global parameters such as the version of the annotation tool and the default color scheme. The annotation files are provided in JSON and TXT formats to ensure cross-platform compatibility and seamless integration with common frameworks such as PyTorch 2.9.0 and TensorFlow v2.16.1. A key innovation of this annotation scheme is the inclusion of Base64 image encoding within the annotation file, creating a self-contained label document. Consequently, there is no dependency on external image directories, which simplifies dataset distribution and version control.
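A self-contained label record of this kind can be assembled as sketched below. The field names here follow common LabelMe-style conventions and are illustrative only; the authoritative schema is the one specified in Table 5:

```python
import base64
import json

def build_annotation(image_path, image_bytes, width, height, boxes):
    """Assemble a self-contained annotation record. `boxes` is a list of
    (label, x1, y1, x2, y2) tuples. Embedding Base64-encoded pixels in
    `imageData` removes the dependency on an external image directory."""
    return {
        "version": "5.x",                 # annotation-tool version (placeholder)
        "imagePath": image_path,
        "imageWidth": width,
        "imageHeight": height,
        # Base64 encoding makes the JSON file a complete, portable document.
        "imageData": base64.b64encode(image_bytes).decode("ascii"),
        "shapes": [
            {"label": label,
             "shape_type": "rectangle",
             "points": [[x1, y1], [x2, y2]]}
            for label, x1, y1, x2, y2 in boxes
        ],
    }
```

Because the record is plain JSON, it round-trips through `json.dumps`/`json.loads` unchanged, which is what makes distribution and version control straightforward.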
In summary, the dataset construction methodology integrates ecological sampling strategies, rigorous quality control, and behavior-aware annotation to create a dedicated resource tailored for UAV monitoring in agriculture. The final 9548 annotated frames encompass six background types and dual operational states, accompanied by complete metadata encoding, providing a solid foundation for developing precision agriculture monitoring systems capable of real-time spraying state recognition and temporal compliance assessment.
3.3. Dataset Evaluation
In this section, the quality and diversity of the agricultural UAV spraying behavior recognition dataset are systematically evaluated through quantitative analysis of target scale distributions and image quality metrics across different background categories. This evaluation aims to verify whether the dataset adequately addresses the following three major challenges highlighted in the introduction: large variations in target scale, motion blur, and complex background environments.
As shown in Figure 6, agricultural UAV monitoring scenarios exhibit pronounced extreme scale variation characteristics. The figure compares visual differences under identical background conditions at varying observation distances. In the close-range case in Figure 6a, UAV targets occupy bounding boxes of 102 × 128 pixels, representing 5.67% of the total image area; in contrast, as shown in Figure 6b, the targets occupy only 37 × 38 pixels, corresponding to 0.61% of the total image area. Within the same background category, the area ratio between near and far targets differs by a factor of up to 9.3. This confirms that the dataset can simulate the perceptual requirements of fixed monitoring equipment in actual spraying operations, where continuous target detection is necessary throughout the operational trajectory from close-range approach to distant observation. However, such significant scale differences pose challenges to conventional feature extraction pipelines. Near-range UAV targets retain sufficient structural details. Conversely, distant targets degrade into approximately point-like objects, causing feature extraction to become ineffective and preventing accurate target recognition.
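The quoted area percentages and the 9.3× ratio follow from simple arithmetic. The sketch below reproduces them assuming, for illustration only, a 640 × 360 frame (the frame resolution is our assumption and is not stated in the text):

```python
def area_share(box_w, box_h, img_w, img_h):
    """Percentage of the frame area occupied by a box_w x box_h bounding box."""
    return 100.0 * (box_w * box_h) / (img_w * img_h)

# Illustrative frame size (assumed): 640 x 360 pixels.
near = area_share(102, 128, 640, 360)  # close-range target from Figure 6a
far = area_share(37, 38, 640, 360)     # distant target from Figure 6b
ratio = near / far                     # near-to-far area ratio
```

Under this assumed resolution, `near` rounds to 5.67%, `far` to 0.61%, and their ratio to 9.3, matching the figures above.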
Figure 6 presents only one set of examples. Building on this, Figure 7 combines the target area distribution statistics with the two-dimensional width and height distribution to reveal the small object dominance and long tail characteristics of the dataset from both one-dimensional and two-dimensional perspectives.

Figure 7a shows the frequency distribution of annotated bounding box areas across the entire dataset. The overall distribution exhibits a pronounced long tail. A large number of instances cluster in the small area range, and the frequency decays rapidly as the area increases, with only a few extremely large targets forming the right tail. Moreover, the distribution indicates that targets with an area smaller than 10,000 pixels² constitute the majority, whereas the proportion of targets larger than 100,000 pixels² is very low. This suggests that most detections occur under medium-to-long-range viewing conditions, where UAVs typically appear as small-scale targets in the image.

Figure 7b further characterizes the target size distribution in the two-dimensional space of width and height. The color encodes the logarithmic scale of target density to prevent high-frequency small targets from masking information from low-frequency large targets. The heatmap shows that the high-density region is mainly concentrated in the lower left corner, namely the range where both width and height are small, and gradually spreads toward the upper right along an approximately positive correlation between width and height. This pattern indicates continuous scale variation across different shooting distances rather than concentration at a single scale. Meanwhile, only a small number of sparse samples appear in the upper-right region, corresponding to larger targets observed under close-range or high-resolution conditions, which is consistent with the long tail phenomenon in Figure 7a.
These scale distribution characteristics directly affect algorithm evaluation. Because most instances are small targets with substantial cross-scale variation, overall model performance depends largely on small object feature extraction and multi-scale representation. If a detection framework lacks effective multi-scale feature fusion or is insensitive to small targets, performance degradation typically first emerges in the small target region, which accounts for the largest proportion of instances.
To investigate how the visual complexity of this dataset varies across different scenarios, we analyze full-frame images. Specifically, we first convert each original image to grayscale and then compute the following four sharpness metrics over the full resolution frame: Laplacian variance, Tenengrad, Brenner, and SMD2. This setup does not rely on cropping based on UAV bounding boxes. We compute these metrics at the full-frame level because this section focuses on differences in overall visual complexity and imaging quality across background categories, including the combined effects of texture complexity and potential motion blur. Consequently, the results better reflect the degree to which backgrounds interfere with detection in real monitoring footage. Therefore, the metric distributions shown in Figure 8 and Figure 9 can be interpreted as differences in overall image-level sharpness and edge response strength across background categories, where larger values typically indicate stronger edge or gradient responses. The specific formulas are given as follows:
Laplacian Variance computes the variance of the response values obtained by applying a discrete Laplacian operator to the image, reflecting the overall edge sharpness:

$$\mathrm{LAP} = \frac{1}{MN}\sum_{x=1}^{M}\sum_{y=1}^{N}\left[L(x,y)-\mu_L\right]^2$$

Here, $L(x,y)$ represents the result of the Laplacian filtering, $\mu_L$ denotes its mean, and $M \times N$ corresponds to the image dimensions;
The Tenengrad gradient function computes the gradient magnitude energy based on the Sobel operator, emphasizing local contrast:

$$\mathrm{TEN} = \sum_{x}\sum_{y}\left[G_x(x,y)^2 + G_y(x,y)^2\right]$$

Here, $G_x(x,y)$ and $G_y(x,y)$ represent the Sobel gradients in the horizontal and vertical directions, respectively;
The Brenner gradient function calculates the sum of squared gray-level differences between pixels separated by two positions, and is sensitive to blur in the horizontal direction:

$$\mathrm{BRE} = \sum_{x}\sum_{y}\left[I(x+2,y)-I(x,y)\right]^2$$
The SMD2 metric constructs enhanced Laplacian (second-difference) responses in the horizontal and vertical directions separately, and then sums the square of the sum of their absolute values:

$$\mathrm{SMD2} = \sum_{x}\sum_{y}\left[\left|L_x(x,y)\right| + \left|L_y(x,y)\right|\right]^2$$

where $L_x(x,y) = I(x+1,y) - 2I(x,y) + I(x-1,y)$ and $L_y(x,y) = I(x,y+1) - 2I(x,y) + I(x,y-1)$.
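For reproducibility, the four metrics can be sketched in NumPy alone. Kernel choices and border handling below are our assumptions and may differ from the exact computation used in the paper; in particular, SMD2 follows the second-difference reading of the verbal description above:

```python
import numpy as np

def laplacian_variance(img):
    """Variance of the 3x3 discrete Laplacian response (valid region only)."""
    lap = (img[:-2, 1:-1] + img[2:, 1:-1] + img[1:-1, :-2]
           + img[1:-1, 2:] - 4.0 * img[1:-1, 1:-1])
    return float(lap.var())

def tenengrad(img):
    """Sum of squared Sobel gradient magnitudes over the valid region."""
    gx = ((img[:-2, 2:] + 2.0 * img[1:-1, 2:] + img[2:, 2:])
          - (img[:-2, :-2] + 2.0 * img[1:-1, :-2] + img[2:, :-2]))
    gy = ((img[2:, :-2] + 2.0 * img[2:, 1:-1] + img[2:, 2:])
          - (img[:-2, :-2] + 2.0 * img[:-2, 1:-1] + img[:-2, 2:]))
    return float((gx ** 2 + gy ** 2).sum())

def brenner(img):
    """Sum of squared gray-level differences two pixels apart (horizontal)."""
    return float(((img[:, 2:] - img[:, :-2]) ** 2).sum())

def smd2(img):
    """Squared sum of absolute horizontal/vertical second differences."""
    dx = img[1:-1, 2:] - 2.0 * img[1:-1, 1:-1] + img[1:-1, :-2]
    dy = img[2:, 1:-1] - 2.0 * img[1:-1, 1:-1] + img[:-2, 1:-1]
    return float(((np.abs(dx) + np.abs(dy)) ** 2).sum())
```

A flat image yields zero on all four metrics, and a linear intensity ramp zeroes the second-difference metrics while exciting the first-difference ones, which is a quick consistency check before applying them to grayscale frames.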
As shown in the box plots in Figure 8, images with sky backgrounds exhibit the lowest median values across all four clarity metrics. This indicates that the texture complexity of sky backgrounds is generally low, and there are virtually no motion-induced artifacts.
In contrast, images with green cropland and woodland backgrounds show higher median values and a wider distribution range. This discrepancy is primarily caused by two factors. On one hand, dense vegetation introduces abundant high-frequency texture details, which elevates the clarity scores computed from gradient measures; on the other hand, to maintain operational efficiency, UAVs fly at high speed over croplands, which exacerbates motion blur, blurring edges and thus interfering with clarity assessment.
The scatter plot matrix in Figure 9 employs cross-metric correlation analysis to provide supplementary insights for data validation. In the figure, color coding is used to cluster different backgrounds, clearly revealing that the background categories are spatially separated. Specifically, points representing sky backgrounds (brown) cluster in the lower-left quadrant across all pairwise plots, indicating consistently low-complexity characteristics. In contrast, points corresponding to green cropland (orange) and woodland (dark blue) are widely dispersed in the upper-right region with larger variance. Notably, all pairs of metrics exhibit strong positive correlations, confirming that these indicators converge when evaluating image quality degradation. However, the persistent clustering among backgrounds also demonstrates that texture complexity introduces systematic bias beyond the effects of motion blur.
This finding is crucial for interpreting performance differences observed in subsequent experiments. Compared with the high signal-to-noise-ratio sky scenarios, algorithms lacking adaptive background suppression mechanisms are likely to exhibit higher false-positive rates in green cropland and woodland scenes. The heterogeneity in image quality documented herein can serve as a diagnostic baseline in benchmark evaluations. This enables the assessment of whether an algorithm possesses adaptive background suppression capabilities.
3.4. Experimental Results and Analysis
Building on the dataset construction and quality assessment described above, this section focuses on benchmarking four mainstream object detection algorithms to analyze how agriculture-specific challenges affect detection performance in practice. To ensure fair model comparisons and to mitigate overfitting risks on background categories with relatively fewer samples, we apply a unified training configuration to all benchmarked models. We split the dataset into training, validation, and test sets at a ratio of 7:2:1, while maintaining consistent distributions of background categories and operation state labels across the subsets. We train all models using the Adam optimizer with an initial learning rate of 0.00001. When the validation loss shows no improvement for 10 consecutive epochs, we trigger cosine annealing decay. We set the batch size to 16 or 8 depending on each model’s GPU memory consumption. The maximum number of training epochs is 150, and we enable early stopping to prevent overfitting. Weight decay is enabled by default for the YOLO series models, while for Faster R-CNN, we introduce Dropout in the fully connected layers.
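The 7:2:1 split with consistent background and state distributions amounts to stratified sampling over (background, state) strata. The sketch below is a minimal, library-free illustration of this step (record layout and function name are our own; a production pipeline would typically use an existing stratified-split utility):

```python
import random
from collections import defaultdict

def stratified_split(samples, ratios=(0.7, 0.2, 0.1), seed=42):
    """Split (frame_id, background, state) records 7:2:1 while keeping
    each (background, state) stratum's proportions consistent across
    the training, validation, and test subsets."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s[1], s[2])].append(s)       # stratify by (background, state)
    rng = random.Random(seed)                # fixed seed for reproducibility
    train, val, test = [], [], []
    for _, items in sorted(groups.items()):
        rng.shuffle(items)
        n = len(items)
        n_tr = round(n * ratios[0])
        n_va = round(n * ratios[1])
        train += items[:n_tr]
        val += items[n_tr:n_tr + n_va]
        test += items[n_tr + n_va:]
    return train, val, test
```

Because the split is performed within each stratum, a minority background such as bare farmland keeps roughly the same share in all three subsets, which is what makes per-background test metrics comparable across models.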
The test results in Figure 10 show that Faster R-CNN performs poorly under the green cropland and bare farmland backgrounds, where precision drops markedly to 0.61 and 0.55, respectively. In sharp contrast, YOLOv5n maintains high precision in the same scenarios, reaching 0.98 and 0.96. Further analysis indicates that this gap is not caused by differences in the number of model parameters. Instead, it stems from inherent limitations of the two-stage detection pipeline in Faster R-CNN. Specifically, when processing complex leaf vein textures, the region proposal network tends to misinterpret high-frequency edge responses as foreground anchors. This misjudgment generates a large number of spurious candidate boxes during the proposal selection stage. These erroneous proposals are then incorrectly activated by the downstream classifier because it lacks sufficient constraints from global contextual information, which ultimately leads to a systematic and substantial degradation in precision. This phenomenon directly corresponds to the image quality analysis in Figure 8 and Figure 9, where green cropland exhibits high Tenengrad gradient response values. The results confirm that dense vegetation textures pose a fundamental challenge to region-proposal-based detection methods.
Figure 10 clearly shows that the three YOLO models achieve precision above the 0.90 reference line in most backgrounds. Among them, YOLOv5n performs particularly well. It reaches a peak precision of 0.99 under the sky background and maintains a high precision of 0.98 under the orchard background. This advantage is further confirmed by the overall evaluation in Table 6. YOLOv5n achieves the highest overall precision of 97.86% among all tested models, and its mAP@50 reaches 98.30%. These results indicate that single-stage detectors offer structural advantages when handling small targets and complex backgrounds.
However, one phenomenon deserves attention. Although YOLOv8l has a deeper network and a larger parameter scale, its precision is only 95.63%, which is 2.23 percentage points lower than that of YOLOv5n. This result suggests that, in agricultural scenarios, excess model capacity may increase the risk of overfitting. When the background distribution of the training samples exhibits long tail characteristics, deeper networks may memorize texture details from dominant categories, which can lead to degraded generalization in minority scenarios such as bare farmland.
Turning to recall analysis by background category: in agricultural spraying monitoring applications, the cost of missing spraying targets is usually much higher than that of falsely reporting non-spraying targets. A missed detection means that illegal spraying behavior cannot be recorded or traced, potentially causing pesticide abuse, environmental pollution, and supervision failure. Therefore, Table 7 further reports the recall of each model by background category and summarizes the corresponding values to evaluate the completeness with which each model captures spraying behavior under different background conditions.
As shown in
Table 7, YOLOv5n maintains a recall above 0.95 in most backgrounds and is particularly stable under the sky and green cropland backgrounds, indicating strong capability in capturing spraying targets. In contrast, Faster R-CNN exhibits a notably low recall under the mountainous terrain background, suggesting that it is prone to missed detections in scenarios with complex terrain and severe occlusion. This phenomenon relates to a mechanism-level limitation: under complex textured backgrounds, it tends to generate many redundant proposals, which can cause true targets to be mistakenly removed by non-maximum suppression.
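The NMS failure mode described above can be illustrated with a minimal greedy-NMS sketch. The boxes and scores below are hypothetical: a true spraying target receives a moderate score, while an overlapping texture-driven spurious proposal scores higher, so standard NMS keeps the spurious box and suppresses the true one.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thr=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order):
        i = order[0]
        keep.append(int(i))
        mask = np.array([iou(boxes[i], boxes[j]) < thr for j in order[1:]],
                        dtype=bool)
        order = order[1:][mask]
    return keep

true_box = [10, 10, 50, 50]      # actual spraying target, score 0.6
spurious = [12, 12, 52, 52]      # texture-induced proposal, score 0.8
kept = nms([true_box, spurious], np.array([0.6, 0.8]))
assert kept == [1]               # the true target (index 0) is suppressed
```

This is the mechanism-level point: redundant high-confidence proposals do not merely waste computation, they can actively evict correct detections during suppression.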
It is noteworthy that all models exhibit recall degradation to varying degrees under the mountainous terrain and woodland backgrounds. For mountainous terrain, the recall of YOLOv8n decreases to 0.885, while that of Faster R-CNN drops to 0.644. Under the woodland background, the recall of YOLOv8n is 0.878, which is lower than that under relatively simple backgrounds such as sky and green cropland. This trend relates to irregular occlusions, low illumination, and perspective distortions induced by steep terrain, which further weaken small target features and increase the risk of missed detections. For practical deployment in supervision systems, we recommend moderately lowering the confidence threshold or introducing cost-sensitive learning strategies in high-risk backgrounds such as mountainous terrain and woodland, trading a small loss in precision for higher recall to reduce missed detection risk.
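The threshold-lowering recommendation above amounts to moving along the precision-recall trade-off curve. The sketch below uses invented confidence scores (true targets in occluded mountainous scenes are assumed to score low) to show how lowering the operating threshold recovers missed detections at a modest precision cost; it is illustrative, not drawn from the reported experiments.

```python
import numpy as np

def precision_recall(scores, labels, thr):
    """Precision/recall at a confidence threshold.
    labels: 1 = true spraying target, 0 = false alarm."""
    pred = scores >= thr
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return float(prec), float(rec)

# Hypothetical detections: occluded true targets carry low confidence.
scores = np.array([0.95, 0.90, 0.60, 0.40, 0.35, 0.30])
labels = np.array([1,    1,    1,    1,    0,    1])

for thr in (0.5, 0.3):
    p, r = precision_recall(scores, labels, thr)
    print(f"thr={thr}: precision={p:.2f}, recall={r:.2f}")
```

At the lower threshold, the two low-confidence true targets are recovered while a single false alarm is admitted, which is the intended trade for supervision scenarios where missed detections are the costlier error.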
Another noteworthy phenomenon is that YOLOv8l achieves a recall of 0.992 under the green cropland background, the highest among all models, but drops to 0.937 under the woodland background. This indicates that deeper networks can better exploit their capacity advantages in high-quality, large-sample backgrounds, but their generalization is limited in small-sample, highly occluded backgrounds. This observation further supports our earlier analysis that overparameterization may introduce overfitting risks under long-tailed background distributions.
From the perspective of background categories, the sky background yields the best performance for all models. Its clean color space and simple texture pattern create highly favorable conditions for separating targets from the background. This finding also aligns with, and corroborates, the earlier observation that the sky category exhibits the lowest Laplacian variance. However, even under the relatively simplified sky background, performance differences across models remain evident. Faster R-CNN achieves a precision of only 0.92, which is substantially lower than the 0.99 of YOLOv5n. This result indicates that the region proposal mechanism in two-stage architectures can introduce redundant computation and accumulate errors even in low-noise environments.
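The Laplacian-variance statistic invoked above is straightforward to compute; the sketch below uses a 4-neighbour Laplacian in plain numpy, with synthetic sky-like and foliage-like patches standing in for the dataset's background categories.

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 4-neighbour Laplacian response (texture/focus measure)."""
    lap = (-4.0 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())

rng = np.random.default_rng(0)
sky_like = np.full((64, 64), 200.0) + rng.normal(0, 1, (64, 64))  # smooth
foliage_like = rng.random((64, 64)) * 255.0                       # textured
assert laplacian_variance(sky_like) < laplacian_variance(foliage_like)
```

A low value, as reported for the sky category, signals little second-order intensity variation and hence an easier figure-ground separation, consistent with the performance ranking discussed here.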
In contrast, the bare farmland background poses a major challenge for all models. Faster R-CNN reaches its lowest performance, while YOLOv8n remains relatively robust, but its precision still drops to 0.85. Our analysis suggests a possible explanation: the yellow-brown tones of exposed soil partially overlap with the reflective spectrum of the UAV body, and bare farmland does not provide the strong color contrast advantage observed in green cropland. As a result, feature extractors based on red–green–blue (RGB) channels may struggle to learn sufficiently discriminative representations, thereby degrading detection performance under this background.
Under the woodland background, YOLOv5n and YOLOv8n show a clear divergence. The precision of YOLOv5n is 0.96, whereas YOLOv8n achieves only 0.88, a gap of 8 percentage points. This difference can be attributed to edge fragmentation caused by randomly distributed branch and leaf occlusions, which affects networks of different depths in different ways. Shallower networks tend to rely more on local strong gradient cues, whereas deeper networks rely more on global semantic consistency. When occlusions leave the target contour incomplete, long-range dependency mechanisms in deeper networks may instead introduce more spurious activations. This finding provides practical evidence for incorporating attention mechanisms or explicit partial occlusion modeling in future algorithms. It also confirms the necessity of retaining motion blur samples in the dataset: in real agricultural monitoring, image quality degradation is not simply removable noise, but an inherent characteristic that algorithms must learn to handle.
Overall, the experimental results confirm that our dataset effectively captures the core technical challenges encountered in agricultural UAV monitoring and provides a reliable basis for subsequent algorithm analysis and optimization. The systematic recall-based analysis further reveals how missed detection risks distribute across backgrounds, offering data support for adopting differentiated detection strategies for high-risk backgrounds in real-world supervision deployments.