1. Introduction
In recent years, agricultural production has undergone profound transformations, with the autonomous operation of intelligent agricultural machinery emerging as a key trend in modern agriculture [1]. In this context, developing a reliable visual perception system is essential for enabling autonomous navigation and path planning in agricultural equipment [2,3]. Recent studies on field obstacle perception have further emphasized that accurate recognition and localization of pedestrians, agricultural machinery, utility poles, and other obstacles are fundamental to the safe autonomous operation of agricultural equipment [4,5]. However, real-world agricultural environments are typically highly unstructured, characterized by complex background interference (e.g., abrupt illumination changes and heavy dust) alongside diverse, irregular obstacles. Failure to accurately and efficiently detect such obstacles on resource-constrained platforms can lead to significant economic losses and safety risks. Therefore, designing an obstacle detection framework that ensures high accuracy while reducing model complexity remains a critical research focus in agricultural artificial intelligence.
Existing agricultural obstacle detection systems primarily rely on sensors such as 3D LiDAR and vision-based devices [6]. Although LiDAR can acquire precise three-dimensional point cloud data through active sensing [2], its high hardware cost, fragile mechanical structure, and substantial computational burden limit its deployment in small- and medium-sized agricultural machinery. To alleviate the dependence on conventional LiDAR, recent studies have also explored lower-cost depth-assisted perception schemes. For example, Zhang et al. [4] combined binocular vision with an improved You Only Look Once (YOLO)-based detector to achieve field obstacle detection and spatial localization, while Ren et al. [7] employed an RGB-D camera to construct three-dimensional point cloud representations for orchard vehicle object detection. In contrast, vision-based approaches have attracted increasing attention due to their cost-effectiveness and rich environmental information. With the rapid advancement of deep learning, convolutional neural networks (CNNs) have been widely adopted for agricultural perception tasks.
Recent studies have applied deep object detectors to various agricultural perception tasks, including crop detection, fruit detection, weed recognition, pest monitoring, and agricultural machinery perception [8,9,10,11,12]. These studies demonstrate the effectiveness of deep learning-based object detection in agricultural environments. However, many of them are designed for task-specific targets and are evaluated on private or specialized datasets with different object categories, acquisition platforms, imaging distances, and environmental conditions. In contrast, the AO-Dataset in this study focuses on obstacle detection for agricultural machinery navigation in unstructured environments, where the target categories, scale distribution, and background complexity differ from many crop- or fruit-centered agricultural datasets.
Vision-based object detection methods are commonly divided into two-stage detectors, such as Faster R-CNN [13], and one-stage detectors, including the YOLO family [14,15,16]. Two-stage approaches generally provide strong detection accuracy but require substantial computation, which limits their applicability in real-time and resource-constrained scenarios. To improve efficiency, recent work has explored lightweight YOLO-based variants for agricultural applications, achieving favorable performance on embedded platforms [6]. Despite these advances, CNN-based detectors remain constrained by their limited receptive fields, restricting their ability to capture long-range contextual information. This limitation becomes more pronounced in unstructured agricultural environments, where obstacles often exhibit irregular shapes and long-tail distributions, including dynamically moving tumbleweeds, distant poles and transmission towers with extreme aspect ratios, and small-scale aerial targets such as agricultural drones. In addition, complex backgrounds, such as soil textures and dense vegetation, can introduce feature ambiguity, increasing the likelihood of missed detections and false positives, particularly for small objects.
Transformer-based end-to-end detection frameworks offer an alternative to CNN-based approaches by enabling global modeling through self-attention mechanisms. This capability allows the model to capture long-range dependencies, which is beneficial for detecting multi-scale and irregularly shaped objects in complex environments [17,18]. Detection Transformer (DETR) [19] introduces this paradigm into object detection by directly predicting object categories and bounding boxes, removing the need for hand-crafted post-processing steps such as non-maximum suppression (NMS). To improve training efficiency and reduce computational cost, Real-Time Detection Transformer (RT-DETR) [20] incorporates a series of optimized designs, achieving faster inference while preserving the benefits of global feature modeling.
Despite these advancements, directly applying RT-DETR to agricultural perception systems still faces critical challenges. On the one hand, small-scale targets such as drones and tumbleweeds exhibit weak feature representations, and the static feature aggregation mechanism of RT-DETR struggles to adaptively focus on fine-grained regions, leading to missed detections. On the other hand, the parameter scale of RT-DETR remains relatively high for resource-constrained edge devices, making it unsuitable for long-term, low-power deployment in agricultural applications.
To address these challenges, this paper presents Agri-DETR, a parameter-efficient detection framework tailored for unstructured agricultural environments. Built upon the RT-DETR architecture, the proposed method introduces a systematic co-design to improve parameter efficiency while enhancing global perception capability for complex and long-tail obstacles. The main contributions of this work are summarized as follows:
A dedicated agricultural obstacle dataset (AO-Dataset) is constructed to fill the gap of existing public datasets in domain-specific agricultural scenarios. The dataset includes challenging long-tail targets such as tumbleweeds, agricultural drones, and distant transmission towers, providing a realistic benchmark for intelligent agricultural perception systems.
A parameter-efficient backbone based on an improved StarNet architecture is employed, with an additional high-resolution feature branch to preserve fine-grained spatial details and enhance the detection of small-scale obstacles.
A large-kernel selective feature fusion module (LSR-C3) is introduced to strengthen feature representation by reducing background interference and highlighting target regions, while an attention-enhanced dynamic upsampling mechanism (EMA-CARAFE) further improves high-resolution feature reconstruction, particularly for boundary and detail recovery of small objects.
A geometry-aware bounding box regression loss (SSGIoU) is employed to provide more informative geometric supervision, leading to improved localization accuracy in complex scenarios and for objects with extreme shapes.
Agri-DETR achieves strong performance with only 14.5 M parameters, outperforming lightweight RT-DETR baselines while remaining competitive with YOLO12m. This framework offers an efficient visual perception solution for intelligent agricultural machinery under resource-constrained conditions.
2. Materials and Methods
2.1. Dataset Construction
The effectiveness of deep learning-based object detection models is closely related to the availability of large-scale, high-quality datasets [21,22]. Widely used benchmarks such as COCO [23] and KITTI [24] are mainly developed for urban or general-purpose scenarios, and their category definitions and data distributions do not sufficiently reflect the obstacles encountered in unstructured agricultural environments. This mismatch limits their suitability for agricultural perception tasks and motivates the need for a dedicated dataset that better represents real-world operating conditions.
To better reflect real-world agricultural conditions, an Agricultural Obstacle Dataset (AO-Dataset) is developed for complex unstructured environments, with a particular focus on saline–alkali farmland. The dataset includes six representative categories: Person, Agricultural Machinery (denoted as “Agri-Machinery”), Drone, Pole, Iron Tower, and Tumbleweed, as illustrated in Figure 1. These categories encompass both dynamic and static obstacles and exhibit significant variations in scale, such as small drones and large transmission towers, as well as diverse structural characteristics, including slender poles and irregular tumbleweeds. This design enables a more realistic representation of the challenges encountered in practical agricultural perception tasks.
The category design explicitly considers the characteristics of saline–alkali environments, which are common in arid and semi-arid regions. Unlike conventional farmland, these environments are often characterized by strong winds, sparse vegetation, and loose soil, conditions that readily give rise to dynamic obstacles such as tumbleweeds. These objects exhibit irregular shapes, large scale variations, and highly uncertain motion patterns, making them difficult to detect and track in visual perception systems. Existing datasets and prior studies rarely account for such environment-specific targets, leading to frequent missed detections and false positives in practical applications. To better capture these challenges, the Tumbleweed category is explicitly included, providing a more realistic representation of harsh agricultural operating conditions.
Intra-class diversity is also taken into account to improve the practical relevance of the dataset. For instance, the Person category includes a variety of agricultural working postures, such as bending and wearing hats, while the Agricultural Machinery category covers different types of equipment, including tractors, harvesters, and transport vehicles. The Drone category further reflects emerging scenarios involving collaboration between ground machinery and aerial platforms. Such configurations allow the dataset to remain broadly applicable while better representing real-world agricultural conditions.
To increase data diversity and reduce the risk of overfitting caused by homogeneous data sources, a multi-source data acquisition strategy is adopted. Data collected from a single farmland environment tends to bias the model toward specific conditions, such as soil color, vegetation texture, or illumination patterns. To alleviate this issue, the dataset incorporates heterogeneous sources, including high-quality public image platforms and cross-platform web data, as summarized in Table 1. These sources cover a wide range of environmental conditions across different regions, seasons, and weather scenarios. All images undergo manual screening, cross-validation, deduplication, and quality control to ensure the overall reliability and diversity of the dataset.
2.2. Data Augmentation and Processing
To better align high-quality collected images with the degraded visual conditions encountered in real-world unstructured agricultural environments, a class-aware data augmentation strategy is applied to improve robustness and mitigate class imbalance [25].
During data collection, large-scale objects such as agricultural machinery and pedestrians appear more frequently due to their higher visibility, while slender structures like poles are less likely to be captured and are therefore underrepresented. This imbalance can bias the model toward dominant categories during training. To reduce this effect without altering the overall data distribution, data augmentation is applied to expand the feature space of minority classes and improve class balance. The augmentation strategies and their corresponding physical interpretations are summarized in Table 2, with representative examples shown in Figure 2.
All images are annotated by multiple trained annotators using the LabelImg (v1.8.6) tool, following a two-stage process that includes initial labeling and cross-validation to maintain annotation quality. Annotations are provided in COCO format, and inter-annotator consistency, measured by Intersection over Union (IoU > 0.8), exceeds 92%.
It should be noted that Table 1 reports the number of raw images collected from different sources before augmentation. Specifically, the raw AO-Dataset contains 10,300 original images, including 6200 images from self-collected and web-based sources and 4100 images selected from public datasets after filtering and re-annotation. After offline augmentation and preprocessing, the final experimental image files used in this study contain 14,305 images with 26,895 annotated instances. Therefore, the 10,300 images refer to raw collected images before augmentation, whereas the 14,305 images refer to the final experimental image files after augmentation and preprocessing.
The final experimental image files are divided into training, validation, and test sets with ratios of 70%, 20%, and 10%, respectively. To maintain fair evaluation, scenes are not shared across different subsets.
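A scene-disjoint split of this kind can be sketched as follows. This is a minimal illustration, not the procedure used to build the AO-Dataset: the `image_to_scene` mapping, the function name `split_by_scene`, and the greedy assignment rule are all assumptions, since the text does not describe how scenes are identified or allocated.

```python
import random
from collections import defaultdict


def split_by_scene(image_to_scene, ratios=(0.7, 0.2, 0.1), seed=0):
    """Assign whole scenes to train/val/test so that no scene spans two subsets.

    image_to_scene: dict mapping an image name to a scene identifier
    (hypothetical metadata; the paper does not specify how scenes are labeled).
    """
    scenes = defaultdict(list)
    for image, scene in image_to_scene.items():
        scenes[scene].append(image)

    scene_ids = sorted(scenes)
    random.Random(seed).shuffle(scene_ids)

    names = ["train", "val", "test"]
    targets = [r * len(image_to_scene) for r in ratios]
    subsets = {name: [] for name in names}

    # Greedy assignment: largest scenes first, each scene goes to the subset
    # that is currently furthest below its target image count.
    for scene in sorted(scene_ids, key=lambda s: -len(scenes[s])):
        deficits = [targets[i] - len(subsets[names[i]]) for i in range(3)]
        best = deficits.index(max(deficits))
        subsets[names[best]].extend(scenes[scene])
    return subsets
```

Because whole scenes are moved together, the realized ratios only approximate 70/20/10 when scene sizes are uneven; the guarantee is that augmented or near-duplicate views of one scene never leak across subsets.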
The class distribution after augmentation and preprocessing is shown in Table 3. Despite the use of class-aware augmentation, the AO-Dataset still preserves a moderate and realistic imbalance across both category and scale, with large objects appearing more frequently and small or rare categories remaining relatively scarce. This distribution reflects real agricultural conditions, where certain obstacles appear only occasionally, and provides a practical benchmark for evaluating detection robustness in complex unstructured agricultural environments.
2.3. Construction of the Agri-DETR Model
To improve detection accuracy and parameter efficiency under resource constraints, Agri-DETR is developed based on the RT-DETR architecture as an efficient visual perception framework for intelligent agricultural machinery. It targets challenges commonly observed in unstructured agricultural environments, including weak feature representation of rare obstacles (e.g., tumbleweeds and drones), complex background interference, and limited computational capacity. Compared with conventional CNN-based approaches that rely on local receptive fields, the framework leverages Transformer-based global context modeling and adopts a coordinated design across backbone, feature fusion, upsampling, and bounding box regression. The overall architecture is illustrated in Figure 3.
Compared with the original RT-DETR, Agri-DETR retains the advantages of end-to-end detection while being specifically optimized for two primary objectives: fine-grained target perception and high parameter efficiency. To this end, a series of architectural enhancements are introduced. To begin with, a parameter-efficient backbone based on an improved StarNet is constructed to significantly reduce model complexity. Additionally, a high-resolution feature perception branch is incorporated to enhance the representation of small-scale and slender obstacles. Furthermore, a multi-scale contextual feature fusion module (LSR-C3) is designed to improve feature discriminability and robustness under complex agricultural backgrounds. Finally, an attention-enhanced dynamic upsampling mechanism based on EMA-CARAFE is introduced to accurately reconstruct fine-grained spatial details, particularly for object boundaries.
2.3.1. Improved StarNet Backbone
The backbone network of the original RT-DETR primarily relies on classical architectures such as ResNet [26] and HGNetv2 [27]. While ResNet alleviates gradient vanishing through residual connections and HGNetv2 enhances performance via carefully designed hierarchical structures, both architectures heavily depend on standard full-channel convolutions in deeper layers. This design leads to substantial parameter redundancy and memory overhead, making them unsuitable for continuous, low-power deployment on resource-constrained agricultural edge devices.
A common strategy for lightweight design is to replace the backbone with efficient architectures such as MobileNetV3 [28], ShuffleNetV2 [29], and EfficientNet [30], which reduce computational cost through mechanisms such as channel shuffling and compound scaling. However, in complex unstructured agricultural environments, where obstacles exhibit large scale variations and highly heterogeneous shapes, directly adopting these general-purpose lightweight backbones often results in insufficient global semantic representation and degraded fine-grained feature modeling. Consequently, achieving a balance between parameter efficiency and fine-grained perception remains challenging.
To improve feature efficiency, a lightweight convolutional architecture, StarNet [31], is adopted as the backbone, as shown in Figure 4a. Its fundamental unit, the Star Block (Figure 4b), serves as a parameter-efficient alternative to conventional convolutional structures such as the ResNet block (Figure 4c). Compared with standard Conv–BN–ReLU–Conv designs, which often introduce redundant parameters, the Star Block improves efficiency through a channel-splitting and feature reconstruction mechanism. Specifically, the input feature is divided into two branches: one preserves spatial information, while the other generates modulation weights via depthwise separable convolution and $1 \times 1$ convolution. The two branches are then fused through element-wise multiplication and channel shuffling, enabling efficient cross-channel interaction with enhanced nonlinearity.
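The parameter efficiency of the Star Block stems from its decomposition of standard convolution. For a standard convolution, the parameter count and floating-point operations (FLOPs) can be expressed as
$$\mathrm{Params}_{std} = K^2 C_{in} C_{out}, \qquad \mathrm{FLOPs}_{std} = H W K^2 C_{in} C_{out},$$
whereas depthwise separable convolution significantly reduces both metrics to
$$\mathrm{Params}_{dws} = K^2 C_{in} + C_{in} C_{out}, \qquad \mathrm{FLOPs}_{dws} = H W \left( K^2 C_{in} + C_{in} C_{out} \right),$$
where $H$ and $W$ denote the spatial height and width of the input feature map, $K$ is the kernel size, and $C_{in}$ and $C_{out}$ represent the numbers of input and output channels, respectively.
When $C_{out} \gg K^2$, the parameter compression ratio can be expressed as
$$\frac{\mathrm{Params}_{dws}}{\mathrm{Params}_{std}} = \frac{1}{C_{out}} + \frac{1}{K^2} \approx \frac{1}{K^2}.$$
This result suggests that when the channel dimension is large, the efficiency gain is mainly determined by the kernel size $K$, allowing StarNet to achieve substantial parameter reduction while maintaining sufficient representational capacity.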
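As a quick numerical check of this comparison, the following sketch counts parameters and FLOPs for both convolution types. It assumes stride 1 with "same" padding (so the output map is also $H \times W$), ignores bias terms, and counts one multiply-accumulate as one operation; the function names are illustrative only.

```python
def standard_conv_cost(h, w, k, c_in, c_out):
    """Parameters and FLOPs of a standard K x K convolution (no bias)."""
    params = k * k * c_in * c_out
    flops = h * w * params
    return params, flops


def depthwise_separable_cost(h, w, k, c_in, c_out):
    """Parameters and FLOPs of a K x K depthwise conv followed by a 1 x 1 conv."""
    params = k * k * c_in + c_in * c_out
    flops = h * w * params
    return params, flops


def compression_ratio(k, c_in, c_out):
    """Ratio of depthwise separable parameters to standard parameters."""
    dws_params, _ = depthwise_separable_cost(1, 1, k, c_in, c_out)
    std_params, _ = standard_conv_cost(1, 1, k, c_in, c_out)
    return dws_params / std_params


if __name__ == "__main__":
    for channels in (32, 128, 512):
        ratio = compression_ratio(3, channels, channels)
        print(f"C={channels}: ratio={ratio:.4f}, 1/K^2={1 / 9:.4f}")
```

For a 3 × 3 kernel the ratio approaches 1/9 as the channel count grows, which is the behavior described in the text.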
Although StarNet is parameter-efficient, directly applying it may still lead to the loss of fine-grained geometric information due to progressive downsampling. To address this issue, a high-resolution feature perception branch is introduced into the backbone. In addition to the conventional multi-scale feature hierarchy (S2–S4), a shallow high-resolution feature map S1 is incorporated. As illustrated by the HRF (High-Resolution Fusion) module in Figure 4a, the S1 feature is first downsampled and then fused with S2 through the subsequent LSR-C3 module. This design preserves rich edge and texture details, enabling effective integration of high-resolution spatial information with deep semantic representations.
Relative to deeper abstracted features, the S1 feature map preserves richer spatial details, which are important for detecting small and slender objects such as drones and distant poles. This modification alleviates the loss of spatial information caused by repeated downsampling and improves fine-grained feature discrimination in complex backgrounds. The detailed configuration and parameter distribution of the reconstructed backbone are presented in Table 4.
Overall, the improved StarNet backbone achieves a favorable balance between computational efficiency and representational capability through the synergistic design of lightweight convolutional structures and high-resolution feature compensation. This design improves multi-scale detection, particularly for small objects, and yields more discriminative features for subsequent modules.
2.3.2. Hybrid Encoder Optimization: Large Selective Feature Fusion Module (LSR-C3)
In RT-DETR, the hybrid encoder aggregates multi-level features (S2, S3, S4) from the backbone through the cross-scale feature fusion (CCFF) module, acting as a bridge between spatial details and semantic representations. The stacked RepC3 modules [32] maintain efficient inference through re-parameterization, but their reliance on standard convolutions introduces limitations when handling high-dimensional features in unstructured agricultural environments. The restricted receptive field makes it difficult to capture global geometric structures and long-range dependencies of large-scale or slender objects, such as transmission towers and poles. In addition, standard convolutions lack adaptive spatial selection, which reduces their ability to suppress background interference under challenging conditions, including low-contrast targets and complex environments with dust or vegetation clutter, leading to ambiguous feature representations.
To improve feature representation under these constraints, a parameter-efficient and spatially-aware feature extraction module, Large Selective Rep-C3 (LSR-C3), is introduced. Its core unit, the LSR-Block, follows a dual-branch complementary architecture, as illustrated in Figure 5.
The input multi-scale features are first processed by a CSPRepLayer, a lightweight feature aggregation module in RT-DETR, for initial integration of local textures and high-level semantics. Cross-branch channel partitioning and residual aggregation reduce redundant computation while preserving fine-grained details, resulting in more discriminative features for subsequent spatial enhancement.
The fused features are then fed into the LSR-Block for large-kernel selective spatial modeling. As shown in Figure 5, the block contains two complementary spatial modeling paths. A depthwise convolution is used to capture local textures and mid-scale spatial patterns, while a dilated depthwise convolution with dilation rate 3 further enlarges the effective receptive field to model broader contextual dependencies. In this way, the module enhances both local detail perception and large-scale structural representation while introducing only limited additional parameters.
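The effect of dilation on the receptive field can be made concrete with a short calculation. The kernel sizes used below (5 and 7) are hypothetical placeholders chosen for illustration, since the exact sizes are not recoverable from the text; only the dilation rate of 3 comes from the description above. The sketch also assumes stride-1 layers, and the stacked case applies only if the two convolutions are applied in sequence.

```python
def effective_kernel(kernel_size, dilation=1):
    """Spatial extent covered by a dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


def stacked_receptive_field(layers):
    """Receptive field of stride-1 convolutions applied in sequence.

    layers: list of (kernel_size, dilation) tuples.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += effective_kernel(kernel_size, dilation) - 1
    return field


def depthwise_params(kernel_size, channels):
    """Parameter count of a depthwise convolution (no bias)."""
    return kernel_size * kernel_size * channels


if __name__ == "__main__":
    # Hypothetical kernel sizes; dilation rate 3 is taken from the text.
    local_path = (5, 1)
    dilated_path = (7, 3)
    print("dilated extent:", effective_kernel(*dilated_path))
    print("stacked field:", stacked_receptive_field([local_path, dilated_path]))
    print("depthwise params at C=256:", depthwise_params(7, 256))
```

The point of the calculation is that dilation enlarges the covered extent without adding weights: a depthwise kernel keeps $K^2 C$ parameters regardless of the dilation rate.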
To adaptively select informative spatial responses, the features from different receptive-field paths are compressed by convolutions. Channel-wise average pooling and max pooling are then applied to generate spatial descriptors, which are processed by a convolution followed by a sigmoid activation to produce spatial selection weights. These weights are used to recalibrate the receptive-field responses through weighted aggregation. Finally, a convolution restores the channel dimension, and a residual connection is introduced to preserve the original feature representation. This selective reweighting mechanism helps enhance obstacle-related regions while suppressing redundant background responses, improving feature representation in complex agricultural scenes.
Compared with conventional full-convolution-based feature fusion, the proposed LSR-C3 introduces large-kernel spatial modeling through depthwise convolutions and lightweight channel transformations. This design enlarges the effective receptive field and enhances spatial selectivity while keeping the additional computational overhead relatively low. As validated in the experimental section, the proposed module achieves a better accuracy–efficiency trade-off than the baseline design.
2.3.3. Dynamic Decoding Upsampling: Efficient Multi-Scale Feature Reassembly Module (EMA-CARAFE)
In RT-DETR, upsampling in the hybrid encoder and feature fusion path plays an important role in cross-scale information interaction. Conventional upsampling operations, such as nearest-neighbor or bilinear interpolation, are content-agnostic and determine interpolation weights only according to geometric distance. Such fixed interpolation strategies ignore semantic context and local structural variations. In unstructured agricultural environments, this may lead to aliasing artifacts, discontinuous fine structures, and blurred boundaries between obstacles and complex backgrounds, especially for slender objects and small targets.
CARAFE [33] alleviates this problem by predicting content-aware reassembly kernels for feature upsampling. However, the standard CARAFE module mainly relies on local convolutional encoding during kernel generation, which limits its ability to model broader contextual dependencies and suppress background interference. In agricultural scenes with vegetation clutter, dust, and low-contrast targets, the generated reassembly weights may be affected by noisy responses, resulting in less accurate boundary reconstruction.
To enhance content-aware upsampling, an efficient multi-scale feature reassembly module, termed EMA-CARAFE, is introduced. As shown in Figure 6, the proposed module first employs an EMA-style attention mechanism to recalibrate the input feature before CARAFE-based dynamic upsampling. In this way, the feature representation used for dynamic kernel generation and neighborhood reassembly is enhanced by spatial and channel-aware attention.
The input feature is first processed by an EMA-style attention module. Specifically, a convolution is applied to unify channel representations and reduce redundant responses, followed by channel grouping to facilitate efficient feature interaction. The grouped features are then processed through two complementary branches: a directional pooling branch, which applies average pooling along the spatial dimensions (height and width) to capture long-range dependencies, and a local spatial refinement branch to enhance fine-grained spatial relationships, resulting in more discriminative feature representations. The outputs of these branches are fused and projected through a convolution to obtain a recalibrated feature representation. Notably, this attention mechanism preserves the spatial resolution of the input feature.
The recalibrated feature is subsequently fed into the CARAFE dynamic upsampling module, which transforms the feature map $X \in \mathbb{R}^{C \times H \times W}$ into $X' \in \mathbb{R}^{C \times \sigma H \times \sigma W}$, where $\sigma$ denotes the upsampling scale factor. The CARAFE module consists of two key stages: dynamic kernel generation and neighborhood reassembly. In the kernel generation stage, content-aware reassembly kernels are predicted through channel compression and a lightweight content encoder. In the reassembly stage, these kernels are applied to local neighborhoods via feature unfolding and weighted aggregation to reconstruct high-resolution features. The upsampled feature is fused with the recalibrated feature from the EMA-style attention module to produce the final output.
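The reassembly stage can be illustrated with a small single-channel sketch in plain Python. It covers only the neighborhood reassembly step as generally described for CARAFE; the kernel scores are passed in directly, so the channel compression, content encoder, and EMA-style attention are not modeled. The function names, the list-based layout, and the zero padding at the borders are assumptions made for illustration.

```python
import math


def softmax(values):
    """Normalize raw kernel scores so the reassembly weights sum to one."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def carafe_reassemble(feature, kernel_logits, scale=2, k_up=3):
    """Content-aware reassembly of a single-channel H x W feature map.

    feature: list of H rows, each with W floats.
    kernel_logits: (scale*H) x (scale*W) x (k_up*k_up) raw scores, i.e. one
        predicted kernel per output location (supplied by the caller here).
    """
    height, width = len(feature), len(feature[0])
    radius = k_up // 2
    output = [[0.0] * (width * scale) for _ in range(height * scale)]
    for out_y in range(height * scale):
        for out_x in range(width * scale):
            # Each output location maps back to one source location.
            src_y, src_x = out_y // scale, out_x // scale
            weights = softmax(kernel_logits[out_y][out_x])
            value, index = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    y, x = src_y + dy, src_x + dx
                    if 0 <= y < height and 0 <= x < width:  # zero padding
                        value += weights[index] * feature[y][x]
                    index += 1
            output[out_y][out_x] = value
    return output
```

When every kernel puts essentially all of its weight on the center position, this reduces to nearest-neighbor upsampling; content-dependent kernels are what let the operator sharpen or smooth locally, which is the property the text relies on for boundary recovery.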
Compared with conventional interpolation-based upsampling, EMA-CARAFE combines attention-enhanced feature recalibration with content-aware feature reassembly. The EMA-style attention improves the discriminability of the input feature, while CARAFE dynamically adjusts the reconstruction process according to local structures and object boundaries. Since the module mainly relies on lightweight convolutions, grouped attention, and depthwise convolutions with region-specific reassembly operations, it improves feature reconstruction quality without introducing excessive computational overhead. This design is particularly beneficial for complex agricultural scenes, where small objects, slender structures, and cluttered backgrounds require more adaptive multi-scale feature fusion.
2.3.4. Bounding Box Regression Optimization Based on SSGIoU
Bounding box regression loss plays a key role in localization accuracy and convergence behavior in object detection. In RT-DETR, a combination of $L_1$ loss and GIoU (Generalized Intersection over Union) [34] is used for box optimization. This formulation constrains coordinate errors and overlap differences effectively in general scenarios, but shows limitations in unstructured agricultural environments.
The $L_1$ loss treats the four bounding box parameters independently, which couples center offset and scale variation in practice. This formulation is sensitive to small perturbations and can lead to unstable localization for small objects such as drones and tumbleweeds. GIoU provides gradients even in non-overlapping cases, but shows limited sensitivity to shape discrepancies, especially for objects with extreme aspect ratios, such as slender poles and transmission towers.
To address these issues, a scale- and shape-aware auxiliary penalty term, denoted as $L_{SS}$, is introduced. By integrating this term into the original regression objective, a new bounding box optimization scheme, termed SSGIoU, is formulated. The proposed approach explicitly decouples positional alignment from shape variation, thereby enhancing the model’s capability to reconstruct geometric structures under challenging conditions.
Under the SSGIoU framework, the overall regression loss combines the $L_1$ loss, the GIoU loss, and the auxiliary term $L_{SS}$.
To formally define the auxiliary term $L_{SS}$, let the predicted bounding box be $b = (x, y, w, h)$ and the ground-truth box be $b^{gt} = (x^{gt}, y^{gt}, w^{gt}, h^{gt})$, where $(x, y)$ denotes the center coordinates and $w$, $h$ denote width and height, respectively.
First, to decouple positional alignment from scale variation and improve sensitivity to small spatial shifts, a center alignment loss is defined as
The original $L_1$ loss includes coordinate supervision, but its gradients are shared between position and scale. An explicit center alignment term separates this coupling and improves positional optimization, which is especially useful for small-object localization.
To improve shape modeling for slender and irregular objects, a logarithmic shape consistency loss is introduced:
where $\epsilon$ is a small constant to avoid numerical instability. By operating in the logarithmic domain, this formulation provides smooth and scale-invariant gradients for aspect ratio alignment across objects of different sizes.
To account for large variations in object scale, an adaptive weighting mechanism based on object area is incorporated. Each positive sample is assigned to the small, medium, or large group according to its relative area and weighted accordingly. To reduce the dominance of large objects during loss aggregation, different weights are applied across scale groups. Empirically, the weights for small, medium, and large objects are set to 2.0, 1.5, and 1.0, respectively, based on observations from the AO-Dataset.
Combining the above components, the auxiliary penalty term is defined as
where $N$ is the number of positive samples, and the balancing coefficient between the center and shape terms is empirically set to 0.5.
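To make the structure of the auxiliary penalty easier to follow, the sketch below shows one plausible instantiation in plain Python. It is an illustration under stated assumptions, not the paper's exact formulation: the squared center distance, the absolute log-ratio shape term, the placement of the balancing coefficient, and the area thresholds `small_max` and `medium_max` are all assumed forms. Only the scale weights (2.0, 1.5, 1.0) and the coefficient value 0.5 come from the text. Boxes are given as (x, y, w, h) in normalized image coordinates.

```python
import math

# Scale weights for small / medium / large objects, as stated in the text.
SCALE_WEIGHTS = {"small": 2.0, "medium": 1.5, "large": 1.0}


def scale_group(gt_box, small_max=0.01, medium_max=0.1):
    """Group a ground-truth box by relative area (thresholds are hypothetical)."""
    area = gt_box[2] * gt_box[3]
    if area < small_max:
        return "small"
    if area < medium_max:
        return "medium"
    return "large"


def center_term(pred, gt):
    """Squared distance between box centers (assumed form)."""
    return (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2


def shape_term(pred, gt, eps=1e-7):
    """Absolute log-ratio of widths and heights (assumed form)."""
    width_ratio = math.log((pred[2] + eps) / (gt[2] + eps))
    height_ratio = math.log((pred[3] + eps) / (gt[3] + eps))
    return abs(width_ratio) + abs(height_ratio)


def ss_penalty(preds, gts, balance=0.5):
    """Scale-weighted mean of center + balance * shape over positive samples."""
    if not preds:
        return 0.0
    total = 0.0
    for pred, gt in zip(preds, gts):
        weight = SCALE_WEIGHTS[scale_group(gt)]
        total += weight * (center_term(pred, gt) + balance * shape_term(pred, gt))
    return total / len(preds)
```

Working in the log domain means that doubling the width of a small box and doubling the width of a large box incur the same shape penalty before scale weighting, which is the scale-invariance property the text attributes to the shape term.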
The proposed SSGIoU extends the original regression objective with additional geometric constraints. Scale- and shape-aware supervision improves localization accuracy under complex conditions. Combined with the high-resolution backbone and the dynamic upsampling module, this formulation provides a consistent optimization framework that enhances fine-grained localization performance in unstructured agricultural environments.
3. Results
3.1. AO-Dataset
AO-Dataset is developed to evaluate the proposed method in complex agricultural environments. It focuses on obstacle detection in unstructured farmland scenarios and includes representative categories such as Drone, Person, Tumbleweed, Pole, Iron Tower, and Agri-Machinery. Figure 7 shows the distributions of bounding-box sizes and object center positions, providing an overview of the dataset’s statistical characteristics.
As shown in Figure 7a, the object size distribution is highly imbalanced, with a large proportion of samples concentrated in the small-scale region. In particular, most objects have normalized widths and heights below 30%, indicating that small objects dominate the dataset. In addition, certain categories, such as poles and transmission towers, exhibit pronounced elongated structures with significant aspect ratio variations, which impose additional challenges for accurate shape modeling.
Figure 7b shows that object locations are not uniformly distributed across the image and tend to cluster near the center, indicating a clear center bias. This distribution highlights the importance of accurate center localization, as small spatial deviations can lead to noticeable performance degradation, especially for small objects.
The AO-Dataset introduces several challenges for object detection. The prevalence of small-scale objects requires fine-grained feature representation and precise localization, while elongated structures with large aspect ratio variations increase the difficulty of shape modeling. The concentration of object centers further emphasizes the need for accurate positional modeling. These factors motivate the design of the proposed Agri-DETR framework.
3.2. Evaluation Metrics
To comprehensively evaluate the performance of the proposed method in unstructured agricultural environments, the COCO evaluation protocol is adopted. Considering that the target scenarios involve a large number of small objects, such as drones, tumbleweeds, and distant persons, special emphasis is placed on scale-aware evaluation metrics.
Intersection over Union (IoU) is used to measure the overlap between predicted and ground-truth bounding boxes, defined as
$$\mathrm{IoU} = \frac{|B_{p} \cap B_{gt}|}{|B_{p} \cup B_{gt}|}$$
where $B_{p}$ and $B_{gt}$ denote the predicted and ground-truth bounding boxes, respectively.
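As an illustration of the overlap measure, the sketch below computes IoU for two axis-aligned boxes. It assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates; this format and the function name `box_iou` are assumptions made for the example, not details taken from the paper.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the intersection rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate boxes with zero area.
    return inter / union if union > 0 else 0.0
```

Two 2 × 2 boxes offset by one pixel in each direction share an area of 1 out of a union of 7, so their IoU is 1/7 and the prediction would not count as a match at either the 0.50 or the 0.75 threshold.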
Following the COCO protocol, detection performance is evaluated using Average Precision (AP) at different IoU thresholds, including AP50 and AP75, corresponding to IoU thresholds of 0.50 and 0.75, respectively. In addition, scale-aware metrics are reported, including APS, APM, and APL, which measure detection performance for small, medium, and large objects.
To provide a more comprehensive evaluation, Average Recall (AR) is also reported. AR reflects the model’s ability to retrieve ground-truth objects under different detection thresholds.
In the COCO evaluation framework, small, medium, and large objects are defined as those with areas less than 32 × 32 pixels, between 32 × 32 and 96 × 96 pixels, and greater than 96 × 96 pixels, respectively.
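The COCO size categories can be expressed as a short lookup. The sketch below uses the standard COCO area boundaries of 32² and 96² pixels; assigning an area that falls exactly on a boundary to the medium group is an assumption of this example, and the function name is illustrative.

```python
SMALL_MAX_AREA = 32 ** 2   # 1024 px^2, upper bound of the small range
MEDIUM_MAX_AREA = 96 ** 2  # 9216 px^2, upper bound of the medium range


def coco_size_category(area_px):
    """Map a box area in pixels to the COCO small/medium/large category."""
    if area_px < SMALL_MAX_AREA:
        return "small"
    if area_px <= MEDIUM_MAX_AREA:
        return "medium"
    return "large"
```

A distant drone occupying 20 × 25 pixels (area 500) is therefore scored under APS, while a nearby machine occupying 200 × 150 pixels is scored under APL.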
Overall, the model is evaluated using COCO-style metrics (AP, AP50, AP75, APS, APM, APL) and AR, ensuring a thorough assessment of detection performance, particularly for small objects in complex agricultural environments.
3.3. Experimental Environment and Training Settings
All models are trained and evaluated under a unified experimental setup to maintain fairness and reproducibility. Experiments are conducted on a Windows 11 platform (Microsoft Corporation, Redmond, WA, USA) with an AMD Ryzen 9 9950X CPU (Advanced Micro Devices, Inc., Santa Clara, CA, USA) and an NVIDIA RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 24 GB memory. The software environment includes Python 3.8.20, PyTorch 2.0.1, and CUDA 11.8, with VS Code 1.115.0 used for development. The detailed configuration is summarized in Table 5.
All models are trained using the same data split, with identical training, validation, and test sets. The input resolution is set to , and the batch size is 16. The learning rate schedule and training strategy follow the official or original configuration of each model family as closely as possible.
Different optimization strategies are employed for different model families. For RT-DETR and its variants (including the proposed Agri-DETR), the AdamW optimizer is used with an initial learning rate of . For other comparative models, the SGD optimizer is adopted with an initial learning rate of 0.01 and a momentum of 0.9. All other hyperparameters of the compared models follow their original implementations as closely as possible to ensure a fair comparison.
To further ensure a fair comparison, all compared models are trained using their official or widely adopted implementations. The main hyperparameters of each model family, including the optimizer, learning rate, training schedule, and default training strategy, follow the official recommendations or original configurations as closely as possible, while the dataset split, input resolution, batch size, hardware platform, and evaluation metrics are kept consistent across all models. Agri-DETR receives no model-specific tuning beyond the proposed architectural modifications.
Regarding training epochs, RT-DETR-based models are trained for a maximum of 150 epochs, while other models are trained for 300 epochs. This setting follows the training protocols of the original implementations to ensure fair comparison. In addition, an early stopping strategy is employed to prevent overfitting and reduce training time, where training is terminated if no improvement is observed on the validation set for 50 consecutive epochs.
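The early stopping rule can be sketched as a small patience counter. This is an illustrative sketch, not the training code used in the experiments: the class and function names are hypothetical, and it assumes that the monitored validation quantity is a score where higher is better (for example, validation AP), which the text does not specify.

```python
class EarlyStopping:
    """Stop when the validation score has not improved for `patience` epochs."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_score):
        """Record one epoch; return True if training should stop."""
        if val_score > self.best:
            self.best = val_score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_scores, max_epochs, patience=50):
    """Number of epochs executed for a given sequence of validation scores."""
    stopper = EarlyStopping(patience)
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if stopper.step(score):
            return epoch
    return min(len(val_scores), max_epochs)
```

With a patience of 50 and a 150-epoch budget, a run whose best validation score occurs at epoch 60 would terminate at epoch 110 instead of using the full schedule.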
3.4. Comparison with Representative Methods
To objectively evaluate the performance of the proposed Agri-DETR for agricultural obstacle detection, comparisons are conducted with several representative detectors under a unified experimental setup. Although many agricultural detection methods have been reported in recent studies [35,36], their direct numerical comparison with this study is challenging because most of them are evaluated on task-specific or private datasets with different object categories, acquisition platforms, imaging distances, environmental conditions, and evaluation protocols. Therefore, directly comparing their reported results with those obtained on the AO-Dataset may be unfair and may not accurately reflect model differences.
For this reason, this study evaluates representative detection architectures on the same AO-Dataset with the same data split, input resolution, training protocol, hardware platform, and evaluation metrics. The compared methods include the classical two-stage detector Faster R-CNN [37], one-stage detectors from the YOLO family [38,39], and the end-to-end Transformer-based detector RT-DETR [20]. These baselines cover the main technical routes used in both general object detection and agricultural visual perception. Such a unified evaluation protocol provides a fairer basis for assessing the effectiveness of Agri-DETR in agricultural obstacle detection.
Table 6 shows that Transformer-based detectors (RT-DETR series) achieve higher recall (AR) than most YOLO models, reflecting the benefit of global feature modeling in complex environments. RT-DETR-R18 and RT-DETR-R50 obtain AP values of 65.3% and 65.5%, respectively, indicating comparable detection accuracy. The gain from increasing model capacity is limited: RT-DETR-R50 improves AP by only 0.2 percentage points over RT-DETR-R18 while requiring substantially higher computational cost (41.5 M parameters and 131.8 GFLOPs). This result indicates that scaling model size alone does not lead to proportional performance improvements.
Figure 8 compares AP, GFLOPs, and parameter size across all models, providing a direct view of the trade-off between detection accuracy and computational complexity. Faster R-CNN shows the highest computational cost but much lower accuracy, reflecting inefficient use of model capacity. The YOLO series operates in a relatively low-complexity regime, where performance gains are mainly achieved by increasing model size. For example, YOLO12m reaches 65.9% AP with 20.1 M parameters and 67.1 GFLOPs.
In contrast, the RT-DETR series achieves relatively higher accuracy but at increased computational cost, especially for RT-DETR-R50. Compared with these methods, Agri-DETR demonstrates a more favorable balance between accuracy and efficiency. Specifically, Agri-DETR achieves the highest AP of 66.0% among all compared methods, while maintaining only 14.5 M parameters and 58.9 GFLOPs. Compared with RT-DETR-R18, the proposed method improves detection accuracy without increasing computational cost, while reducing the model size by approximately 25%.
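The relative savings quoted above can be checked with a short calculation. The sketch below uses the Agri-DETR and RT-DETR-R50 figures from this section and the RT-DETR-R18 figures (19.62 M parameters, 59.39 GFLOPs) reported in the ablation study; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, new):
    """Percentage by which `new` is smaller than `baseline`."""
    return (baseline - new) / baseline * 100.0


# Parameters in millions and FLOPs in GFLOPs, as reported in the paper.
AGRI_DETR = {"params": 14.5, "gflops": 58.9}
RT_DETR_R18 = {"params": 19.62, "gflops": 59.39}
RT_DETR_R50 = {"params": 41.5, "gflops": 131.8}

for name, base in (("RT-DETR-R18", RT_DETR_R18), ("RT-DETR-R50", RT_DETR_R50)):
    p = relative_reduction(base["params"], AGRI_DETR["params"])
    f = relative_reduction(base["gflops"], AGRI_DETR["gflops"])
    print(f"vs {name}: params -{p:.1f}%, FLOPs -{f:.1f}%")
```

Relative to RT-DETR-R18 this gives a parameter reduction of about 26% with FLOPs lower by under 1%, which matches the statement that model size falls by roughly a quarter at essentially unchanged computational cost.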
These results indicate that the performance gain of Agri-DETR is attributed to more effective feature representation and multi-scale modeling, rather than brute-force scaling of model capacity. Moreover, as reflected by AP50, AP75, and AR in Table 6, Agri-DETR maintains accurate localization at higher IoU thresholds and exhibits stable recall in complex scenarios.
In addition to parameter size and FLOPs, inference speed is also evaluated to further analyze practical efficiency. Since YOLO-family detectors are highly optimized one-stage architectures, their inference speed, measured in frames per second (FPS), is generally higher than that of Transformer-based detectors. Therefore, as shown in
Table 7, the speed comparison is mainly discussed within the RT-DETR-based framework. Under the same PyTorch inference setting with batch size 1 and an input resolution of
, RT-DETR-R18, RT-DETR-R50, and Agri-DETR achieve 76.89 FPS, 49.61 FPS, and 74.08 FPS, respectively. Although Agri-DETR is slightly slower than RT-DETR-R18, it achieves higher AP with fewer parameters and comparable FLOPs. Compared with RT-DETR-R50, Agri-DETR achieves both higher accuracy and faster inference speed with substantially lower model complexity. This indicates that the proposed method maintains a practical efficiency level while improving detection performance.
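A throughput figure of this kind is typically obtained by timing repeated single-image forward passes after a warm-up phase. The sketch below shows one generic way to do this and is not the authors' benchmarking script: `infer` stands for any callable that runs one forward pass, and the warm-up and iteration counts are assumed values.

```python
import time


def measure_fps(infer, sample, warmup=10, iters=100):
    """Average frames per second of `infer(sample)` over `iters` timed calls."""
    # Warm-up calls are excluded from timing (caches, lazy initialization).
    for _ in range(warmup):
        infer(sample)

    # When timing a GPU model, the device must be synchronized before reading
    # the clock (e.g. torch.cuda.synchronize() in PyTorch), otherwise only the
    # kernel launch time is measured.
    start = time.perf_counter()
    for _ in range(iters):
        infer(sample)
    elapsed = time.perf_counter() - start
    return iters / elapsed


def latency_ms(fps):
    """Convert a throughput in FPS to per-image latency in milliseconds."""
    return 1000.0 / fps
```

Expressed as latency, the reported 76.89 FPS, 49.61 FPS, and 74.08 FPS correspond to about 13.0 ms, 20.2 ms, and 13.5 ms per image, so Agri-DETR adds roughly half a millisecond per frame relative to RT-DETR-R18.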
To further assess real-world performance, qualitative results of several representative models are presented in Figure 9. Yellow circles and arrows are added to highlight representative false positives (FP), false negatives (FN), and duplicate boxes (DB), so that key differences among detectors can be more clearly observed. In relatively simple scenarios (Figure 9a,b), all models achieve comparable detection results, indicating similar baseline recognition capability. However, performance differences become more pronounced in complex scenes. In dense-object scenarios (Figure 9c), RT-DETR-R18 produces duplicate detections, as highlighted by the marked regions, reflecting limitations in handling closely spaced objects. Under challenging conditions such as dust and background clutter (Figure 9d), false positives are more likely to occur. In road scenes (Figure 9e), both YOLO12m and RT-DETR-R18 misclassify background structures as poles, while Faster R-CNN suffers from missed detections. In scenarios involving occlusion and pose variation (Figure 9f), YOLO12m misses partially occluded persons, and RT-DETR-R18 and Faster R-CNN tend to merge adjacent instances. In contrast, Agri-DETR discriminates more reliably between foreground and background and better distinguishes neighboring objects, indicating improved robustness under complex conditions.
Overall, Agri-DETR demonstrates advantages over existing representative detectors from both quantitative and qualitative perspectives. As shown in Table 6 and Figure 8, Agri-DETR achieves the highest AP among all compared methods while maintaining a compact model size and moderate computational cost, indicating a favorable accuracy–complexity trade-off. Compared with Faster R-CNN, Agri-DETR provides substantially higher detection accuracy with far fewer parameters and FLOPs. Compared with YOLO-family detectors, it achieves higher AP and AR while preserving the end-to-end set prediction advantage of the DETR-based framework. Compared with RT-DETR-R18, Agri-DETR further improves detection accuracy while reducing the parameter count by approximately 25%. In addition, the inference speed comparison in Table 7 shows that Agri-DETR maintains a practical FPS level within the RT-DETR-based framework, while substantially reducing model complexity compared with RT-DETR-R50. Furthermore, the qualitative results in Figure 9 show that Agri-DETR improves robustness in suppressing false positives and distinguishing neighboring objects under challenging conditions, including cluttered backgrounds, dense layouts, occlusion, and pose variation. These results indicate that the advantage of Agri-DETR comes from coordinated lightweight feature extraction, multi-scale feature fusion, dynamic upsampling, and geometry-aware localization, rather than simply increasing model capacity.
3.5. Ablation Study
To validate the effectiveness of each proposed component, ablation experiments are conducted under identical experimental settings, and the results are reported in Table 8. Starting from a baseline model built upon RT-DETR with a StarNet backbone, different modules are incrementally introduced, including the S1 high-resolution feature branch, the LSR-C3 feature enhancement module, the EMA-CARAFE upsampling module, and the SSGIoU loss. The contributions of each component are evaluated in terms of overall detection performance and small-object detection capability, where AP reflects overall performance and APS specifically measures small-object accuracy.
Effect of backbone replacement and S1 branch. Replacing the ResNet-18 backbone with the lightweight StarNet substantially reduces the model size and computational cost, with the parameter count decreasing from 19.62 M to 11.73 M and FLOPs decreasing from 59.39 G to 34.03 G. Although the overall AP slightly decreases from 65.31% to 64.97%, the model becomes much more lightweight, providing a more suitable baseline for subsequent structural optimization. Introducing the S1 high-resolution branch increases the parameter count from 11.73 M to 13.00 M and FLOPs from 34.03 G to 50.07 G because higher-resolution spatial features are preserved for feature fusion. This additional computational cost brings an improvement in APS from 26.95% to 27.83%, indicating that the S1 branch is effective in enhancing fine-grained spatial representation for small-object detection.
Effect of LSR-C3. Incorporating the LSR-C3 module further improves both AP and APS in a stable manner. This confirms that LSR-C3 plays a crucial role in enhancing feature representation by introducing large receptive fields and adaptive feature selection. Compared with using S1 alone, LSR-C3 achieves a more balanced improvement across overall performance and small-object detection, highlighting its contribution to both global context modeling and local feature enhancement.
Effect of EMA-CARAFE. Adding the EMA-CARAFE module slightly decreases overall AP but significantly improves APS, achieving the best small-object performance among all configurations. This suggests that EMA-CARAFE mainly benefits fine-grained feature reconstruction, especially for small objects, while introducing minor perturbations that may affect medium and large object detection. The results demonstrate a clear trade-off between global accuracy and small-object sensitivity.
Effect of SSGIoU. Introducing SSGIoU yields the best overall performance, achieving the highest AP, AP50, and AP75 among all configurations without increasing Params or FLOPs, since SSGIoU only modifies the regression loss during training and does not change the inference structure. Compared with the original GIoU loss, SSGIoU improves bounding box regression by incorporating scale-, shape-, and geometry-aware constraints, leading to more accurate localization under complex conditions. The improvement is more evident in overall AP and AP75 than in APS, suggesting that the gain mainly comes from better regression quality rather than enhanced feature representation.
Figure 10 shows the convergence curves of AP and APS. The AP curves highlight clear differences in overall performance across models: all configurations converge rapidly at early stages, while their performance gradually diverges as training proceeds. The inclusion of S1 and LSR-C3 leads to smoother convergence and higher AP plateaus, indicating improved feature representation. After adding EMA-CARAFE, AP shows slight fluctuation, while APS is further improved, suggesting that the module mainly enhances small-object feature reconstruction. With SSGIoU, the model achieves the best final AP and more stable late-stage convergence, indicating more effective refinement of bounding box regression. From the APS curves, improvements in small-object detection are evident from early training stages and persist throughout training. Compared with overall AP, APS is more sensitive to structural modifications, confirming the effectiveness of high-resolution features and feature reassembly mechanisms for small-object detection. The zoomed-in curves show that SSGIoU continues to improve AP during the late training stage, indicating more effective localization refinement.
Overall analysis. The ablation results reveal a clear division of roles among the proposed components. The S1 branch enhances high-resolution spatial information and primarily benefits small-object detection. LSR-C3 serves as the core module for improving overall detection performance through enhanced feature representation. EMA-CARAFE focuses on detail reconstruction and further boosts small-object performance. Finally, SSGIoU improves localization accuracy through better regression optimization. These components jointly form a complementary system, enabling Agri-DETR to achieve consistent improvements in both overall accuracy and small-object detection capability. Although the S1 branch introduces additional FLOPs due to the use of higher-resolution features, the final Agri-DETR still maintains comparable FLOPs to RT-DETR-R18 while using fewer parameters and achieving higher AP, indicating an acceptable overall accuracy–efficiency trade-off. Considering that memory usage and training time are highly dependent on hardware configuration, batch size, mixed-precision settings, and implementation details, this study mainly reports Params, FLOPs, and FPS as more reproducible efficiency indicators.
3.6. Experiments on Key Innovations
To further validate the effectiveness of the proposed key innovations and provide deeper insight into the sources of performance improvement, additional experiments are conducted from the perspectives of feature extraction, target-region response, and regression optimization. Unlike the previous ablation study, which focuses on the incremental performance gains brought by each module, this section emphasizes the specific roles of the proposed techniques in feature representation, target attention, and loss optimization.
To this end, three groups of experiments are performed: backbone comparison, feature-response heatmap analysis, and loss-function comparison. The backbone comparison is designed to evaluate the capability of different feature extractors for multi-scale representation and to justify the final selection of StarNet as the backbone. The feature-response heatmap analysis is used to illustrate how different module combinations affect attention to target regions, thereby revealing the roles of S1, LSR-C3, and EMA-CARAFE in small-object detection. The loss-function comparison further evaluates the effectiveness of the proposed SSGIoU in bounding box regression optimization.
3.6.1. Backbone Comparison Experiment
In object detection, the feature extraction capability of the backbone largely determines the representational upper bound of the subsequent neck and detection head. Therefore, selecting an appropriate backbone is crucial to overall model performance. To evaluate the suitability of different lightweight feature extractors in the proposed framework, multiple backbone networks are integrated into the detector and compared in terms of both feature extraction capability and adaptation potential. Quantitative results are reported in Table 9, while Figure 11 visualizes the feature responses from shallow to deep layers for different backbones.
On the AO-Dataset validation set, different backbones show clear trade-offs between detection accuracy, parameter count, and computational complexity. The baseline, R18vd, which denotes the ResNet-18-vd backbone variant used in RT-DETR, achieves 65.31% AP with stable detection performance, but requires relatively high model capacity (19.62 M parameters and 59.39 G FLOPs), which limits its suitability for lightweight settings. Lightweight backbones such as MobileNetV3-Large and ShuffleNetV2-x2.0 reduce computational cost, but do not surpass the baseline in accuracy. In particular, MobileNetV3-Large shows weaker performance on small objects. These results indicate that existing lightweight designs improve efficiency at the expense of accuracy, and still face challenges in balancing performance and complexity.
Among the compared backbones, EfficientNet achieves the highest AP of 65.91%, reflecting strong feature extraction capability. StarNet yields slightly lower baseline AP, but maintains competitive accuracy with a lower model complexity, indicating a favorable balance between efficiency and performance.
Feature visualization in Figure 11 reveals distinct behaviors across backbones in shallow detail preservation, semantic aggregation, and background suppression. R18vd shows clear responses to machinery contours and local edges in early layers, but also activates strongly on background regions such as field textures and ridge patterns. In deeper layers, target regions become more prominent, yet the responses remain scattered, indicating limited suppression of complex backgrounds.
By comparison, MobileNetV3-Large and ShuffleNetV2-x2.0 offer lower complexity but are more easily affected by dense field textures and repetitive crop structures in shallow layers, leading to weaker target saliency, especially for distant small-scale machinery. EfficientNet preserves object boundaries and local structural details more effectively in shallow layers and forms more concentrated semantic responses in deeper layers, resulting in clearer separation between foreground objects and background regions. StarNet shows slightly weaker response intensity than EfficientNet, but its multi-level feature distribution is more balanced. It preserves texture details in shallow layers while maintaining effective semantic aggregation in deeper layers, enabling efficient lightweight feature representation.
Based on the above results, different lightweight backbones exhibit distinct accuracy–complexity characteristics, and EfficientNet and StarNet in particular offer different advantages. Although EfficientNet achieves the highest baseline AP in the isolated backbone comparison, the backbone selection for Agri-DETR is not determined solely by backbone-level accuracy. Instead, the final selection considers the overall accuracy–efficiency trade-off, compatibility with the proposed high-resolution branch, feature fusion module, and dynamic upsampling module, improvement potential after structural optimization, and lightweight deployment requirements. StarNet has a simple and parameter-efficient structure, and its multi-level feature responses are more balanced across shallow and deep layers, which makes it suitable for integration with the high-resolution S1 branch, LSR-C3 feature fusion module, and EMA-CARAFE upsampling module.
To further compare the adaptability of EfficientNet and StarNet to the proposed modules, both backbones are integrated into the same improved framework. As shown in Table 10, EfficientNet improves from 65.91% to 66.21%, with gains of +0.30 AP and +1.09 APS. In contrast, StarNet improves from 64.97% to 65.99%, yielding larger gains of +1.02 AP and +1.44 APS. These results indicate that although EfficientNet has a stronger isolated baseline, StarNet benefits more from the proposed high-resolution representation and multi-scale feature enhancement modules. Therefore, StarNet was selected as the final backbone because it provides a better balance among module adaptability, improvement potential, parameter efficiency, and lightweight deployment requirements in the complete Agri-DETR framework.
3.6.2. Feature Response Heatmap Analysis
To provide a more intuitive understanding of how the proposed modules influence target perception and feature enhancement, Figure 12 visualizes feature response heatmaps for six representative object categories from the AO-Dataset. Starting from the StarNet backbone, the heatmaps are generated by progressively incorporating the proposed components.
To balance spatial localization and semantic discrimination, the feature maps immediately before the detection head are selected for visualization. These high-level fused features integrate both shallow spatial details and deep semantic information, thereby offering a reliable representation of the model’s attention distribution at the final detection stage.
With the addition of the S1 high-resolution feature branch, the heatmap responses become more spatially uniform and less fragmented. The model produces smoother and more globally consistent attention distributions, replacing scattered local activations. For categories such as Drone, Pole, and Tumbleweed, background noise is reduced and responses over target regions become more stable. This behavior suggests that the S1 branch helps stabilize low-level spatial representations and provides a cleaner foundation for subsequent feature refinement.
After adding the LSR-C3 module, the heatmaps show more structured and semantically aligned responses. Relative to the S1-only configuration, activations become more concentrated on object regions and better follow their geometric structures. For elongated objects such as poles and transmission towers, responses form continuous patterns along the object body, indicating improved modeling of long-range spatial dependencies. For small or low-saliency targets such as drones and persons, activations focus more clearly on key regions instead of spreading across the background. These changes suggest that LSR-C3 improves the spatial organization of feature responses and enhances discriminative representation.
With EMA-CARAFE, the attention distributions become more compact and spatially coherent. High-response regions concentrate more around object centers, while boundary continuity is better preserved. For categories such as Drone, Tumbleweed, and Agricultural Machinery, attention is more localized and structurally complete. For elongated structures such as Pole and Iron Tower, responses show stronger continuity along the object body, with clearer emphasis on boundaries and main structural components. These changes suggest that EMA-CARAFE improves feature reconstruction and spatial consistency, leading to more precise and stable attention over target regions.
Taken together, the heatmap results reveal a clear progressive refinement process across the proposed modules. The S1 branch stabilizes spatial representations, LSR-C3 enhances discriminative structure-aware responses, and EMA-CARAFE further refines spatial coherence and completeness. This progressive improvement leads to more accurate target localization and improved background suppression across all object categories, with visual results consistent with the quantitative gains observed in the ablation study.
3.6.3. Loss Function Comparison
This section further investigates the impact of different bounding box regression losses on detection performance. To provide a more systematic evaluation of the proposed SSGIoU loss, experiments are conducted under identical network architecture, training strategy, dataset split, and evaluation settings. In addition to GIoU and WIoU [40], three widely used IoU-based losses, namely CIoU [41], EIoU [42], and α-IoU [43], are introduced as additional comparison baselines. Quantitative results are summarized in Table 11. To further analyze the contribution of different components in SSGIoU, a component-level ablation study is reported in Table 12. In addition, qualitative comparisons of predicted bounding boxes are provided in Figure 13 to visually assess regression quality under different loss functions.
As shown in Table 11, SSGIoU achieves the best overall AP, AP50, and AP75 among all compared loss functions. Compared with GIoU, SSGIoU improves AP from 65.41% to 65.99%, and improves AP75 from 71.61% to 72.60%. Compared with CIoU, EIoU, and α-IoU, SSGIoU also obtains higher overall AP and higher accuracy at the stricter IoU threshold. This indicates that the proposed scale–shape–geometry-aware regression design provides more effective bounding box supervision for the AO-Dataset, especially under stricter IoU thresholds.
The scale-wise results also show that different loss functions exhibit different behaviors across object sizes. GIoU obtains the highest APS, while EIoU achieves the highest APM. However, these losses do not provide the best overall AP or AP75. In contrast, SSGIoU achieves the highest APL and the best overall localization performance, suggesting that its main advantage lies in improving global regression quality rather than optimizing only a single object scale. This property is important for agricultural obstacle detection, where objects present large variations in size, shape, and aspect ratio.
To further identify the source of the performance gain, Table 12 decomposes SSGIoU into two variants. SSGIoU-Geometry retains only the geometric constraint terms, including center alignment and logarithmic shape constraints, while SSGIoU-ScaleWeight retains only the adaptive scale weighting term. The results show that neither the geometric constraint term nor the scale weighting term alone achieves the performance of the complete SSGIoU. In contrast, the complete SSGIoU obtains the best AP, AP75, APS, APM, and APL among the three variants. This indicates that the improvement does not originate from a single isolated component, but from the complementary effect between geometric constraints and scale-aware weighting. Specifically, the geometric terms regularize center alignment and shape consistency, while the scale weighting term adjusts the contribution of regression samples with different object sizes. When jointly applied, these two mechanisms provide more stable and accurate bounding box regression supervision.
The above results also explain why SSGIoU outperforms general-purpose IoU losses such as GIoU, CIoU, EIoU, WIoU, and α-IoU. Standard GIoU improves IoU optimization by introducing the smallest enclosing box, CIoU further considers center distance and aspect-ratio consistency, EIoU explicitly decouples width and height errors, WIoU adjusts sample contributions through dynamic weighting, and α-IoU modifies the IoU term through a power transformation to improve convergence behavior. However, these losses do not explicitly consider the combined influence of scale variation, slender-object shape characteristics, and geometric alignment in agricultural obstacle detection. In contrast, SSGIoU jointly models scale sensitivity, shape consistency, and geometric alignment, making the regression process more suitable for complex agricultural targets such as slender poles, iron towers, and distant small obstacles. This explains the improvement in AP and AP75 shown in Table 11.
Figure 13 presents qualitative comparisons of localization results under representative bounding box regression losses, including GIoU, CIoU, α-IoU, and the proposed SSGIoU. Yellow circles and arrows are used to highlight representative localization errors for visual comparison. For clarity, only representative losses are shown in the figure, while the complete quantitative comparison, including EIoU and WIoU, is provided in Table 11. These qualitative examples further illustrate that SSGIoU can produce more complete and geometrically consistent bounding boxes under challenging agricultural scenarios involving slender structures, small targets, and mixed object scales.
In Figure 13a, the iron tower has a slender structure and its lower part is severely occluded by background vegetation, making complete localization difficult. GIoU and α-IoU detect the iron tower but fail to fully cover the lower occluded region, resulting in incomplete bounding boxes, as indicated by IB. In contrast, SSGIoU provides more complete coverage of the elongated structure and better preserves the geometric extent of the target. This suggests that the shape- and geometry-aware constraints in SSGIoU help improve the localization completeness of slender agricultural obstacles such as poles and iron towers.
In Figure 13b, multiple drones appear at relatively small scales and are distributed close to each other. CIoU produces duplicate detections for neighboring drone targets, as indicated by DB, while α-IoU tends to merge two adjacent drones into a single detection, as indicated by MO. These results show that small-object localization and instance separation are sensitive to the choice of regression loss. Compared with these losses, SSGIoU better separates neighboring drones and produces more stable bounding boxes, suggesting improved localization consistency for small agricultural targets.
In Figure 13c, the image contains a large drone, a slender pole, and a person, which increases the difficulty of scale-aware localization. GIoU and CIoU generate incomplete bounding boxes for the drone, as indicated by IB. For α-IoU, the pole is missed, as indicated by FN, and the drone is also incompletely localized. In contrast, SSGIoU maintains more complete drone localization while preserving the detection of the slender pole and the person. Overall, the qualitative results show that SSGIoU improves bounding box completeness, reduces duplicate or merged detections, and enhances localization robustness for agricultural targets with different scales and shapes.
Overall, the quantitative and qualitative results consistently demonstrate the effectiveness of SSGIoU for bounding box regression in agricultural obstacle detection. The quantitative comparison shows that SSGIoU achieves the best AP, AP50, and AP75 among the compared loss functions, while the component-level ablation confirms that the geometric constraints and scale-aware weighting mechanism work in a complementary manner. The qualitative results in Figure 13 further show that SSGIoU improves bounding box completeness, reduces duplicate or merged detections, and enhances localization robustness for slender structures, small targets, and mixed-scale objects. These results indicate that the scale–shape–geometry-aware design provides more stable and accurate localization supervision for complex agricultural scenarios.
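The distinction between AP50 and AP75 can be illustrated with a short sketch of the IoU threshold test that decides whether a prediction counts as a match. The example is deliberately simplified and assumes a single prediction and a single ground-truth box; full COCO-style evaluation additionally ranks predictions by confidence, enforces one-to-one matching, and integrates precision over recall. The box values and function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# COCO-style AP averages over ten IoU thresholds: 0.50, 0.55, ..., 0.95.
COCO_THRESHOLDS = [t / 100 for t in range(50, 100, 5)]

def is_match(iou_value, threshold):
    """A prediction is counted as a match when its IoU reaches the threshold."""
    return iou_value >= threshold

# Hypothetical incomplete box: the prediction covers 64% of the target height.
gt_box = (0, 0, 100, 100)
pred_box = (0, 0, 100, 64)

overlap = iou(pred_box, gt_box)
hits = [t for t in COCO_THRESHOLDS if is_match(overlap, t)]

print(f"IoU = {overlap:.2f}")
print(f"match at 0.50: {is_match(overlap, 0.50)}")
print(f"match at 0.75: {is_match(overlap, 0.75)}")
print(f"thresholds satisfied: {len(hits)} of {len(COCO_THRESHOLDS)}")
```

An incomplete box of this kind still contributes to AP50 but is rejected at the 0.75 threshold, which is why losses that improve box completeness tend to show their benefit more clearly in AP75 and in the averaged AP.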
3.7. Generalization Study
Additional experiments are conducted on the COCO2017 dataset to assess the applicability of the proposed method to general object detection tasks and its cross-domain generalization capability. The same network architecture and training configuration are used as described previously. The model is trained on the COCO2017 train2017 set and evaluated on the val2017 set.
Several representative state-of-the-art detectors are included for comparison, covering YOLO series models, RT-DETR variants, and classical DETR-based approaches [19, 44]. The evaluation considers both detection accuracy and model complexity. A scatter plot in Figure 14 further illustrates the trade-off between parameter count and AP, where each point represents a specific detector.
As reported in Table 13, Agri-DETR achieves 48.3% AP on the COCO2017 validation set with only 14.5 M parameters, suggesting effective generalization under a lightweight design. Compared with lightweight YOLO models, Agri-DETR outperforms YOLOv8s, YOLO11s, and YOLO12s, indicating that the proposed design maintains competitive feature representation ability even at a relatively small model scale.
To further illustrate the relationship between model accuracy and parameter complexity, Figure 14 visualizes the AP–Params trade-off of different detectors on the COCO2017 validation set. As shown in the figure, Agri-DETR is located in a relatively favorable region, with fewer parameters than most medium-scale detectors and higher AP than several lightweight models. In particular, compared with RT-DETR-R18, Agri-DETR obtains better accuracy with fewer parameters, demonstrating that the proposed architectural changes improve parameter efficiency within the end-to-end detection framework.
Although Agri-DETR does not surpass some medium-scale models in absolute AP, such as YOLO12m and RT-DETR-R50, these models rely on considerably larger parameter budgets. For example, RT-DETR-R50 achieves higher AP but requires 41.5 M parameters, while Agri-DETR uses only 14.5 M parameters. This trade-off indicates that Agri-DETR is not designed to maximize accuracy by simply increasing model capacity, but instead aims to achieve a more balanced accuracy–complexity relationship.
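As a quick worked check of the parameter budgets quoted above, the relative saving can be computed directly from the two reported counts. The helper name and variable names below are illustrative only.

```python
def relative_reduction(small, large):
    """Fraction by which `small` is lower than `large` (both in the same unit)."""
    if large <= 0:
        raise ValueError("reference value must be positive")
    return 1.0 - small / large

# Parameter counts in millions, as reported in the text.
AGRI_DETR_PARAMS_M = 14.5
RT_DETR_R50_PARAMS_M = 41.5

reduction = relative_reduction(AGRI_DETR_PARAMS_M, RT_DETR_R50_PARAMS_M)
ratio = RT_DETR_R50_PARAMS_M / AGRI_DETR_PARAMS_M

print(f"Agri-DETR uses {reduction:.1%} fewer parameters than RT-DETR-R50")
print(f"RT-DETR-R50 is {ratio:.2f}x larger")
```

With these figures, Agri-DETR uses roughly 65% fewer parameters than RT-DETR-R50, which is about 2.9 times larger.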
Furthermore, compared with classical DETR-based methods, including DETR-R50 and Deformable DETR, Agri-DETR achieves better detection performance under a much smaller parameter scale. This suggests that the proposed method is not only effective on the domain-specific agricultural obstacle dataset, but also retains strong generalization ability on large-scale and diverse benchmarks such as COCO2017. Overall, the quantitative results and the trade-off visualization jointly demonstrate that Agri-DETR maintains a favorable balance between accuracy and efficiency across datasets.
4. Discussion
Agri-DETR achieves competitive performance for obstacle detection in unstructured agricultural environments while reducing its reliance on large computational resources. This is mainly attributed to the coordinated design of multiple components. The lightweight StarNet backbone improves efficiency without sacrificing representational capacity, while the LSR-C3 module enhances spatial selectivity and multi-scale contextual modeling under complex backgrounds. In addition, the EMA-CARAFE module refines high-resolution feature reconstruction during upsampling, and the SSGIoU loss improves localization accuracy by incorporating geometric constraints. Together, these designs enable a favorable balance between detection accuracy and computational efficiency. The inference speed analysis further indicates that Agri-DETR maintains a practical FPS level within the RT-DETR-based framework under the current desktop GPU setting, suggesting its potential for real-time agricultural perception.
As demonstrated in the experimental results, Agri-DETR consistently outperforms both the baseline and competing methods across multiple evaluation metrics on the self-constructed AO-Dataset. In particular, APS improves by 1.4% compared with the baseline, indicating enhanced sensitivity to small objects, which is critical in agricultural environments. Meanwhile, the gains in AP75 suggest more accurate localization under stricter IoU thresholds. The stable performance in AR further reflects robust recall under complex conditions. In addition, cross-dataset generalization experiments on the COCO dataset show that Agri-DETR consistently outperforms the baseline and remains competitive with mainstream detection models. These results verify that the proposed improvements contribute to more effective feature representation and spatial modeling, rather than relying on increased model capacity.
Despite these advantages, several issues should be further considered for real-world agricultural deployment. Although the current validation is mainly based on benchmark experiments, the AO-Dataset was constructed to reflect representative visual challenges encountered in agricultural environments, including pedestrians, agricultural machinery, drones, poles, iron towers, tumbleweeds, cluttered field backgrounds, scale variation, small objects, slender structures, and irregular obstacles. Therefore, the results on the AO-Dataset provide benchmark-level evidence for evaluating the potential applicability of Agri-DETR in real agricultural scenarios. In addition, the cross-dataset evaluation on COCO2017 indicates that the proposed framework has a certain degree of generalization beyond the self-constructed dataset. Nevertheless, benchmark performance cannot fully represent long-term field deployment performance. When the model is applied to different regions, crop types, growth stages, seasons, soil conditions, obstacle scales, or rare field events, domain shift may occur because of changes in background texture, obstacle appearance, illumination, occlusion patterns, and object frequency. Such distribution changes may lead to reduced detection accuracy, especially for rare obstacles, distant small objects, heavily occluded targets, or scenarios that are underrepresented in the training data. This issue may be further amplified by the long-tailed class distribution of the AO-Dataset. Although class-aware augmentation is used to increase the diversity of minority-class samples, categories such as Drone still contain fewer training instances than dominant categories, which may increase the risk of overfitting and reduce the generalization ability of the detector for rare obstacles. 
To alleviate the influence of obstacle size variation, Agri-DETR introduces a high-resolution S1 branch for preserving fine-grained spatial details, LSR-C3 for contextual and large-receptive-field modeling, EMA-CARAFE for multi-scale feature reconstruction, and SSGIoU for scale-, shape-, and geometry-aware bounding box regression. These designs help improve multi-scale detection, as reflected by the improvement in APS compared with the baseline. However, extremely small or visually ambiguous obstacles may still remain challenging. Therefore, large-scale field experiments and continuous dataset expansion across diverse regions, crops, seasons, obstacle scales, and rare events are still required to further validate and improve the robustness and reliability of Agri-DETR under practical operating conditions. Future work will also consider synthetic data generation, semi-supervised learning, active learning, and rare-category field data collection to enrich minority categories and reduce the influence of class imbalance.
In addition, the current framework relies solely on RGB imagery, and its performance may degrade under low-visibility conditions such as nighttime scenes, low illumination, heavy dust, rain, fog, strong backlight, or severe overexposure. This is because RGB images provide limited visual cues when object boundaries, texture details, and contrast are weakened. The data augmentation strategies used in this study, including color jitter, Gaussian blur, and salt-and-pepper noise, can partially improve robustness to illumination variation, blur, and sensor noise. Therefore, Agri-DETR is expected to maintain relatively stable performance under normal and moderately degraded visual conditions, but its detection accuracy may decrease under severe poor-visibility conditions. Incorporating multi-modal sensing, such as LiDAR, depth cameras, thermal imaging, or radar, could further improve perception robustness under adverse weather and low-light environments.
Although Agri-DETR is relatively lightweight, with 14.5 M parameters and an inference speed of 74.08 FPS under the current desktop GPU evaluation setting, this result mainly indicates its potential for real-time agricultural perception rather than direct proof of embedded deployment performance. Future work will further evaluate Agri-DETR on agricultural edge devices, such as NVIDIA Jetson platforms or industrial embedded computers, combined with TensorRT acceleration, FP16/INT8 inference, model pruning, quantization, and hardware-aware optimization.
Furthermore, the current Agri-DETR framework follows a fixed-category detection setting and is trained on six predefined obstacle categories. Therefore, when new or unexpected obstacles appear in real-world fields, the model may classify them as visually similar known categories or fail to detect them if their appearance differs significantly from the training distribution. This is a common limitation of closed-set object detectors. To improve adaptability to dynamic and unseen agricultural scenarios, future work will extend Agri-DETR toward open-set detection, open-vocabulary detection, anomaly detection, incremental learning, and continuous dataset expansion.
5. Conclusions
This study proposed Agri-DETR, an efficient end-to-end visual obstacle detection framework for intelligent agricultural machinery operating in unstructured agricultural environments. By integrating a lightweight backbone with a high-resolution feature branch, the LSR-C3 feature fusion module, the EMA-CARAFE dynamic upsampling module, and the SSGIoU regression loss, the proposed framework improves feature representation, multi-scale spatial reconstruction, and bounding box localization while maintaining a compact model structure. In addition, the constructed AO-Dataset provides a dedicated benchmark for evaluating agricultural obstacle detection under complex field-like conditions.
Experimental results demonstrate that Agri-DETR achieves 66.0% AP, 91.5% AP50, and 72.6% AP75 on the AO-Dataset with only 14.5 M parameters. Compared with RT-DETR-R18, Agri-DETR reduces the parameter count by approximately 25% while achieving higher detection accuracy and maintaining a comparable inference speed of 74.08 FPS. The improvement in APS further indicates that the proposed design enhances the detection of small agricultural obstacles. Cross-dataset evaluation on COCO2017 also shows that Agri-DETR achieves 48.3% AP, suggesting favorable generalization capability beyond the agricultural domain.
Although promising results have been obtained, this study still has several limitations. The current evaluation is mainly based on benchmark experiments, and large-scale long-term field deployment under diverse weather, illumination, and crop growth conditions has not yet been conducted. Future work will focus on improving cross-domain robustness, validating the model on real agricultural machinery platforms, and extending the framework toward multi-modal perception by integrating vision, LiDAR, and positioning sensors for autonomous agricultural navigation.