Farmland Obstacle Detection from the Perspective of UAVs Based on Non-local Deformable DETR

Abstract: In precision agriculture, unmanned aerial vehicles (UAVs) are playing an increasingly important role in farmland information acquisition and fine management. However, discrete obstacles in the farmland environment, such as trees and power lines, pose serious threats to the flight safety of UAVs. Real-time detection of the attributes of obstacles is urgently needed to ensure their flight safety. In the wake of rapid development of deep learning, object detection algorithms based on convolutional neural networks (CNN) and transformer architectures have achieved remarkable results. Detection Transformer (DETR) and Deformable DETR combine CNN and transformer to achieve end-to-end object detection. The goal of this work is to use Deformable DETR for the task of farmland obstacle detection from the perspective of UAVs. However, limited by local receptive fields and local self-attention mechanisms, Deformable DETR lacks the ability to capture long-range dependencies to some extent. Inspired by non-local neural networks, we introduce the global modeling capability to the front-end ResNet to further improve the overall performance of Deformable DETR. We refer to the improved version as Non-local Deformable DETR. We evaluate the performance of Non-local Deformable DETR for farmland obstacle detection through comparative experiments on our proposed dataset. The results show that, compared with the original Deformable DETR network, the mAP value of the Non-local Deformable DETR is increased from 71.3% to 78.0%. Additionally, Non-local Deformable DETR also presents great performance for detecting small and slender objects. We hope this work can provide a solution to the flight safety problems encountered by UAVs in unstructured farmland environments.


Introduction
With the development of agricultural robot technology, UAVs are becoming an important part of global agricultural aviation [1]. Specifically, UAVs with high-performance onboard sensors and task-specific action systems have been successfully deployed in farmland information collection and fine management [2][3][4][5]. However, the advantages and performance of UAVs have not yet been fully realized. One of the main reasons is that randomly distributed obstacles, such as trees, poles, buildings, people and power towers, pose a serious threat to their flight safety and operational efficiency [6]. Image sensors are widely used as the eyes of UAVs [7], so giving them human-like intelligent environmental awareness is an intuitive solution. How to quickly and accurately detect objects of interest in information-rich images remains a technical bottleneck [8].
Previously, researchers have used monocular cameras [9], stereo cameras [10], event cameras [11] and other sensors to detect obstacles based on various image processing techniques. Recently, deep learning neural networks have been used in the obstacle Conformer [49] and CNNs meet transformers (CMT [50]). Specifically, DETR is the first end-to-end baseline network for deploying transformers in object detection. Unlike R-CNN and YOLO, DETR regards object detection as a direct set prediction problem and simplifies the detection pipeline by dropping hand-crafted components such as anchor generation and non-maximum suppression. DETR uses ResNet [51] to extract image features, then outputs 100 prediction results in parallel based on the transformer encoder-decoder architecture and finally determines the predicted classes and bounding boxes through bipartite matching. Although DETR significantly outperforms competitive baselines, it still has three problems. First, compared to existing object detection methods, DETR requires more epochs to converge. Second, DETR shows insufficient detection performance for small objects. Lastly, the computational complexity of DETR is still sensitive to the resolution of the image or feature map. To address these issues, Deformable DETR introduces the ideas of deformable convolution and multi-scale feature maps to form the so-called Multi-scale Deformable Attention Module. The experimental results show that Deformable DETR not only alleviates the slow convergence and high computational complexity of DETR, but also achieves better performance than DETR.
Random and discrete obstacles in the natural farmland environment pose a direct threat to the flight safety of UAVs. Usually, the images captured by the UAV's onboard camera contain a lot of background noise, which increases the difficulty of obstacle detection. In this paper, we deploy a modified Deformable DETR for the task of agricultural UAV-based farmland obstacle detection. In Deformable DETR, the ResNet-style CNN architecture models the spatial and local features of input images, while the transformer builds the long-distance dependencies. However, the global modeling ability of Deformable DETR is still insufficient for detecting small farmland objects. The motivation of this work is to further improve the global modeling capability of Deformable DETR by introducing global modeling into the front-end CNN. We achieve this by inserting a Non-Local module into the CNN feature extraction network in the Deformable DETR front-end. The main reason is that the non-local operation can capture long-range dependencies by computing the response at a location as a weighted sum of the features at all locations in the input feature map. Our proposed Non-local Deformable DETR combines the local feature extraction ability of CNN, the global modeling ability of the non-local operation and the self-attention mechanism of the transformer to improve object detection accuracy while maintaining the efficiency of the Deformable DETR model.

Dataset
The dataset proposed in our previous work [6], which contained 3700 samples, served as the basis for this study. It covers six categories: tree, wire poles, building, power tower, UAVs and person. In this work, we collected more images containing obstacles through various methods (manual photography, UAV photography and web search) and added them to the raw dataset. In the preprocessing stage, we manually cleaned the raw dataset to remove low-quality samples. In addition, we resized the images of different resolutions to the same resolution through a cropping operation. As shown in Figure 1, our dataset contains six classes of typical obstacles that are common in farmland. The percentages of tree, wire poles, building, power tower, UAVs and person are 14.48%, 15.44%, 16.81%, 15.99%, 15.40% and 21.87%, respectively. There are a total of 6000 images, each with a resolution of 416 × 416. All 11,578 objects in our dataset were annotated with Labelme [52]. We randomly selected 4800 images as the training set, 600 images as the validation set and 600 images as the test set, a ratio of 8:1:1.
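The random 8:1:1 split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `split_dataset`, the fixed seed and the file-name pattern are assumptions introduced here for reproducibility.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into train/val/test by integer ratios.

    The seed and the ratio handling are illustrative assumptions; the paper
    only states that the split is random with a ratio of 8:1:1.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical file names standing in for the 6000 images of the dataset.
image_ids = [f"img_{i:04d}.jpg" for i in range(6000)]
train_ids, val_ids, test_ids = split_dataset(image_ids)
print(len(train_ids), len(val_ids), len(test_ids))  # 4800 600 600
```

Fixing the seed keeps the three subsets stable across runs, which matters when several detectors are compared on the same test set.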

Deformable DETR
Without the need for hand-designed components such as NMS or anchors, DETR can predict the final set of detections in parallel by combining a common CNN with a transformer architecture. However, DETR requires a long training time to converge and has relatively poor performance for small object detection. To solve these two issues, Zhu et al. [40] introduced the ideas of deformable convolution and multi-scale features from convolutional neural networks into DETR and proposed Deformable DETR. Deformable DETR uses ResNet-50 [51] as the backbone to extract multi-scale features. The deformable transformer (encoder and decoder) extracts and strengthens features from the output feature maps of stages C3-C5 in ResNet by using the multi-scale deformable attention module.
The core of Deformable DETR is the deformable attention module and the multi-scale deformable attention module. The deformable attention module is a local attention mechanism, which means it only attends to a small set of key sampling points around the reference point, independent of the spatial size of the feature map [40]. Given an input feature map $x \in \mathbb{R}^{C \times H \times W}$ and a query element $q$ with content feature $z_q$ and 2D reference point $p_q$, the deformable attention feature is calculated by:

$$\mathrm{DeformAttn}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m \, x(p_q + \Delta p_{mqk}) \right] \tag{1}$$

where $m$ indexes the attention head, $k$ indexes the sampled keys, $K$ is the total number of sampled keys ($K \ll HW$), $\Delta p_{mqk}$ is the sampling offset and $A_{mqk}$ is the attention weight of the $k$-th sampling point in the $m$-th attention head; $W_m$ and $W'_m$ are the learnable projection matrices of the $m$-th head as defined in [40].
Combining the deformable attention module with multi-scale feature maps forms the multi-scale deformable attention module. Given the input multi-scale feature maps $\{x^l\}_{l=1}^{L}$, where $x^l \in \mathbb{R}^{C \times H_l \times W_l}$, let $\hat{p}_q \in [0, 1]^2$ be the normalized coordinates of the reference point. The multi-scale deformable attention feature can be calculated by:

$$\mathrm{MSDeformAttn}(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L}) = \sum_{m=1}^{M} W_m \left[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W'_m \, x^l(\phi_l(\hat{p}_q) + \Delta p_{mlqk}) \right] \tag{2}$$

where $l$ indexes the input feature level, $\hat{p}_q$ is the normalized coordinates of the reference point, $\Delta p_{mlqk}$ is the sampling offset of the $k$-th sampling point in the $l$-th feature level and the $m$-th attention head and $A_{mlqk}$ is the attention weight of the $k$-th sampling point in the $l$-th feature level and the $m$-th attention head. $\phi_l(\hat{p}_q)$ rescales the normalized coordinates $\hat{p}_q$ to the input feature map of the $l$-th level. Compared to DETR, Deformable DETR replaces the multi-head attention module in the transformer encoder with the multi-scale deformable attention module and replaces the cross-attention module in the transformer decoder with the multi-scale deformable cross-attention module. The self-attention module in the transformer decoder remains unchanged.
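To make the sampling idea behind deformable attention concrete, the following sketch evaluates a heavily simplified version of it: one head, one feature level and a single-channel feature map, with the projection matrices omitted. The function names (`bilinear_sample`, `deform_attn_single_head`), the zero padding outside the map and the fact that offsets and attention logits are passed in directly (in Deformable DETR they are predicted from the query feature) are all assumptions made for illustration.

```python
import math


def bilinear_sample(feature, x, y):
    """Bilinearly interpolate a 2D map (list of rows) at fractional (x, y).

    Neighbours outside the map contribute zero (zero padding, an assumption).
    """
    height, width = len(feature), len(feature[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < width and 0 <= yi < height:
                weight = (1 - abs(x - xi)) * (1 - abs(y - yi))
                value += weight * feature[yi][xi]
    return value


def softmax(logits):
    """Normalize logits so that the K attention weights sum to one."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deform_attn_single_head(feature, ref_point, offsets, attn_logits):
    """Weighted sum of K sampled values around the reference point.

    Only K points are read, regardless of the size of the feature map.
    """
    weights = softmax(attn_logits)
    px, py = ref_point
    return sum(
        a * bilinear_sample(feature, px + dx, py + dy)
        for a, (dx, dy) in zip(weights, offsets)
    )


feature_map = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
offsets = [(0.0, 0.0), (0.5, 0.0), (0.0, -1.0), (-1.0, 1.0)]  # K = 4
print(deform_attn_single_head(feature_map, (1.0, 1.0), offsets, [0.0] * 4))
```

With equal logits each of the four samples receives weight 0.25, so the result is simply the mean of the four interpolated values.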

ResNet
Both DETR and Deformable DETR use ResNet to extract original feature maps. ResNet is a popular backbone in many state-of-the-art deep learning algorithms. The basic idea of ResNet is to introduce a "shortcut connection" that can skip one or more layers to solve the model degradation problem. As shown in Figure 2, the residual block uses the shortcut connection to perform identity mapping, which connects the input x with the F(x) obtained through the stacked weight layers, without adding additional parameters or increasing the computational complexity. When x and F are of the same dimension, the output is given by:

$$y = F(x, \{W_i\}) + x \tag{3}$$

where x and y are the input and output vectors of the residual block and $F(x, \{W_i\})$ is the residual mapping to be learned. When the dimensions of x and F are different, the input x needs to be projected to match the dimensions by:

$$y = F(x, \{W_i\}) + W_s x \tag{4}$$

where $W_s$ is the linear mapping function.
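The two shortcut cases can be illustrated with a toy residual block on plain vectors. This is only a sketch: the two-layer form of F with a ReLU in between, and the names `residual_block`, `w1`, `w2` and `w_s`, are assumptions chosen for brevity, whereas the blocks in ResNet-50 use convolutions and normalization layers.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


def residual_block(x, w1, w2, w_s=None):
    """Return F(x) + x, or F(x) + W_s x when a projection shortcut is given."""
    fx = matvec(w2, relu(matvec(w1, x)))  # residual mapping F(x, {W_i})
    shortcut = x if w_s is None else matvec(w_s, x)
    if len(shortcut) != len(fx):
        raise ValueError("shortcut and residual branch must have equal length")
    return [f + s for f, s in zip(fx, shortcut)]


identity = [[1.0, 0.0], [0.0, 1.0]]
print(residual_block([1.0, -2.0], identity, identity))
```

If the weights of F are all zero the block reduces to the identity mapping, which is why stacking such blocks does not make the network harder to optimize than a shallower one.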

Non-Local Neural Networks
Traditional convolution operations lack the ability of global modeling due to the limitation of local receptive fields. Long-range dependencies are usually achieved through hierarchical convolution and pooling. Inspired by the self-attention mechanism in NLP, non-local neural networks introduce self-attention to CNN to capture long-distance dependencies in the feature extraction process. A generic non-local operation in deep neural networks is defined as:

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \tag{5}$$

where x is the input feature, y is the corresponding output feature, i is the index of the output position, j is the index of all possible positions in the feature, f is the function (Embedded Gaussian) that calculates the relationship between i and all j, g is the function that computes the representation of the input signal at position j and C(x) is a factor that normalizes the response. Non-local operations can be implemented in the form of non-local blocks, which means they can be easily plugged into conventional convolutional layers within standard networks. Based on Equation (5), the non-local block is defined as:

$$z_i = W_z y_i + x_i \tag{6}$$

where "$+x_i$" denotes the residual shortcut connection and $W_z y_i$ represents a linear transformation. An example of a non-local block is shown in Figure 3. $W_v$, $W_k$, $W_q$ and $W_z$ are weight matrices to be learned, "⊕" denotes the element-wise sum after the shortcut connection and "⊗" denotes matrix multiplication.
Agriculture 2022, 12, 1983
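The non-local operation and block described above can be sketched in a few lines on a flattened feature map, where each of the N positions carries a small feature vector. The sketch assumes the embedded Gaussian form for f (an exponential of the dot product between the query and key embeddings) with C(x) as the sum of f over all j; the function name `non_local_block` and the use of plain matrices instead of 1 × 1 convolutions are illustrative assumptions.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def non_local_block(features, w_q, w_k, w_v, w_z):
    """Apply z_i = W_z y_i + x_i to a list of N position feature vectors."""
    queries = [matvec(w_q, x) for x in features]
    keys = [matvec(w_k, x) for x in features]
    values = [matvec(w_v, x) for x in features]  # g(x_j)
    outputs = []
    for x_i, q_i in zip(features, queries):
        scores = [sum(a * b for a, b in zip(q_i, k_j)) for k_j in keys]
        peak = max(scores)  # subtracting the peak leaves f / C(x) unchanged
        f = [math.exp(s - peak) for s in scores]
        norm = sum(f)  # normalization factor C(x)
        y_i = [
            sum(f_j * v_j[c] for f_j, v_j in zip(f, values)) / norm
            for c in range(len(values[0]))
        ]
        z_i = [z + x for z, x in zip(matvec(w_z, y_i), x_i)]  # shortcut
        outputs.append(z_i)
    return outputs


eye = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
# With zero query/key weights every position attends uniformly to all others.
print(non_local_block([[1.0, 0.0], [3.0, 4.0]], zero, zero, eye, eye))
```

Every output position depends on all input positions, which is the global modeling property that local convolutions lack; the cost is a pairwise score for every pair of positions.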

Non-Local Deformable DETR
In Deformable DETR, convolution operations in ResNet architecture capture multiscale local features and the encoder-decoder in the transformer architecture conducts local self-attention. Therefore, Deformable DETR lacks the ability to learn global representations over long distances. Based on the non-local structure, we introduce the global modeling capability to the front-end ResNet to further improve the overall performance of Deformable DETR.
As shown in Figure 4, non-local blocks are inserted into all the residual blocks in Stages 4 and 5 of ResNet-50. Specifically, in each optimized residual block, the non-local block is added after the 3 × 3 convolution layer to establish long-distance dependencies and improve the feature extraction ability of the model. The transformer architecture in Deformable DETR remains unchanged. The overall structure of Non-local Deformable DETR is shown in Figure 5.

The Overview of Data Flow
In this paper, we improved Deformable DETR with non-local blocks to enhance the detection accuracy of farmland obstacles; an overview of the data flow is shown in Figure 6. First, the raw dataset was cleaned and cropped into the pre-processed dataset, which was then divided into a training set, a validation set and a test set with a ratio of 8:1:1. Second, we used the training set and validation set to train the proposed Non-local Deformable DETR. Finally, the test set was used to evaluate the model's prediction performance.

Evaluation Metrics
In this study, AP and mAP were used to evaluate the performance of the model, as defined in Equations (7) and (8):

$$AP = \int_0^1 P(R)\, dR \tag{7}$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{8}$$

where AP indicates the average precision of a single category, mAP indicates the average of the AP values over all N categories, P represents the precision, which can be calculated by Equation (9), R is the recall, which can be obtained by Equation (10), and P(R) denotes precision as a function of recall:

$$P = \frac{TP}{TP + FP} \tag{9}$$

$$R = \frac{TP}{TP + FN} \tag{10}$$

where TP (True Positive) indicates the number of positive samples that are correctly predicted as positive, FP (False Positive) represents the number of samples that the model predicts as positive but which are actually negative, FN (False Negative) means the number of samples that are actually positive but are classified as negative and TN (True Negative) stands for the number of negative samples that are correctly classified as negative.
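As a worked illustration of these metrics, the sketch below computes precision and recall from counts and approximates the area under the P(R) curve for one category. The paper does not state which interpolation scheme was used, so the all-point interpolation here (precision replaced by its running maximum towards higher recall) is an assumption, as are the function names and the toy inputs.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one category.

    is_true_positive lists detections sorted by descending confidence;
    each entry says whether that detection matched a ground-truth object.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    tp = fp = 0
    points = []  # (recall, precision) after each detection
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_ground_truth, tp / (tp + fp)))
    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        envelope = max(p for _, p in points[i:])  # monotone precision envelope
        ap += (recall - prev_recall) * envelope
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """Average the per-category AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


print(precision_recall(tp=8, fp=2, fn=2))
print(average_precision([True, False, True], num_ground_truth=2))
```

In the toy call, the first and third detections are correct and there are two ground-truth objects, so the curve covers recall 0 to 0.5 at precision 1 and recall 0.5 to 1 at precision 2/3.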

Implementation Details
The configuration of the computer used for algorithm development is as follows: the central processing unit (CPU) is an Intel Core i9-12900K; the graphics processing unit (GPU) is an NVIDIA GeForce RTX 3090 Ti with 24 GB of on-board memory; the physical memory is DDR5 5200 (16 GB); the operating system is Ubuntu 20.04 LTS; and the PyTorch deep learning framework is used to build, train and validate the Non-local Deformable DETR.
Considering the model training effect and experimental conditions, this paper adopts the transfer learning training strategy. The backbone network is initialized with ResNet-50 weights pretrained on ImageNet. Training epochs and iterations are set to 50 and 1200, respectively. In order to avoid the instability of the model caused by a large learning rate at the beginning of training, a warmup strategy is adopted to adjust the learning rate. In the initial 500 iterations, the learning rate is gradually adjusted from $2.4 \times 10^{-4}$ to $2.5 \times 10^{-3}$. The momentum factor is 0.9 and the weight decay coefficient is $1 \times 10^{-4}$.
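The warmup schedule can be sketched as a simple function of the iteration index. The paper only says the learning rate is "gradually adjusted" over the first 500 iterations, so the linear ramp below is an assumption, and `warmup_lr` is a hypothetical helper rather than part of any training framework.

```python
def warmup_lr(iteration, start_lr=2.4e-4, target_lr=2.5e-3, warmup_iters=500):
    """Learning rate at a given iteration under an assumed linear warmup."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    if iteration >= warmup_iters:
        return target_lr  # warmup finished, hold the target rate
    fraction = iteration / warmup_iters
    return start_lr + fraction * (target_lr - start_lr)


for step in (0, 250, 500, 1000):
    print(step, warmup_lr(step))
```

Starting from a rate roughly ten times smaller than the target limits the size of the first updates, when the newly added layers are still far from a useful solution.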

Results and Analysis
Focusing on three metrics (AP value, number of parameters and inference time), we conducted two kinds of comparative experiments on our farmland obstacle dataset to evaluate Non-local Deformable DETR. First, we reproduced Deformable DETR and its two variants, Deformable DETR-Iterative Bounding Box Refinement and Deformable DETR-Two Stage [40]. Second, we reproduced some other classic object detection algorithms, such as Faster R-CNN, Mask R-CNN and Swin Transformer. The overall comparison results are shown in Figure 7. Non-local Deformable DETR achieves the best mAP with moderate inference time.
As shown in Tables 1 and 2, the overall AP value and the AP value of each category of the two variants are higher than those of the vanilla Deformable DETR. In terms of the mAP value, Deformable DETR-Iterative Bounding Box Refinement and Deformable DETR-Two Stage are 5.4% and 5.1% higher than the vanilla Deformable DETR, respectively. In particular, the $AP_S$ value is increased by 8.8% and 18.5%, respectively. Meanwhile, the number of parameters increased slightly, by 0.68 million and 0.99 million, and the inference time increased by 3.8 ms and 14.3 ms, respectively. Compared to Deformable DETR-Iterative Bounding Box Refinement, Deformable DETR-Two Stage achieves a slight performance gain at the cost of larger latency (10.5 ms). This work takes Deformable DETR-Iterative Bounding Box Refinement as the baseline and forms Non-local Deformable DETR by inserting non-local blocks into it. As shown in Table 1, Non-local Deformable DETR secures the best mAP (78.0%), with an inference time of 32.0 ms, which is slightly lower than that of Deformable DETR-Iterative Bounding Box Refinement (32.6 ms). Although the detection speed of Non-local Deformable DETR is only one-third that of Faster R-CNN, it achieves an mAP gain of 6.2%. For the UAV-based farmland obstacle detection task, we need a better trade-off between detection accuracy and speed. Therefore, we believe that the current detection speed of Non-local Deformable DETR is acceptable, although it needs to be further improved.
Table 2 presents the detection results of different algorithms for the six classes of farmland obstacles. For power tower and person detection, our proposed Non-local Deformable DETR achieves the highest AP. For UAV and building detection, Non-local Deformable DETR does not secure the best results (0.04% and 0.01% lower than Deformable DETR-Iterative Bounding Box Refinement, respectively), but still performs well. In farmland, wire poles and UAVs pose a serious danger to each other. Given the slender shape of a wire pole, its detection is more challenging. Nevertheless, our model again obtains the best result, outperforming the vanilla Deformable DETR by 11.5% in AP. We attribute these benefits to the enhanced global modeling capability that non-local operations bring to CNN feature extraction.
Figure 8 shows some samples containing the detected objects. It can be seen that Non-local Deformable DETR can accurately detect different objects with suitable bounding boxes. In particular, the detection results for the small power pole in the lower right image are also good. However, as shown in Figure 9, there are also some falsely detected objects. In Figure 9a, our model fails to detect the second person because it is blurred. In Figure 9b, our model wrongly detects the UAV as a building, because few UAVs of this kind appear in the training set and its image features are close to those of a building. In Figure 9c, our model fails to detect the person due to the backlit environment.

Conclusions
Focusing on the task of UAV-based unstructured farmland obstacle detection, this work proposed the Non-local Deformable DETR to enhance the performance of the original Deformable DETR. Specifically, we introduced non-local blocks into the front-end ResNet to improve the model's global representation capacity when extracting feature maps. Combined with the local self-attention mechanism in the deformable transformer, our Non-local Deformable DETR can not only capture local features, but also model long-distance dependencies. Based on our farmland obstacle dataset, we conducted a series of experiments to investigate the performance of our improved model. Compared with Deformable DETR and other high-performance object detection algorithms (Faster R-CNN, Mask R-CNN and Swin Transformer), Non-local Deformable DETR achieved the best mAP (78.0%) with moderate inference time (32.0 ms). Additionally, Non-local Deformable DETR demonstrated advantages in detecting small and slender objects, such as wire poles. Taking detection accuracy and speed into account, the proposed Non-local Deformable DETR has great potential to be deployed in UAV-based farmland obstacle detection tasks. In the future, we will continue to optimize the model to accelerate detection.