Animals
  • Article
  • Open Access

17 January 2025

MDA-DETR: Enhancing Offending Animal Detection with Multi-Channel Attention and Multi-Scale Feature Aggregation

1 School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China
2 Engineering Research Center for Forestry-Oriented Intelligent Information Processing of National Forestry and Grassland Administration, Beijing 100083, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Animal–Computer Interaction: Advances and Opportunities

Simple Summary

As agriculture and human settlements expand, conflicts between humans and animals become more frequent, resulting in resource loss and safety risks. Therefore, accurately identifying and locating offending animals is essential. This study presents a novel method for automatically detecting offending animals, especially in situations where they are obscured or the images are unclear. This research focused on six types of offending animals commonly found in northeastern China. By employing improved image processing techniques, this study enhanced detection accuracy in complex environments. Experimental results demonstrate that the proposed method outperforms existing techniques on the dataset used. This new approach helps improve the accuracy of intelligent monitoring systems, providing better technical support to minimize conflicts between humans and animals, thus protecting agriculture and ensuring the safety of both humans and animals.

Abstract

Conflicts between humans and animals in agricultural and settlement areas have recently increased, resulting in significant resource loss and risks to human and animal lives. This growing issue presents a global challenge. This paper addresses the detection and identification of offending animals, particularly in obscured or blurry nighttime images. It introduces MDA-DETR, a detector built on Multi-Channel Coordinated Attention and Multi-Dimension Feature Aggregation. MDA-DETR integrates multi-scale features for enhanced detection accuracy, employing a Multi-Channel Coordinated Attention (MCCA) mechanism to incorporate location, semantic, and long-range dependency information and a Multi-Dimension Feature Aggregation Module (DFAM) for cross-scale feature aggregation. Additionally, the VariFocal Loss function is utilized to assign pixel weights, enhancing detail focus while maintaining accuracy. The dataset is drawn from the Northeast China Tiger and Leopard National Park and includes images of six common offending animal species. In comprehensive experiments on this dataset, the mAP50 of MDA-DETR was 1.3%, 0.6%, 0.3%, 3%, 1.1%, and 0.5% higher than that of RT-DETR-r18, yolov8n, yolov9-C, DETR, Deformable-detr, and DCA-yolov8, respectively, indicating that MDA-DETR is superior to other advanced methods.

1. Introduction

With the rapid advancement of urbanization, human activities are increasingly encroaching on wildlife habitats. This has led to a rise in harmful wildlife incidents, despite efforts to establish national parks and protected areas. Such incidents not only pose a threat to human health and livelihoods but also highlight the urgent need to address wildlife–human conflicts [1]. As a result, mitigating these conflicts has become a global priority for governments and a focus of academic research.
Traditional methods for managing wildlife–human conflicts, such as erecting physical barriers like barbed wire or electric fences around settlements and agricultural areas, have notable limitations. These measures can induce stress responses or even fatalities in animals and pose potential risks to humans [2]. To overcome these challenges, there is a growing demand for intelligent surveillance systems capable of automatically detecting and identifying offending animals in real time.
Object detection, a fundamental yet challenging task in computer vision, involves assigning precise bounding boxes and classification labels to objects in images. Recent advances in deep learning, coupled with reduced hardware costs, have spurred the development of robust object detection algorithms. Two-stage detection frameworks such as Faster R-CNN [3,4] and R-CNN [5], as well as single-stage detectors like YOLO [6,7,8,9], have gained prominence. Moreover, the dominance of Convolutional Neural Networks (CNNs) in this field has recently been challenged by the rise of Transformers, which leverage attention mechanisms to boost model performance. DETR, the first Transformer-based object detection model [10], eliminates the need for manually designed anchor boxes and non-maximum suppression (NMS), but it is limited in processing speed. In 2023, Baidu's PaddlePaddle team introduced RT-DETR, a breakthrough in object detection. This model transitions from "dense detection" to "sparse detection", eliminating threshold filtering and NMS, and delivers an end-to-end, real-time object detection solution [11], making it a promising tool for applications such as offending animal detection.
Offending animals frequently appear near planting and breeding areas, causing damage to vegetation and livestock. These activities often obscure the animals within the damaged vegetation. Furthermore, many offending animals are nocturnal, leading to blurry images captured by trap cameras. This study emphasizes the crucial need for accurate detection and recognition of obscured or blurry images of offending animals. While the RT-DETR model exhibits strong robustness and generalization, it has limitations. The AIFI module, relying solely on a self-attention mechanism, provides rich semantic information but lacks precise positional information. Additionally, the CCFF module, despite integrating cross-scale information, struggles with detecting targets in blurry images.
To address these limitations, this paper proposes MDA-DETR, a novel method based on Multi-Channel Coordinated Attention and multi-scale feature aggregation, to enhance the detection of obscured and blurry images of offending animals. Inspired by channel attention mechanisms [12], we introduce a Multi-Channel Coordinated Attention (MCCA) mechanism. This mechanism extracts image features along three spatial directions, capturing long-range dependencies and providing precise positional information while preserving the rich semantic information from deep networks. This significantly improves the accuracy of detecting occluded targets.
Furthermore, inspired by the RepBiPAN module [13], we propose a Multi-Dimension Feature Aggregation Module (DFAM). This module effectively fuses positional information from shallow networks with semantic information from deeper networks, and aggregates features from multiple scales. This enhances the detection accuracy of blurry images. To reduce the overall parameter count, we incorporate the RepNCSPELAN4 module [14] into the backbone network.
MDA-DETR was rigorously evaluated on a subset of the Northeast China Tiger and Leopard National Park dataset [15]. Ablation experiments were conducted to assess the performance of each component, demonstrating the superior performance of MDA-DETR in detecting offending animals. The main contributions of this study are as follows:
(i)
Multi-Channel Coordinated Attention (MCCA) mechanism: this mechanism extracts image features along three spatial directions, integrating precise positional and long-range dependency information while preserving the rich semantic information from deep networks.
(ii)
Multi-Dimension Feature Aggregation Module (DFAM): this module effectively fuses multi-scale feature maps, enabling the extraction of complementary and global information.
(iii)
Comparative experiments demonstrate the superiority of MDA-DETR over state-of-the-art methods. Ablation experiments confirm the effectiveness of each key component.

3. Materials and Methods

3.1. Dataset

This study uses the publicly available Northeast China Tiger and Leopard National Park dataset [15]. From this dataset, six common species of offending animals in northeastern China—badger, black bear, leopard cat, red fox, weasel, and wild boar—were selected. The total dataset comprises 9641 images, including 5823 daytime images and 3818 nighttime images. The dataset was divided into training, validation, and test sets in a 7:2:1 ratio. Mosaic data augmentation was applied to the training set to enhance the dataset (see Table 1).
Table 1. Brief information about the dataset used in this article.
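To make the 7:2:1 split concrete, the following is a minimal sketch in Python; the list of image paths, the random seed, and the function name are illustrative assumptions, and mosaic augmentation is only indicated in a comment because it is applied inside the training data loader rather than at split time.

```python
import random

def split_dataset(image_paths, seed=0):
    """Partition image paths into train/val/test at a 7:2:1 ratio (Section 3.1)."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_train = int(0.7 * len(paths))
    n_val = int(0.2 * len(paths))
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]
    # Mosaic augmentation (stitching four training images into one) is applied
    # only to the training split, inside the data-loading pipeline.
    return train, val, test
```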

3.2. Overall Architecture

The overall structure of MDA-DETR proposed in this paper is shown in Figure 1; it is a Transformer-based encoding–decoding structure. First, the improved ResNet-18 backbone network is used to encode and extract image features, yielding three feature maps at different scales with resolutions $S_4[M, N]$, $S_5[M/2, N/2]$, and $S_7[M/4, N/4]$. The deep feature map $S_7$ is then fed into MCCA to obtain accurate positional information; in addition, the AIFI self-attention mechanism from the RT-DETR model is retained to extract rich semantic information. Next, the feature maps at the three scales—shallow, intermediate, and deep—are fused by DFAM to obtain a new feature map with global information. Finally, the Variable Focal Loss function (VariFocal Loss) [34], the GIoU loss function [35], and the L1 loss [36] are used to measure the error between the model's predictions and the ground truth labels and to adjust the weight parameters accordingly, further optimizing the overall model.
Figure 1. The overall structure of MDA-DETR.
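The data flow described above can be summarized in the following PyTorch-style sketch; all submodule classes are assumed to be defined elsewhere, and the names used here are placeholders for illustration rather than the authors' implementation.

```python
import torch.nn as nn

class MDADETRSketch(nn.Module):
    """Schematic data flow of MDA-DETR (Figure 1); every submodule is an
    injected placeholder, not the authors' actual implementation."""

    def __init__(self, backbone, mcca, aifi, dfam, decoder):
        super().__init__()
        self.backbone = backbone    # improved ResNet-18 (RepNCSPELAN4 in the last stage)
        self.mcca = mcca            # Multi-Channel Coordinated Attention (Section 3.3)
        self.aifi = aifi            # AIFI self-attention retained from RT-DETR
        self.dfam = dfam            # Multi-Dimension Feature Aggregation Module (Section 3.4)
        self.decoder = decoder      # RT-DETR style Transformer decoder + prediction heads

    def forward(self, images):
        # Three feature scales: S4 [M, N], S5 [M/2, N/2], S7 [M/4, N/4]
        s4, s5, s7 = self.backbone(images)
        s7 = self.aifi(self.mcca(s7))      # positional + semantic enrichment of the deep map
        fused = self.dfam([s4, s5, s7])    # cross-scale aggregation
        return self.decoder(fused)         # boxes + class scores
```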

3.3. Multi-Channel Coordinated Attention (MCCA)

Generally, channel attention mechanisms play a crucial role in improving model performance. However, traditional channel attention mechanisms focus primarily on extracting semantic information from feature maps, often neglecting the importance of target position [37]. In the field of detecting offending animals, obstructions often occur, making accurate positional information critical for enhancing model predictive performance. Inspired by channel attention mechanisms, this paper proposes the Multi-Channel Coordinated Attention Mechanism (MCCA). It effectively extracts semantic and positional information from feature maps through three spatial directions and captures long-range dependencies (as shown in Figure 2), ultimately improving the accuracy of detecting obscured offending animals.
Figure 2. The structure diagram of MCCA.
Specifically, the input image of size M × N is first decomposed into three one-dimensional feature encoding processes. The three outputs can be expressed as follows:
$$Z(h) = \frac{1}{W}\sum_{j=1}^{W} x(h, j), \qquad Z(w) = \frac{1}{H}\sum_{i=1}^{H} x(i, w), \qquad Z(ch) = \sum_{i=1}^{H}\sum_{j=1}^{W} x(i, j)$$
The features along the H and W directions are then concatenated to obtain $Z_{concat} = [Z(h), Z(w)]$. The combined features are fed into a 3 × 1 convolution, followed by batch normalization and SiLU activation [38]. This process can be described as follows:
$$F_1 = B_{conv}(Z_{concat})$$
where $B_{conv}(\cdot)$ represents a series of operations, including 3 × 1 convolution, batch normalization, and SiLU activation.
Subsequently, $F_1$ is decomposed into two separate tensors along the H and W directions to obtain $F_1^h$ and $F_1^w$. Upsampling and sigmoid activation operations are then performed on $F_1$; the sigmoid activation is used here to reduce the complexity and computational overhead of the module. This process can be described as follows:
$$F_2 = \mathrm{sigmoid}(U(F_1))$$
where $U(\cdot)$ denotes upsampling.
Decomposing $F_2$ into two separate attention weights along the H and W directions yields $F_2^h$ and $F_2^w$, which are used to refine the $F_1^h$ and $F_1^w$ tensors. Next, $F_2$ is batch normalized and multiplied with $Z(ch)$, which has been processed through a 1 × 1 convolution, to obtain $F_3$. This process can be defined as follows:
$$F_3 = F_2 \times \mathrm{Conv}(Z(ch))$$
The resulting output $F_{out}$ can be written as
$$F_{out} = F_1^h \times F_1^w \times F_2^h \times F_2^w \times F_3$$
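The following is a minimal PyTorch sketch of MCCA under these equations; the kernel shapes, the handling of the upsampling step, and the broadcasting of F3 are assumptions where the text leaves details open, so it should be read as an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MCCA(nn.Module):
    """Minimal sketch of Multi-Channel Coordinated Attention (Section 3.3).
    Channel widths, kernel shapes, and broadcasting details are assumptions;
    only the three-direction pooling and the multiplicative fusion follow
    the equations in the text."""

    def __init__(self, channels):
        super().__init__()
        self.bconv = nn.Sequential(                       # B_conv: 3x1 conv + BN + SiLU
            nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0)),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        self.conv_ch = nn.Conv2d(channels, channels, kernel_size=1)  # 1x1 conv on Z(ch)

    def forward(self, x):
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                 # Z(h): average over W -> (B, C, H, 1)
        z_w = x.mean(dim=2, keepdim=True)                 # Z(w): average over H -> (B, C, 1, W)
        z_ch = x.sum(dim=(2, 3), keepdim=True)            # Z(ch): global sum  -> (B, C, 1, 1)

        z_cat = torch.cat([z_h, z_w.transpose(2, 3)], dim=2)  # stack along the spatial axis
        f1 = self.bconv(z_cat)                                 # F1 = B_conv(Z_concat)
        f1_h, f1_w = torch.split(f1, [h, w], dim=2)            # split back into H and W branches
        f1_w = f1_w.transpose(2, 3)

        f2 = torch.sigmoid(f1)                                 # F2 = sigmoid(U(F1)); upsampling omitted here
        f2_h, f2_w = torch.split(f2, [h, w], dim=2)
        f2_w = f2_w.transpose(2, 3)

        f3 = f2.mean(dim=2, keepdim=True) * self.conv_ch(z_ch)  # F3 = F2 x Conv(Z(ch)); shape handling assumed

        # F_out = F1^h x F1^w x F2^h x F2^w x F3, broadcast to (B, C, H, W)
        return f1_h * f1_w * f2_h * f2_w * f3
```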

3.4. Multi-Dimension Feature Aggregation Module (DFAM)

When detecting offending animals, their rapid movements and nighttime activity can result in blurry images with unclear boundaries when captured by trap cameras. Using the MCCA module and the AIFI module, the model obtains feature maps with precise localization and rich semantic information; however, recognition accuracy under such challenging conditions remains an issue. To exploit the complementarity and correlation between multi-scale features, this paper proposes the Multi-Dimension Feature Aggregation Module (DFAM), inspired by the RepBiPAN module, to fuse feature maps from the backbone network with those processed through the MCCA and AIFI operations. Figure 1 illustrates the overall structure of DFAM. The main contribution of DFAM is the fusion of multiple feature maps at different scales, using both element-wise addition and concatenation. This approach enhances information at the same scale while benefiting from features at other scales. Additionally, this paper introduces an aggregation block (Fuse) that consolidates feature maps of varying sizes along the DFAM fusion path, followed by a convolution block (RepC3) composed of multiple 1 × 1 convolutional layers. These operations enable the model to integrate feature maps from each layer, yielding feature maps with globally essential features and improving the detection accuracy of blurry animal images. Figure 3 shows the structure of Fuse and RepC3 in DFAM.
Figure 3. Structure of Fuse and RepC3 in DFAM.
Specifically, the Fuse aggregation block handles three inputs: $f_{up}$, $f_{middle}$, and $f_{down}$. For the shallow feature map, $f_{up}$ is first processed by a 1 × 1 convolution, followed by batch normalization and the SiLU activation function, and then downsampled to facilitate subsequent fusion operations. This process can be expressed as follows:
$$f_1 = D(B_{conv1}(f_{up}))$$
where $B_{conv1}(\cdot)$ is a series of sequential operations, including 1 × 1 convolution, batch normalization, and SiLU activation, and $D(\cdot)$ denotes downsampling.
After that, the middle-layer feature map $f_{middle}$ is passed through $B_{conv1}(\cdot)$ for a convolution with a 1 × 1 kernel, and the deep-layer feature map $f_{down}$ is upsampled. This process can be represented as follows:
$$f_2 = B_{conv1}(f_{middle}), \qquad f_3 = U(f_{down})$$
where $U(\cdot)$ denotes upsampling.
Then, $f_1$ and $f_2$ are fused using element-wise addition, the fused feature map is concatenated with $f_1$ and $f_3$, and the result is finally passed through $B_{conv1}(\cdot)$ for a convolution operation. This process can be defined as follows:
$$f_{out} = B_{conv1}(\mathrm{concat}(f_1 + f_2,\; f_1,\; f_3))$$
In the RepC3 convolutional block, the input features first undergo sequential operations to adjust the number of channels. After this, feature extraction is performed on one of the branches using a RepBlock consisting of N RepConvs [39]. Finally, the outputs of the dual paths are fused using element-wise addition. By employing DFAM, the model can fuse features at three different scales and utilize cross-level features to acquire more global and complementary information. This results in a more expressive fused feature image that accurately detects the presence of offending animals in blurred images.
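A minimal sketch of the Fuse aggregation block is given below, assuming all three inputs share the same channel count and that downsampling and upsampling are implemented with adaptive pooling and nearest-neighbor interpolation; these operator choices are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Fuse(nn.Module):
    """Minimal sketch of the DFAM aggregation block (Fuse) of Section 3.4.
    Channel counts and the exact down/upsampling operators are assumptions."""

    def __init__(self, channels):
        super().__init__()
        def bconv1(c_in, c_out):                  # B_conv1: 1x1 conv + BN + SiLU
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=1),
                nn.BatchNorm2d(c_out),
                nn.SiLU(),
            )
        self.proj_up = bconv1(channels, channels)
        self.proj_mid = bconv1(channels, channels)
        self.proj_out = bconv1(3 * channels, channels)

    def forward(self, f_up, f_middle, f_down):
        f2 = self.proj_mid(f_middle)                           # f2 = B_conv1(f_middle)
        target = f2.shape[-2:]
        # f1 = D(B_conv1(f_up)): project the shallow map, then downsample to the middle resolution
        f1 = F.adaptive_max_pool2d(self.proj_up(f_up), target)
        # f3 = U(f_down): upsample the deep map to the middle resolution
        f3 = F.interpolate(f_down, size=target, mode="nearest")
        # f_out = B_conv1(concat(f1 + f2, f1, f3))
        return self.proj_out(torch.cat([f1 + f2, f1, f3], dim=1))
```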

3.5. Backbone Network Improvement Strategy

Although the feature maps processed by MCCA and DFAM retain accurate positional information and rich semantic information, the overall number of model parameters is high. To address this, we use the RepNCSPELAN4 structure to replace the last layer of the ResNet-18 backbone network. RepNCSPELAN4 combines CSPNet with gradient path planning and ELAN, extending ELAN into GELAN so that it supports arbitrary computational blocks; the design is aimed at lightweight models that preserve inference speed and accuracy. This optimization strategy enables the model to reduce the number of parameters while maintaining the original accuracy.

3.6. Loss Function

The Variable Focal Loss function (VariFocal Loss) uses the IoU-aware Classification Score (IACS) to rank candidate detection boxes by the probability that they contain a target object. The loss function employs an asymmetric weighting approach to improve the model's classification performance by reducing the weight of negative samples and increasing the weight of positive samples, allowing the model to focus on training with high-quality positive samples. VariFocal Loss is defined as follows:
$$L_{VFL} = \begin{cases} -y^{(i)}\left(y^{(i)}\log\hat{y}^{(i)} + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})\right), & y^{(i)} > 0 \\ -\alpha\,(\hat{y}^{(i)})^{\gamma}\log(1 - \hat{y}^{(i)}), & y^{(i)} = 0 \end{cases}$$
where $y^{(i)}$ denotes the label of sample $i$, with positive class 1 and negative class 0, $\hat{y}^{(i)}$ denotes the predicted IACS for sample $i$, and $\alpha$ and $\gamma$ are scaling factors that modulate the loss of negative samples.
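The piecewise definition above corresponds to the following sketch; the α and γ defaults are taken from the original VarifocalNet formulation and are assumptions here, as are the function name, the clamping used for numerical stability, and the omission of any normalization over positive samples.

```python
import torch

def varifocal_loss(pred_score, target_score, alpha=0.75, gamma=2.0):
    """Minimal sketch of VariFocal Loss.
    pred_score: predicted IACS after sigmoid, in (0, 1).
    target_score: y(i), > 0 for positive samples and 0 for negatives."""
    pred_score = pred_score.clamp(1e-6, 1 - 1e-6)           # numerical stability
    pos = target_score > 0
    # Positive samples: binary cross-entropy weighted by the target score itself
    bce = -(target_score * torch.log(pred_score)
            + (1 - target_score) * torch.log(1 - pred_score))
    loss_pos = target_score * bce
    # Negative samples: down-weighted by alpha * pred^gamma
    loss_neg = -alpha * pred_score.pow(gamma) * torch.log(1 - pred_score)
    # Normalization (e.g., by the number of positives) is left out of this sketch.
    return torch.where(pos, loss_pos, loss_neg).sum()
```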
Traditional object detection algorithms typically use the IoU loss function to measure the overlap between the predicted box and the ground truth. However, when the predicted and ground truth boxes do not overlap, the IoU loss provides no gradient, making further model optimization difficult [40]. To better localize the target and reduce the absolute error between the prediction and the ground truth, the GIoU loss function and the L1 loss are introduced. Specifically, the GIoU loss adds a penalty term to the IoU loss that measures the non-overlap between the predicted and ground truth boxes, addressing the vanishing gradient problem and enabling faster convergence. The L1 loss, or mean absolute error (MAE), is the average of the absolute errors between the model's predicted values and the true values. Using both the GIoU loss and the L1 loss provides better supervision during model training and ensures the accuracy of its results.
The expressions for the loss functions are as follows:
$$L_{GIoU} = 1 - IoU + \frac{|A^C \setminus U|}{|A^C|}$$
$$L_1 = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$
where $IoU$ denotes the intersection-over-union between the predicted and ground truth boxes, $A^C$ represents the minimum bounding box containing both the predicted and ground truth boxes, and $U$ represents the union of the predicted and ground truth boxes. $N$ is the number of samples, $y_i$ is the true value for sample $i$, and $\hat{y}_i$ is the predicted value for sample $i$.
Finally, the overall loss function for the proposed model is as follows:
$$L = L_{VFL} + L_{GIoU} + L_1$$
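For reference, a minimal sketch of the GIoU and L1 terms for boxes in (x1, y1, x2, y2) format follows; any weighting coefficients used to balance the three loss terms during training are not specified in the text and are therefore omitted from this sketch.

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """Minimal sketch of the GIoU loss for (N, 4) boxes in (x1, y1, x2, y2) format."""
    # Intersection
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    # Union
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # Smallest enclosing box A^C
    cx1 = torch.min(pred[:, 0], target[:, 0])
    cy1 = torch.min(pred[:, 1], target[:, 1])
    cx2 = torch.max(pred[:, 2], target[:, 2])
    cy2 = torch.max(pred[:, 3], target[:, 3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    # L_GIoU = 1 - IoU + |A^C \ U| / |A^C|
    return (1 - iou + (area_c - union) / (area_c + eps)).mean()

def l1_loss(pred, target):
    """L1 (mean absolute error) between predicted and ground truth box coordinates."""
    return (pred - target).abs().mean()

# Overall loss L = L_VFL + L_GIoU + L_1 (any balancing weights are omitted here).
```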

4. Results

4.1. Evaluation Metrics

To evaluate the model proposed in this study, six widely used evaluation metrics for object detection tasks were adopted: Precision, Recall, mAP50 (mean Average Precision), mAP50–95, parameters, and FPS (Frames Per Second). In object detection tasks, the Intersection over Union (IoU) between the predicted bounding box and the ground truth box is generally used to assess their overlap, thereby evaluating the model's detection accuracy on a specific dataset. A higher IoU value indicates greater similarity between the two boxes. If the IoU exceeds a certain threshold (commonly set at 0.5, as in this paper), the model is considered to have successfully detected the object, and the result is classified as a positive example (P); otherwise, it is treated as a negative example (N). For a positive example, if the model's classification result is correct, it is considered a true positive (TP); otherwise, it is regarded as a false positive (FP). If a ground truth box is not detected, it is labeled as a false negative (FN). Based on these, Precision and Recall can be calculated using the following formulas:
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
However, Precision and Recall do not fully reflect the performance of an object detection model across different thresholds. When Precision increases, Recall often decreases; conversely, when Recall increases, Precision tends to decrease. This is because the stricter the criteria for predicting a sample as a positive example, the more likely the model is to miss some true positives; the more lenient the criteria, the more likely the model is to increase the number of false positives. Therefore, this study includes two additional evaluation metrics: mAP50 and mAP50–95. mAP50 represents the average precision (AP) across multiple categories at an IoU threshold of 0.5. mAP50–95 represents the average of mAP results at 10 IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. The higher the mAP value, the better the model's detection performance. The formulas for calculating AP and mAP are as follows:
$$AP = \sum_{k=1}^{N} P(s_k)\cdot\Delta R(s_k)$$
$$mAP = \frac{1}{c}\sum_{i=1}^{c} AP(i)$$
where P represents Precision, R represents Recall, s represents the classification confidence of the detection box, N represents the number of discrete points, and c represents the number of categories.
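A minimal sketch of the discrete AP summation and the class-wise mAP average is shown below; it assumes the precision/recall pairs have already been computed from TP, FP, and FN counts at decreasing confidence thresholds, and the example numbers in the comment are hypothetical.

```python
import numpy as np

def average_precision(precisions, recalls):
    """Discrete AP = sum_k P(s_k) * deltaR(s_k): area under the precision-recall
    curve sampled at decreasing confidence thresholds s_k."""
    precisions = np.asarray(precisions, dtype=float)
    recalls = np.asarray(recalls, dtype=float)
    delta_r = np.diff(np.concatenate(([0.0], recalls)))   # recall increments deltaR(s_k)
    return float(np.sum(precisions * delta_r))

def mean_average_precision(ap_per_class):
    """mAP = mean of per-class AP values; mAP50 uses an IoU threshold of 0.5,
    while mAP50-95 averages this over IoU thresholds 0.5:0.05:0.95."""
    return float(np.mean(ap_per_class))

# Hypothetical example: precision/recall pairs from one class.
# ap = average_precision([1.0, 0.9, 0.8], [0.3, 0.6, 0.9])  # -> 0.81
```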
For model complexity and detection speed, this study uses two metrics: Frames Per Second (FPS) and the number of parameters. FPS indicates the number of frames the object detection model can process per second; the higher the FPS value, the faster the detection speed and the better the real-time performance of the model. The parameter count reflects the spatial complexity of the model. FPS is expressed as follows:
$$FPS = \frac{1000\ \mathrm{ms}}{preprocess + inference + NMS}$$
where 1000 ms refers to 1000 milliseconds, $preprocess$ refers to the time required for the preprocessing stage, $inference$ refers to the time required for the model's inference, and $NMS$ refers to the post-processing time required when the model applies Non-Maximum Suppression.
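The FPS formula can be measured with a simple timing loop such as the following sketch; the `preprocess` and `postprocess` callables are assumptions, and the NMS term applies only to detectors that use it (DETR-style models skip this step).

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, images, preprocess, postprocess=None):
    """Minimal sketch of the FPS computation: 1000 ms divided by the average
    per-image time spent on preprocessing, inference, and optional NMS."""
    total_ms = 0.0
    for img in images:
        start = time.perf_counter()
        x = preprocess(img)
        out = model(x)
        if postprocess is not None:       # NMS-based detectors only
            out = postprocess(out)
        total_ms += (time.perf_counter() - start) * 1000.0
    return 1000.0 / (total_ms / len(images))
```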

4.2. Parameter Settings

The overall model is implemented using the PyTorch framework and trained on a Titan RTX3090 GPU. During training, ResNet-18 [31] weights pre-trained on ImageNet-1k [41] were used for initialization, and the model was then trained on the public dataset described above. The model used the AdamW optimizer with a weight decay of 0.0001, an initial learning rate of 0.0001, and a mosaic data augmentation probability of 0.5.
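These hyperparameters translate directly into a PyTorch optimizer configuration, sketched below with a placeholder network; the placeholder module and variable names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)  # placeholder for the MDA-DETR network (assumption)

# Optimizer settings from Section 4.2
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,            # initial learning rate 0.0001
    weight_decay=1e-4,  # weight decay 0.0001
)
# Mosaic augmentation is applied with probability 0.5 inside the training data pipeline.
```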

4.3. Comparative Experiments on Backbone Network Improvement Strategies

This experiment compares the model with the improved backbone network against the model with the original ResNet-18 backbone, in both cases retaining the RT-DETR decoder. The results show that the improved model reduces the number of parameters by 46.75% compared with the original model using the ResNet-18 backbone, while also achieving some improvement in accuracy, as shown in Table 2.
Table 2. Comparison of the improved backbone network model with the ResNet-18 backbone network model on the dataset.

4.4. Ablation Study

To evaluate the effectiveness of each key component in MDA-DETR, this paper conducted comprehensive ablation experiments focusing on (1) the effectiveness of backbone network improvement strategies, (2) the effectiveness of MCCA, and (3) the importance of DFAM. During these experiments, two challenging scenarios—occlusion and blurring—were selected to observe and record the performance of these key components under different conditions, demonstrating their effectiveness in various situations.
(1)
Effectiveness of Backbone Network Improvement Strategies: The backbone network employs the RepNCSPELAN4 structure, which improves all four indicators compared with the model using only the ResNet-18 backbone (as shown in the second row of Table 3). However, when animals are occluded, the detection accuracy remains low (as shown in the second row and fourth column of Figure 4).
Table 3. Ablation experiments of the three key components on the dataset constructed in this paper.
Figure 4. Comparison of ablation experiments. (The orange animal in the upper right corner of the third column is the dataset icon.)
(2)
Effectiveness of MCCA: This study retained only the improved backbone network and MCCA for training and testing to demonstrate its effectiveness. The results indicate that MCCA significantly improves all four evaluation metrics and further enhances the accuracy of detecting occluded offending animals (as shown in the third row and fourth column of Figure 4).
(3)
Effectiveness of DFAM: DFAM adjusts the number of channels in the feature map generated by the AIFI module and further fuses it with the feature maps generated by the improved backbone network. To demonstrate DFAM's effectiveness, this study retained only DFAM and the improved backbone network for testing. Table 3 shows that adding DFAM leads to a slight decrease in recall; however, the 0.3% decrease in recall is accompanied by a 1.2% increase in precision, making the slight drop acceptable. Furthermore, Figure 4 shows that this module improves the accuracy of the generated minimum bounding rectangle; in particular, when the offending animal in the image is blurry, it can more accurately locate and identify the animal. Finally, by combining the three proposed improvements, the model achieved a 2.9% increase in accuracy, a 3.3% increase in recall, a 2.9% improvement in mAP50, and a 4.7% improvement in mAP50–95 compared with the original baseline. The detection performance was excellent, demonstrating the effectiveness of the improvement strategy proposed in this article.

4.5. Comparison with Other Models

To comprehensively evaluate the performance of MDA-DETR, this paper compares it with six state-of-the-art object detection models: RT-DETR-r18 [11], yolov8n [42], yolov9-C [14], DETR [10], Deformable-detr [18], and DCA-yolov8 [43]. The results for all models were generated using the official code or based on published papers.
Table 4 shows the results of comparing MDA-DETR with six state-of-the-art methods, where red and blue fonts indicate the best and second-best performance, respectively. From the results in Table 4, it can be seen that the proposed model achieves 97.8% mAP50 on the dataset, outperforming the original RT-DETR-r18 model by 1.3%; it is also 0.6% better than yolov8n, 0.3% better than yolov9-C, 3% better than DETR, 1.1% better than Deformable-detr, and 0.5% better than DCA-yolov8. Overall, the model proposed in this article outperforms other advanced methods in detecting offending animals on the selected subset of the publicly available Northeast China Tiger and Leopard National Park dataset and can better detect images of offending animals.
Table 4. Comparison of MDA-DETR with 6 state-of-the-art methods.
Figure 5 and Figure 6 show the detection results of MDA-DETR and the advanced methods in different scenarios. The selected images cover three scenarios: occluded animals, images containing only partial animal features, and blurry animals at night. Among the compared methods, DETR and Deformable-detr perform well on images with only partial animal features, but both produce false detections on occluded animal images (leopard cats), and DETR also produces false detections on occluded wild boars. yolov8n and yolov9-C can effectively handle scenes with only partial animal features and animal occlusion, but when detecting blurry animal images at night, yolov8n shows lower detection accuracy, while yolov9-C misses some targets (badgers). DCA-yolov8, which can also handle occluded scenes, was selected as an additional comparison model; the results show that the proposed MDA-DETR achieves higher detection accuracy than DCA-yolov8 in several complex tasks. Overall, the method proposed in this article outperforms the other methods on the publicly available dataset used. MDA-DETR not only accurately classifies the offending animal but also accurately locates and detects it in scenes where the animal is occluded or blurred at night.
Figure 5. Comparison between ours and advanced methods (badger, black bear, leopard cat).
Figure 6. Comparison between ours and advanced methods (red fox, yellow weasel, wild boar).

5. Discussion

This study proposed MDA-DETR, a novel object detection model, to address the challenges of detecting offending animals, particularly in obscured and blurry nighttime images. Comprehensive evaluations on the Northeast China Tiger and Leopard National Park dataset demonstrated that MDA-DETR significantly outperforms six state-of-the-art models, achieving superior results in mAP50 (+1.3% to +3%) and mAP50-95 (+0.5% to +4.7%). These results validate the effectiveness of the Multi-Channel Coordinated Attention (MCCA) mechanism and the Multi-Dimension Feature Aggregation Module (DFAM) in improving detection accuracy under challenging conditions.
The main findings of this research align with the initial hypothesis that incorporating multi-scale feature aggregation and enhanced attention mechanisms would address limitations in existing models. The observed improvements can be attributed to the following factors: The MCCA mechanism effectively integrates spatial and long-range dependencies, enhancing the accuracy of detecting occluded animals. The DFAM module fuses features across multiple scales, providing global and complementary information to improve the detection of blurry images. The use of VariFocal Loss further enhances classification accuracy by emphasizing high-quality positive samples.
Compared with the YOLO series, DETR, and its variants, MDA-DETR demonstrates significant advantages in handling occlusion and nighttime detection on the dataset used in this study. While the existing YOLO and DETR series methods perform well in most conventional object detection tasks, YOLO models tend to lose crucial information when processing blurred images of animals at night, leading to inaccurate target localization. Although DETR and its variants have improved target detection localization, they often fail to correctly classify objects in complex backgrounds, especially in occlusion or cluttered backgrounds. In contrast, MDA-DETR effectively overcomes these challenges by combining MCCA and DFAM. By incorporating multi-scale feature fusion and integrating the image’s positional information, semantic information, and long-range dependencies, MDA-DETR avoids detection failures caused by the loss of local features. Additionally, this multi-level information integration enhances the model’s ability to detect occluded animals and blurred nighttime animal images, reducing false negatives and false positives, and providing stronger robustness.
Despite the significant advantages demonstrated by MDA-DETR, this study still has certain limitations. First, the dataset used in this study contains only six specific species of offending animals, which limits the model’s applicability to other species and broader scenarios. Second, the dataset does not cover more complex environmental factors, such as changing lighting conditions or motion blur caused by rapid movement, which can often affect the model’s performance in detection tasks. Furthermore, although MDA-DETR has achieved significant accuracy improvements, its computational complexity may pose challenges for real-time deployment on edge devices.
To address these limitations, future research will focus on the following areas. Expanding the dataset: plans are in place to create a dataset that includes more animal species and more complex environmental factors; in addition, generative models or trap cameras will be considered to acquire relevant images for further evaluating the model's performance in various scenarios, thereby enhancing its generalization ability and applicability to a broader range of species. Optimizing the model architecture: given the computational complexity involved in real-time deployment, future work will explore lighter model architectures, such as integrating more efficient attention mechanisms or designing feature extraction modules with low computational overhead, to better suit resource-constrained edge devices. Expanding the application scope: future studies will further explore the potential of MDA-DETR in a wider range of application scenarios, including general wildlife detection and endangered species protection. By expanding its application scope, we hope that MDA-DETR can play a greater role in protecting ecological environments and biodiversity.

6. Conclusions

We propose a novel model, MDA-DETR, for detecting offending animals. To better handle scenes in which animals are occluded, the MCCA module is designed to extract semantic information, target location information, and long-range dependencies from feature maps. The DFAM module then aggregates features at three different scales to obtain global features, enhancing the accuracy of detecting offending animals in blurry images. The data used in this study come from the Northeast China Tiger and Leopard National Park dataset. Ablation experiments on this dataset demonstrate the effectiveness of the key components of MDA-DETR, and comprehensive comparative experiments confirm the robustness and effectiveness of the model.

Author Contributions

All authors have made significant contributions to this research. Conceptualization, H.L. and H.Z.; methodology, H.L. and H.Z.; validation, H.L.; writing, H.L., H.Z., G.S., and F.Y.; supervision, H.Z., G.S., and F.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Key R&D Program of China (Grant number 2022YFF1302700), The Emergency Open Competition Project of National Forestry and Grassland Administration (Grant number 202303), and Outstanding Youth Team Project of Central Universities (Grant number QNTD202308).

Institutional Review Board Statement

Ethical review and approval were waived for this study because the research involves the development and evaluation of a computer vision model for detecting offending animals. The model was trained and tested on existing datasets of images and videos, such as the NTLNP dataset (https://github.com/myyyyw/NTLNP, (accessed on 9 March 2023)), which does not involve any direct interaction with or harm to animals.

Data Availability Statement

This study uses the publicly available Northeast China Tiger and Leopard National Park dataset, which can be downloaded at https://github.com/myyyyw/NTLNP (accessed on 9 March 2023).

Acknowledgments

We appreciate the hard work of every author, the valuable suggestions of every reviewer, and the patience and meticulousness of academic editors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Huang, Y.; Yang, J.; Zhang, H.; Su, K.W.; Wen, Y.L. A Comparative Study of Human-wildlife Conflicts Management in Nature Reserves at Home and Abroad. World For. Res. 2021, 34, 27–32. [Google Scholar]
  2. Mamat, N.; Othman, M.F.; Yakub, F. Animal intrusion detection in farming area using YOLOv5 approach. In Proceedings of the 2022 22nd International Conference on Control, Automation and Systems (ICCAS), Jeju, Republic of Korea, 27 November–1 December 2022; pp. 1–5. [Google Scholar]
  3. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  5. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  6. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  7. Redmon, J. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  8. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  9. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  10. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23 August 2020; pp. 213–229. [Google Scholar]
  11. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  12. Guo, M.-H.; Xu, T.-X.; Liu, J.-J.; Liu, Z.-N.; Jiang, P.-T.; Mu, T.-J.; Zhang, S.-H.; Martin, R.R.; Cheng, M.-M.; Hu, S.-M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
  13. Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Ke, Z.; Xu, X.; Chu, X. Yolov6 v3.0: A full-scale reloading. arXiv 2023, arXiv:2301.05586. [Google Scholar]
  14. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  15. GitHub Repository. 2023. Available online: https://github.com/myyyyw/NTLNP (accessed on 2 January 2025).
  16. Ashish, V. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  17. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  18. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  19. Wang, Y.; Zhang, X.; Yang, T.; Sun, J. Anchor detr: Query design for transformer-based detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; pp. 2567–2575. [Google Scholar]
  20. Zeng, W.; Jin, S.; Xu, L.; Liu, W.; Qian, C.; Ouyang, W.; Luo, P.; Wang, X. TCFormer: Visual Recognition via Token Clustering Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9521–9535. [Google Scholar] [CrossRef] [PubMed]
  21. Peng, Y.; Li, H.; Wu, P.; Zhang, Y.; Sun, X.; Wu, F. D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement. arXiv 2024, arXiv:2410.13842. [Google Scholar]
  22. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 821–830. [Google Scholar]
  23. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  24. Delplanque, A.; Foucher, S.; Lejeune, P.; Linchant, J.; Théau, J. Multispecies detection and identification of African mammals in aerial imagery using convolutional neural networks. Remote Sens. Ecol. Conserv. 2022, 8, 166–179. [Google Scholar] [CrossRef]
  25. Cai, Q.; Zheng, B.; Zeng, X.; Hou, J. Wildlife object detection combined with solving method of long-tail data. J. Comput. Appl. 2022, 42, 1284–1291. [Google Scholar]
  26. Roy, A.M.; Bhaduri, J.; Kumar, T.; Raj, K. WilDect-YOLO: An efficient and robust computer vision-based accurate object localization model for automated endangered wildlife detection. Ecol. Inform. 2023, 75, 101919. [Google Scholar] [CrossRef]
  27. Yang, W.; Liu, T.; Zhou, J.; Hu, W.; Jiang, P. CNN-Swin Transformer Detection Algorithm of Forest Wildlife Images Based on Improved YOLOv5s. Sci. Silvae Sin. 2024, 60, 121–130. [Google Scholar]
  28. Ram, A.V.; Prakash, A.S.; Ahamed, A.I.; Anirudh, K.; Arvindh, M.M.; Nithyavathy, N. A self induced warning system for wild animal trespassing using machine vision system. In Proceedings of the 2018 International Conference on Intelligent Computing and Communication for Smart World (I2C2SW), Erode, India, 14–15 December 2018; pp. 349–353. [Google Scholar]
  29. Ravoor, P.C.; Sudarshan, T.S.B.; Rangarajan, K. Digital Borders: Design of an Animal Intrusion Detection System Based on Deep Learning. In Proceedings of the International Conference on Computer Vision and Image Processing, Prayagraj, India, 4–6 December 2020; pp. 186–200. [Google Scholar]
  30. Chiu, Y.-C.; Tsai, C.-Y.; Ruan, M.-D.; Shen, G.-Y.; Lee, T.-T. Mobilenet-SSDv2: An improved object detection model for embedded systems. In Proceedings of the 2020 International Conference on System Science and Engineering (ICSSE), Kagawa, Japan, 31 August–3 September 2020; pp. 1–5. [Google Scholar]
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  32. Lee, J.; Lim, K.; Cho, J. Improved Monitoring of Wildlife Invasion through Data Augmentation by Extract—Append of a Segmented Entity. Sensors 2022, 22, 7383. [Google Scholar] [CrossRef] [PubMed]
  33. Charles, P.S.; Dharmalingam, S.; Baskaran, M.; Manoharan, P.S.; Hemanth, G.R. Intelligence Harmless Animal Avoider in Agriculture Farm. In Proceedings of the 2023 International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, India, 23–25 January 2023; pp. 1–4. [Google Scholar]
  34. Zhang, H.; Wang, Y.; Dayoub, F.; Sunderhauf, N. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8514–8523. [Google Scholar]
  35. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  36. Barron, J.T. A general and adaptive robust loss function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4331–4339. [Google Scholar]
  37. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  38. Elfwing, S.; Uchibe, E.; Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw. 2018, 107, 3–11. [Google Scholar] [CrossRef] [PubMed]
  39. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742. [Google Scholar]
  40. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. Unitbox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 516–520. [Google Scholar]
  41. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 1405–1573. [Google Scholar] [CrossRef]
  42. Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO. GitHub Repository. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 2 January 2025).
  43. Yang, W.; Wu, J.; Zhang, J.; Gao, K.; Du, R.; Wu, Z.; Firkat, E.; Li, D. Deformable convolution and coordinate attention for fast cattle detection. Comput. Electron. Agric. 2023, 211, 108006. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
