Article

MACNet: A More Accurate and Convenient Pest Detection Network

Yating Hu, Qijin Wang, Chao Wang, Yu Qian, Ying Xue and Hongqiang Wang
1 School of Electronic and Information Engineering, Anhui Jianzhu University, Hefei 230601, China
2 School of Big Data and Artificial Intelligence, Anhui Xinhua University, Hefei 230088, China
3 Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(6), 1068; https://doi.org/10.3390/electronics13061068
Submission received: 10 February 2024 / Revised: 8 March 2024 / Accepted: 11 March 2024 / Published: 14 March 2024

Abstract

Pest detection is essential for the early warning of pests in the agricultural sector. However, agricultural pest datasets pose challenges that include, but are not limited to, species diversity, small individuals, dense aggregation, and high inter-class similarity, which greatly increase the difficulty of pest detection and control. To address these problems, this paper proposes an innovative object detection model named MACNet. MACNet is built on YOLOv8s: it introduces a content-based feature sampling strategy to obtain richer object feature information and adopts distribution shifting convolution, which not only improves detection accuracy but also reduces the size of the model, making it more suitable for deployment in real environments. Our results on the Pest24 dataset verify the good performance of MACNet; its detection accuracy reaches 43.1 AP, which is 0.5 AP higher than that of YOLOv8s, while the computational cost is reduced by about 30%. This result not only demonstrates the efficiency of MACNet in agricultural pest detection, but also further confirms the great potential and practical value of deep learning in complex application scenarios.

1. Introduction

Agriculture plays a vital role in maintaining the quality of life of the world’s population and is a key foundation for economic growth. However, insect pests significantly reduce crop yields and have become a major obstacle to agricultural development [1]. At present, pest management relies mainly on the widespread use of chemical pesticides, which often fails to account accurately for the species and distribution of pests and may lead to environmental pollution and food safety risks. Therefore, there is an urgent need to grasp pest information quickly and accurately in order to implement more effective control strategies [2]. Traditional methods of obtaining pest information rely mainly on direct human involvement and physical capture tools, including visual inspection and the use of sticky traps for pest capture and counting, which are labor-intensive, inefficient, and susceptible to human error. As a result, automation and digitization technologies, such as image recognition and remote sensing, are increasingly being adopted in modern agriculture to improve the efficiency and accuracy of pest monitoring [3].
With the rapid development of deep learning and computer vision, object detection algorithms have gradually been applied to the agricultural field, and significant progress has been made [4]. In early research, pest texture, color, shape, and other hand-crafted characteristics were extracted with computer vision techniques and fed to classifiers such as support vector machines (SVMs) and K-nearest neighbors to detect target pests. For example, Hasan et al. [5] used a DCNN model to extract rice disease characteristics and input them into an SVM classifier, successfully identifying and classifying nine rice diseases with an accuracy of 97.5%. To mine richer classification features, Yalcin et al. [6] compared a CNN model against SVM classifiers with different kernels and feature descriptors such as LBP and GIST, and the results showed the effectiveness of the proposed method. However, the complexity of the real field environment greatly affects the design and extraction of pest features, and despite these advances, traditional pest detection methods still face many challenges. Over time, deep learning has shown growing potential for agricultural pests, with researchers using time series forecasting, deep learning, and convolutional neural network models to automatically identify specific pests and help farmers take timely control measures. For example, Rong et al. [7] proposed an improved object detection algorithm based on the Mask R-CNN model that balances semantic and spatial information by increasing the fusion weight coefficients of feature layers at different scales in the feature pyramid; this was experimentally shown to help both the accuracy and the efficiency of pest identification and counting. Similarly, Wang et al. [8] built on the Faster R-CNN detector by optimizing the feature pyramid structure, enlarging the receptive field of the lower layers, and introducing a bilinear interpolation algorithm and an attention module in the backbone network to improve the detection of small pests.
With the continuous progress of global development, computer technology is increasingly integrated into various industries, including but not limited to the medical field [9,10], the ecological field [11], and the sensing field [12]. This development trend is accompanied by a significant improvement in the accuracy of object detection algorithms, which promotes their wide application in many fields. In particular, the YOLOv8 algorithm provides strong technical support in many fields, including agricultural pest detection, with its high accuracy and wide range of applications. In view of the unique challenges faced by agricultural pest detection, such as the complexity of the field environment and the small size of pests, YOLOv8s was optimized in this study and an object detection model named MACNet (More Accurate and Convenient Network) was designed. MACNet is designed to meet the specific needs of agricultural pest detection scenarios, effectively improving the accuracy and practicability of field pest detection. The specific contributions of this paper are as follows:
1. We introduce a CARMF (Content-Aware Reassembly of Multiple Features) upsampling method, effectively preserving the original information of the sampled feature map while achieving higher detection accuracy with a limited increase in parameters and computational burden.
2. To achieve a faster and more accurate collection of agricultural pest information, we incorporate DSConv (Distribution Shifting Convolution), reducing the model’s parameter count and computational burden during convolution operations. This convolution method is applied to multiple modules in the network structure to maximize model lightweight.
Through these enhancements, we aim to explore superior object detection performance and contribute to the advancement of agricultural pest monitoring and control.

2. Related Work

2.1. YOLO Series Algorithms

General object detection algorithms have been widely used in many fields, such as autonomous driving, smart healthcare, and industrial inspection [13,14,15]. Among them, the YOLO series, as an outstanding representative of single-stage object detection algorithms, significantly improves the real-time performance of object detection by integrating feature extraction and classification into a single neural network. As a result, the YOLO series has developed rapidly and has become the core real-time object detection system for applications such as vehicle systems, medical inspection, and human–computer interaction [16,17,18], and it has begun to be used in the field of pest detection. For example, Cheng et al. [19] proposed YOLOLite-CSG, a simplified version of YOLOv3 that improves prior-box generation and adopts lightweight hourglass blocks and coordinate attention; it surpasses YOLOv3 on the CP15 pest dataset while reducing the number of parameters and computations, making it more suitable for field pest detection equipment. In addition, Chu et al. [20] combined YOLOv5 with the ECA attention mechanism to capture the detailed feature information of tiny objects and introduced BiFPN to fuse low-level and high-level feature information, achieving 98.2% detection accuracy on a granary pest dataset containing 5231 images. For small-scale pests, Tian et al. [21] proposed the MD-YOLO network, which fuses DenseNet and an Adaptive Attention Module (AAM) to enhance image detail capture and feature use, and detects three small pest species.
In addition, the newer YOLOv8s achieves a strong balance between detection speed and accuracy; however, its accuracy on small-scale objects still lags well behind that on medium- and large-scale objects. Therefore, there is still considerable room for improvement when YOLOv8s is applied in specific fields, such as medical image detection [22], agricultural inspection [23], and aerial image detection [24]. For example, Lou et al. [25] adopted a novel downsampling method when processing camera sensor data, which better retained contextual feature information, and improved the feature fusion network to combine shallow and deep information effectively, thereby improving detection accuracy by 0.5%. Similarly, Li et al. [26] improved the neck of the YOLOv8s network following the Bi-PAN-FPN idea, fully considering and effectively reusing multi-scale features to achieve a more advanced and comprehensive feature fusion process while keeping the parameter cost as low as possible. In addition, Li et al. [27] introduced the MHSA attention module into the YOLOv8s network to further process the features extracted by the backbone, increase the feature weights of the object region, and extract more useful feature information, improving detection accuracy on a tomato dataset. These YOLOv8s-based improvements show that, by capturing richer object feature information, the diversity and complexity of objects can be handled more effectively, thereby improving detector performance in various complex environments.

2.2. Feature Sampling

Feature upsampling involves the fusion of high- and low-resolution feature maps and is widely used in many state-of-the-art architectures, such as feature pyramid networks [28] and stacked hourglass networks [29]. In general, the common interpolation algorithms used for upsampling mainly consider the spatial positions of pixels; they cannot capture the rich semantic information required for object detection tasks and may lead to problems such as noise amplification, increased computational complexity, and image blur. In addition, commonly used adaptive upsampling methods such as deconvolution [30] operate by learning a set of upsampling kernels, but applying the same kernel to the whole image ignores differences in the underlying content, which limits the sensitivity of the convolutional kernel to local changes; moreover, the number of parameters increases significantly when larger kernels are used.
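For concreteness, the contrast between these two upsampling families can be seen in a few lines of PyTorch; this is an illustrative sketch, not code from the paper, and the tensor sizes are arbitrary:

```python
# Illustrative sketch: fixed interpolation vs. a learned deconvolution whose single
# kernel is shared across the whole image; parameters grow quickly with kernel size.
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 256, 20, 20)                       # a low-resolution feature map

# Interpolation: no parameters, only pixel positions are considered.
up_nearest  = F.interpolate(x, scale_factor=2, mode="nearest")
up_bilinear = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

# Deconvolution: one learnable kernel applied everywhere.
deconv3 = nn.ConvTranspose2d(256, 256, kernel_size=3, stride=2, padding=1, output_padding=1)
deconv7 = nn.ConvTranspose2d(256, 256, kernel_size=7, stride=2, padding=3, output_padding=1)

print(up_nearest.shape, deconv3(x).shape)             # both (1, 256, 40, 40)
print(sum(p.numel() for p in deconv3.parameters()),   # ~0.59 M weights
      sum(p.numel() for p in deconv7.parameters()))   # ~3.2 M weights for the larger kernel
```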
With progress in semantic segmentation and super-resolution, a variety of deep learning upsampling methods have emerged and achieved good results. For example, the efficient sub-pixel convolutional neural network proposed by Shi et al. [31] places sub-pixel convolution at the end of the model to better fuse low-resolution feature maps, generates super-resolution images, and has fast inference speed. Tian et al. [32] proposed a simple and effective data-dependent upsampling method to recover pixel-level segmentation predictions from the coarse output of a convolutional decoder, replacing traditional bilinear interpolation. Such methods are learned from data and are generally more effective than linear interpolation; however, they need to train different networks to accommodate different magnifications, and for non-integer magnifications the calculation is more complex and less effective, whereas linear interpolation remains simpler and more practical. To address these problems, Hu et al. [33] proposed the Meta-SR method to solve super-resolution for arbitrary scale factors, including non-integer ones. For each magnification factor, the Meta-Upscale module takes the scale factor as input, dynamically predicts the weights of the upscaling filter, and uses these weights to generate super-resolution images of any size. Meanwhile, Wang et al. [34] proposed the CARAFE upsampling method, which supports content-aware processing to obtain richer semantic information about the object and effectively improves object detection.

2.3. Convolution Operator

In recent years, deep convolutional neural networks have become the core algorithms of computer vision, and convolution operations are a key component of them, playing an important role in network performance. Their main purpose is to extract object feature information. This is achieved by randomly initialized convolutional kernel filters, where shallow filters capture low-level features (simple shapes such as points, lines, and surfaces) and deep filters extract abstract semantic information. Typically, a convolution operation consists of a set of fixed-size convolution kernels with learnable parameters that slide across the feature map and perform a weighted summation to produce a result. This process can be repeated multiple times to obtain more feature maps. Traditional convolution operations have the beneficial characteristics of sparse interaction, parameter sharing, and equivariant representation.
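As a minimal illustration of this sliding weighted sum and of parameter sharing (not taken from the paper), consider the following PyTorch sketch:

```python
# One 3x3 kernel is shared across all spatial positions; each output pixel is the
# weighted sum of the corresponding input neighbourhood.
import torch
import torch.nn.functional as F

img = torch.randn(1, 1, 5, 5)                    # toy single-channel feature map
kernel = torch.randn(1, 1, 3, 3)                 # one learnable 3x3 filter

out = F.conv2d(img, kernel, padding=1)           # the same kernel slides over every position
print(out.shape)                                 # (1, 1, 5, 5)

# Explicit computation for the output pixel at row 2, col 2:
patch = img[0, 0, 1:4, 1:4]                      # its 3x3 input neighbourhood
print(torch.allclose(out[0, 0, 2, 2], (patch * kernel[0, 0]).sum()))   # True
```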
With the continuous expansion of object detection scenarios, traditional convolution can no longer meet all detection requirements, and many new convolution methods have emerged. For example, Yu et al. [35] introduced dilated convolution to address the limitations of semantic segmentation encoders in information extraction. It allows multi-scale contextual information to be aggregated without losing resolution or rescaling the image, producing a larger receptive field that facilitates richer feature extraction; although it has also been applied to object detection, it poses certain challenges in training and optimization. Chollet [36] processed the spatial and channel dimensions with separate convolutional kernels, which requires fewer parameters and less computation than traditional convolution while extracting richer feature information. Given the limitations of traditional convolution and its fixed geometric modeling, Dai et al. [37] proposed deformable convolution and deformable pooling, which allow free deformation of the sampling grid by learning offsets, realizing adaptive localization of objects of different shapes. In general, lightweight convolutional neural networks limit the depth and width of the network because of their low computing budget, resulting in limited representation capability and degraded performance. To solve this problem, Chen et al. [38] proposed dynamic convolution, which dynamically aggregates multiple parallel convolution kernels based on input-dependent attention, increasing model capacity without increasing network depth or width; compared with static convolution, the representation ability is significantly enhanced, improving model accuracy. Nascimento et al. [39] proposed distribution shifting convolution (DSConv), which achieves lower memory usage and higher speed by storing integer values in a variable quantized kernel while maintaining the same output as the original convolution through kernel- and channel-based distribution shifts; in practice there is almost no loss of accuracy, and the model becomes lighter.
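As a hedged illustration of the space/channel factorization idea in [36], the following sketch compares the parameter counts of a standard convolution and a depthwise separable one (the layer sizes are arbitrary):

```python
# Depthwise separable convolution: a per-channel spatial kernel followed by a 1x1
# channel-mixing kernel, compared against a standard 3x3 convolution of the same size.
import torch.nn as nn

c_in, c_out, k = 128, 128, 3

standard = nn.Conv2d(c_in, c_out, k, padding=1, bias=False)
depthwise_separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False),  # one kernel per channel (space)
    nn.Conv2d(c_in, c_out, 1, bias=False),                         # 1x1 conv mixes channels
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(depthwise_separable))   # 147456 vs. 17536 parameters
```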

3. Our Approach

The YOLOv8s network consists of three main parts: Backbone, Neck, and Head. In the Backbone, the C2f unit combines the C3 structure of YOLOv5 with the ELAN architecture of YOLOv7, which makes the module lighter and better able to obtain gradient flow information, thereby improving accuracy. At the end of the Backbone, the widely used SPPF module is retained. The Neck continues to adopt the FPN-PAN structure, which fuses and exploits feature information at different scales. The Head adopts a decoupled-head structure that separates the classification and regression tasks, abandons anchor boxes, and uses an anchor-free design instead, avoiding the complexity of anchor-box hyperparameter tuning and simplifying the algorithm structure.
Based on the above architecture, the MACNet network in Figure 1 introduces the CARMF module, the DSConv module, and the DS_C2f module. First, by improving the traditional feature sampling in the neck, the CARMF module implements upsampling that captures richer and more complex feature details, thereby improving the detection of small-scale objects such as pests. Second, the convolution operations of the backbone and neck are replaced with a more efficient form, which reduces the cost of convolution computation and facilitates deployment of the algorithm. In addition, these adjustments are extended to modules such as C2f in the neck to maintain training speed.

3.1. A More Accurate Upsampling Operator

As a key operation in convolutional neural networks, feature upsampling plays a crucial role in object detection tasks. Upsampling in YOLOv8s uses a traditional interpolation method, which simply fills the new pixels from the values of nearby pixels after the feature map is enlarged and therefore cannot make full use of other semantic information in the feature map, such as texture, shape, and color. In addition, YOLOv8s removes the 1 × 1 convolution before upsampling, so features from different stages of the backbone network are fed directly into the upsampling process. Therefore, an effective feature upsampling operator is very important for the YOLOv8s network.
In order to solve the above problems, we propose a CARMF module based on the Content-Aware ReAssembly of FEatures (CARAFE) upsampling idea and apply it to the YOLOv8s network, achieving a good performance improvement. As shown in Figure 2, the CARMF upsampling process consists of two main parts: the kernel prediction module and the content-aware multi-feature reassembly module. The kernel prediction module is similar to that of CARAFE, predicting a reassembly kernel for each object position, while the content-aware multi-feature reassembly module uses these predicted kernels to reassemble the features after multiple sampling processes to obtain the final output. For an input feature map of size H × W × C, the channel number is first compressed to C_m using a 1 × 1 convolution, which does not affect the final result and reduces the complexity of subsequent calculations. Assuming the upsampling kernel size is k_up × k_up, a convolutional layer of size k_encoder × k_encoder is used to encode the compressed feature map in the content encoder block, generating a feature map of size H × W × (α²·k_up²), where α is the upsampling factor. Here, k_up is related to the receptive field; the larger k_up is, the more strongly the generated weights correlate with the surrounding content, but the computational cost also grows with the square of the kernel size. The feature map is then reshaped to αH × αW × k_up² and normalized with the Softmax function over the channels of each pixel so that the weights of each reassembly kernel sum to 1; finally, the k_up² predicted weights of each pixel are reshaped into a k_up × k_up kernel.
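A minimal PyTorch sketch of such a kernel prediction branch is shown below; it follows the description above (with C_m = 64 as in Figure 2), but the class name and implementation details are our own reading rather than the authors' released code:

```python
# CARAFE-style kernel prediction: channel compression, content encoding, pixel shuffle
# to the upsampled resolution, and Softmax normalisation of each reassembly kernel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelPredictor(nn.Module):
    def __init__(self, c_in, c_mid=64, k_encoder=3, k_up=5, alpha=2):
        super().__init__()
        self.compress = nn.Conv2d(c_in, c_mid, 1)                       # 1x1 channel compression
        self.encode = nn.Conv2d(c_mid, alpha * alpha * k_up * k_up,
                                k_encoder, padding=k_encoder // 2)       # content encoder
        self.expand = nn.PixelShuffle(alpha)                             # (a^2*k_up^2, H, W) -> (k_up^2, aH, aW)

    def forward(self, x):                      # x: (B, C, H, W)
        w = self.encode(self.compress(x))      # (B, a^2*k_up^2, H, W)
        w = self.expand(w)                     # (B, k_up^2, aH, aW)
        return F.softmax(w, dim=1)             # each reassembly kernel sums to 1

kernels = KernelPredictor(c_in=256)(torch.randn(1, 256, 20, 20))
print(kernels.shape)                           # torch.Size([1, 25, 40, 40])
```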
In the content-aware multi-feature reassembly module, we have incorporated the dynamic instance interaction concept from Sparse RCNN [40], which involves one-to-one interaction between candidate features and RoI (Region of Interest) features extracted from candidate boxes. This interaction enhances the obtained features, making them more suitable for object localization and classification. Similarly, for the adjacent k × k regions of different sampling points n, we introduce an information interaction mechanism between different feature maps. Firstly, we generate feature maps k_1 × k_1, k_2 × k_2, and k_3 × k_3 for each sampling point using three different upsampling methods. The choice of upsampling method affects the details, spatial context information, and computational efficiency of the feature maps; therefore, we employ nearest neighbor, bilinear, and trilinear interpolation in turn to achieve a more comprehensive and balanced processing effect. Additionally, drawing on the residual concept, we perform element-wise multiplication between feature map k_3 and the concatenation of feature maps k_1 and k_2 to better preserve and propagate information within the feature maps. Finally, the reassembled feature maps obtained through the kernel prediction module are combined to output feature maps of size αH × αW × C. Compared to the traditional simple pixel-filling method, our approach fully leverages the interaction between the reassembly kernels generated by the kernel prediction module and the feature maps. The resulting pixels not only contain information from neighboring pixels but also encompass richer feature information. Furthermore, our method can aggregate context information within a wider receptive field, perform upsampling based on feature content rather than positional distance, obtain richer semantic information, and introduce only a small amount of computational cost.
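The reassembly step itself can be sketched in the CARAFE style as follows; note that this simplified sketch omits CARMF's interaction between the nearest, bilinear, and trilinear branches and only shows how the predicted kernels reweight each k_up × k_up neighborhood:

```python
# CARAFE-style reassembly: each upsampled pixel is a content-aware weighted sum of the
# k_up x k_up neighbourhood of its source pixel, using the predicted (Softmax) kernels.
import torch
import torch.nn.functional as F

def reassemble(x, kernels, k_up=5, alpha=2):
    """x: (B, C, H, W); kernels: (B, k_up**2, alpha*H, alpha*W), softmax-normalised."""
    b, c, h, w = x.shape
    # gather the k_up x k_up neighbourhood of every source pixel
    patches = F.unfold(x, k_up, padding=k_up // 2).view(b, c * k_up * k_up, h, w)
    # every output pixel reuses the neighbourhood of its nearest source pixel
    patches = F.interpolate(patches, scale_factor=alpha, mode="nearest")
    patches = patches.view(b, c, k_up * k_up, alpha * h, alpha * w)
    # weighted sum with the predicted reassembly kernels
    return (patches * kernels.unsqueeze(1)).sum(dim=2)          # (B, C, alpha*H, alpha*W)

x = torch.randn(1, 256, 20, 20)
kernels = torch.softmax(torch.randn(1, 25, 40, 40), dim=1)      # stand-in for the predictor output
print(reassemble(x, kernels).shape)                              # torch.Size([1, 256, 40, 40])
```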

3.2. Faster Convolution

Convolutional neural networks often require a large amount of memory and computational resources, especially in the convolutional layers. For example, in ResNet50 [41], over 90% of the computational time and memory is consumed by the convolutional layers. Therefore, to improve the speed and efficiency of the network, it is necessary to reduce the computational burden of the convolutional layers. To better meet the needs of real-time agricultural pest detection, we adjusted the standard convolutional layers in the YOLOv8s network for memory efficiency and speed to achieve a lighter model, choosing different locations in the network according to the balance between accuracy and efficiency. As shown in Figure 1, we replaced the standard convolutional modules of the backbone and neck with the more efficient distribution shifting convolution, such as the DS_CBS module in Figure 3. To further optimize the convolutional computation, we also changed the convolution of the Bottleneck module in the C2f module of the neck to form the DS_Bottleneck module; the modified DS_C2f module is shown in Figure 3.
The overall concept of DSConv is illustrated in Figure 4; it simulates the behavior of the convolutional layer by decomposing the traditional convolutional kernel and introducing quantization and distribution shifts. First, the VQK stores only a non-trainable integer-valued tensor, serving as prior information and capturing the essence of the feature types to be extracted; this quantized component enables lower memory usage and higher speed. The other component consists of two distribution shifting tensors used to move the variable kernel distribution, where the KDS shifts the kernel distribution and the CDS shifts the channel distribution. They position the weights of the quantized tensor within a range that mimics the distribution of the original pre-trained network, thereby simulating the original convolutional kernel and maintaining the same output as the original convolution. In other words, DSConv improves computational efficiency and memory usage by quantizing the weights and finding the best distribution shift of the integer weights. For example, in the third convolutional layer of the MACNet network, with 128 channels and a kernel size of 3, the original single-precision weight tensor has shape (128, 128, 3, 3); setting the bit width to 2 bits and the block size to 64, the saved VQK is a 2-bit tensor of shape (128, 128, 3, 3), the saved KDS has shape (128, 2, 3, 3), and the saved CDS has shape (128). With DSConv, the storage required for the convolutional kernel is reduced to about 7% of the original convolution, significantly reducing the computational cost.
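The quoted figure can be checked with a quick back-of-the-envelope calculation; here we assume 2-bit VQK entries and 16-bit shift tensors, since the exact storage types are not spelled out above:

```python
# Rough storage comparison for the example layer above (128 in/out channels, 3x3 kernel,
# 2-bit quantisation, block size 64); storage widths of KDS/CDS are an assumption.
fp32_bits = 128 * 128 * 3 * 3 * 32                     # original single-precision kernel
vqk_bits  = 128 * 128 * 3 * 3 * 2                      # quantised kernel, 2 bits per weight
kds_bits  = 128 * (128 // 64) * 3 * 3 * 16             # one shifter per 64-channel block
cds_bits  = 128 * 16                                   # one shifter per output channel

ratio = (vqk_bits + kds_bits + cds_bits) / fp32_bits
print(f"{ratio:.1%}")                                  # ~7.1%, in line with the ~7% quoted above
```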

4. Experiment

4.1. Experimental Setting and Evaluation Methods

All experiments were run on a Dell PowerEdge 640 server equipped with one GeForce RTX 3090 GPU (24 GB of video memory), with a software environment of Ubuntu 20.04, CUDA 11.3, Python 3.9, and PyTorch 1.10.2. Training used the SGD optimizer with a batch size of 8, an initial learning rate of 0.01, a learning-rate decay of 0.0005, and an input image size of 640 × 640. The COCO evaluation metrics are used as indicators: AP (mean AP over IoU thresholds 0.5:0.95), AP50 (AP at IoU = 0.5), AP75 (AP at IoU = 0.75), APs (AP for small objects), APm (AP for medium objects), and APl (AP for large objects).
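For reproducibility, these hyperparameters map onto the Ultralytics training API roughly as follows; the dataset configuration file and epoch count are placeholders, and this call trains the unmodified YOLOv8s baseline rather than MACNet:

```python
# Sketch of a baseline YOLOv8s training run with the hyperparameters stated above.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")
model.train(
    data="pest24.yaml",      # hypothetical dataset config pointing at the Pest24 images/labels
    imgsz=640,               # input image size 640 x 640
    batch=8,                 # batch size
    optimizer="SGD",
    lr0=0.01,                # initial learning rate
    weight_decay=0.0005,     # the 0.0005 decay above, assumed here to be weight decay
    epochs=100,              # placeholder; the epoch count is not stated in the text
    device=0,                # single RTX 3090
)
```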

4.2. Pest24 Dataset

The Pest24 dataset [42] used in this study was proposed by the Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences. It is a standardized, large-scale, multi-object, high-resolution agricultural pest image dataset designed for precision agriculture applications. The dataset contains 25,378 images and 192,422 instance labels and covers 24 classes of pests designated for detection by the Chinese Ministry of Agriculture. The names of the various categories are shown in Figure 5, and these pests belong to different orders in the insect world, including Coleoptera, Hemiptera, Orthoptera, and Lepidoptera, with most of the pests belonging to Lepidoptera. The distribution of instances among different pest categories in Pest24 is significantly imbalanced, with certain pests, such as the Anomala corpulenta, Bollworm, Meadow borer, Athetis lepigone, and Holotrichia parallela, having relatively high instance numbers, all exceeding 10,000, while pests like the Holotrichia oblita, Eight-character tiger, and Nematode trench have fewer than 200 instances. This reflects the differences in the distribution of instances among different pest categories in Pest24, as well as the characteristics of real agricultural environments. Since Pest24 was collected from the field, it possesses a high degree of authenticity and reliability. The distribution of different pest categories under different vegetation, lighting, and climatic conditions varies, and this characteristic of the dataset helps to more accurately reflect the actual distribution of pests and improve the robustness of detection algorithms.
Figure 6 displays some of the images in Pest24, showing that some pests bear a high degree of visual similarity. Wang et al. used the aHash algorithm to estimate the similarity between any two types of pests in Pest24. The results indicate that 85.5% of the object similarities in Pest24 are greater than 0.5, with 23.5% exceeding 0.55, far surpassing Pascal VOC and MS COCO. Additionally, pests in real field environments are often small in size and not easily recognizable by the human eye. Therefore, we assessed the scale characteristics of the Pest24 dataset according to the scale standards defined by the COCO dataset [43]. The results, as shown in Figure 7, reveal that the number of small-scale and medium-scale objects in Pest24 is almost equal, while large-scale objects account for less than 1%. In comparison to Pest24, the MS COCO dataset has a more balanced distribution of objects across all scales. On the other hand, the Pascal VOC dataset has a higher proportion of large-scale objects, reaching 58%, far exceeding Pest24. This indicates that Pest24 leans more towards small-scale objects, making object detection on this dataset more challenging than on general datasets.
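For reference, the COCO scale convention used here classifies an object by its pixel area; a minimal helper makes the thresholds explicit:

```python
# COCO scale convention [43]: "small" below 32^2 pixels of area, "medium" up to 96^2,
# and "large" beyond that.
def coco_scale(box_w: float, box_h: float) -> str:
    area = box_w * box_h
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

print(coco_scale(20, 18), coco_scale(50, 45), coco_scale(120, 110))   # small medium large
```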

4.3. Experimental Results and Analysis

4.3.1. Ablation Experiments

To validate the effectiveness of the proposed method, we conducted a series of ablation experiments on the Pest24 dataset using YOLOv8s as the baseline. Table 1 summarizes the performance of the different methods built on YOLOv8s. With DSConv, the model has slightly fewer parameters, and the overall computational load decreases by about 30% compared to YOLOv8s. This variant also reaches 400 FPS, compared with 300 FPS for YOLOv8s, and is therefore faster. While this method does not improve overall accuracy, it improves the detection of small-scale and large-scale objects by 0.8 AP and 0.9 AP, respectively. This is particularly advantageous for datasets like Pest24 that are biased towards small-scale objects, indicating that this variant is well suited for deployment in the field. The CARMF module achieves a 0.6 AP improvement over YOLOv8s, with respective improvements of 0.8 AP and 1.8 AP for small-scale and large-scale objects. This suggests that an effective upsampling method is highly applicable in the YOLOv8s model, especially when dealing with small-scale object datasets such as Pest24. In the end, our combined approach obtains an overall 0.5 AP gain on Pest24. Therefore, when pests must be detected in the field under limited computing conditions and results are needed quickly, the DSConv variant can be used; conversely, when higher detection accuracy is needed and conditions allow, the CARMF variant can be chosen. Of course, the combined model proposed in this paper, MACNet, can also be used to maintain a balance between speed and accuracy.
It is worth noting that, when the built-in validation code of YOLOv8s is used to calculate detection results, the baseline result on Pest24 differs by 2.7 AP from the COCO validation method. We believe the difference lies in how the sampling points of the PR curve are obtained: YOLOv8s uses linear interpolation, while the COCO method uses the nearest measured point at or above each recall threshold. Here, we chose the COCO validation protocol because we consider this sampling approach more reasonable and because the COCO protocol is more widely used.
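One plausible reading of this discrepancy is the difference between sampling the precision envelope at 101 fixed recall points (COCO style) and linearly interpolating along the measured curve; the sketch below contrasts the two on a toy PR curve (it is illustrative, not either tool's exact code):

```python
# Two ways of integrating a precision-recall curve into AP.
import numpy as np

def ap_from_pr(recall, precision, method="coco"):
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.flip(np.maximum.accumulate(np.flip(p)))       # monotone precision envelope
    x = np.linspace(0.0, 1.0, 101)                       # 101 recall sampling points
    if method == "linear":                               # linear interpolation between points
        return np.mean(np.interp(x, r, p))
    idx = np.searchsorted(r, x, side="left")             # nearest measured recall >= threshold
    return np.mean(p[np.clip(idx, 0, len(p) - 1)])

rec = np.array([0.1, 0.4, 0.7, 0.9])
prec = np.array([0.95, 0.80, 0.60, 0.40])
print(ap_from_pr(rec, prec, "coco"), ap_from_pr(rec, prec, "linear"))   # slightly different values
```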

4.3.2. CARMF Module Analysis

Table 2 displays the detection accuracy of various pest categories before and after the implementation of the CARMF module in the YOLOv8s network. This method has effectively improved the detection accuracy across almost all pest categories. We have defined categories with fewer than 2000 instances in the dataset as “categories with fewer instances”. Upon analysis, it was found that the detection accuracy significantly improved in these categories with fewer instances. Among the twelve categories with fewer instances in Pest24, the detection accuracy of nine categories showed substantial improvement. Figure 8 provides a detailed overview of the improvement in detection accuracy for these nine categories. While the overall model only showed a 0.6 AP improvement, the MACNet model’s average accuracy in these nine categories with fewer instances increased by 1.4 AP. This suggests that even with relatively few instances of pests, the inclusion of the CARMF method can still achieve superior performance and gather more information about the object features. In essence, the MACNet detector in this study contributes to a more comprehensive collection of pest information in the field, reducing omissions and providing greater support for agricultural pest control.
At the same time, there are a large number of images in Pest24 where the target pests are gathered or stuck together, as illustrated in the three images in the first column of Figure 9. The gathering or sticking of objects not only makes labeling difficult but also significantly increases the challenge of pest detection. Building on this, we further validated the effectiveness of CARMF upsampling by testing the network’s eighth-layer C2f module with different sampling methods and generating corresponding heatmaps for comparison, as depicted in Figure 9. YOLOv8s, with its traditional sampling method, exhibits a relatively dispersed distribution in the object sampling area, with numerous sampling points also spread across the background. In contrast, CARAFE shows slightly better results with a more concentrated sampling area. Compared to CARAFE, however, the heatmaps generated by CARMF sampling more accurately capture the object area, better focus attention on the object, and more precisely represent the object’s shape, preserving semantic information to a greater extent during sampling. Therefore, when these feature maps are used for classification and regression in the final layers of the network, they yield higher accuracy. As indicated by the heatmaps, adding the CARMF module can achieve better detection results even in cases of object clustering or adhesion. This suggests that the improved upsampling method not only enriches the spatial sampling information of the feature maps, but also enhances their discriminative ability. This outcome is significant for object detection tasks, particularly for improving performance in complex scenes and for small object detection, where CARMF upsampling can be effectively utilized.
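The paper does not specify the exact visualization tool, but heatmaps of this kind can be produced by hooking the chosen layer and averaging its activation channels; a generic sketch (with hypothetical argument names) is:

```python
# Generic activation heatmap: hook a layer, average its channels, upsample to input size,
# and normalise to [0, 1] for overlaying on the image. Illustrative only.
import torch
import torch.nn.functional as F

def activation_heatmap(model, layer, image):
    feats = {}
    handle = layer.register_forward_hook(lambda m, i, o: feats.update(out=o))
    with torch.no_grad():
        model(image)
    handle.remove()
    cam = feats["out"].mean(dim=1, keepdim=True)                  # average over channels
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)      # normalise to [0, 1]
    return cam                                                     # overlay on the input to inspect focus
```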

4.3.3. Comparison of the CARAFE Method with the CARMF Method

In addition, we examined the impact of different k_encoder and k_up sizes on the CARMF module. From a theoretical standpoint, the convolutional layer size k_encoder in the content encoder module is closely linked to the receptive field. Increasing the size of k_encoder expands the encoder’s receptive field, allowing the model to utilize contextual information more extensively, which is crucial for predicting k_up. Moreover, achieving a larger k_up requires a corresponding increase in the size of k_encoder, as the content encoder needs a broader receptive field for encoding and generation. However, as the kernel size increases, so does the computational complexity; blindly increasing the size of k_encoder is not a viable option when aiming for a larger receptive field. Therefore, we conducted a series of experiments based on YOLOv8s to compare the effects of different k_encoder and k_up sizes on the results, while also performing comparisons using the same parameters for the CARAFE method. As shown in Table 3, the best results are achieved with the same parameters and with only relatively small k_encoder and k_up values. This indicates that our method can better restore fine details in images and has stronger feature extraction and expression capabilities.

4.3.4. Convolution Comparison Experiments

The introduction of DSConv significantly alleviates the computational burden of convolutions. However, using it more widely in the network does not necessarily lead to better results. Through a series of comparative experiments, we identified the optimal convolution optimization strategy. As shown in the last row of Table 4, we initially optimized the convolutions in both the backbone and the neck, effectively reducing the network parameters and computational load; however, this was accompanied by a noticeable drop in detection accuracy, possibly because lightweighting weakens the backbone’s ability to express complex features. Therefore, we further investigated the effects of optimizing the convolution separately in different modules of the backbone and neck, with a focus on the neck. The final results indicate that optimizing only the convolution module in the backbone while optimizing both the convolution module and the C2f module in the neck achieves a good balance between model lightweighting and detection accuracy.

5. Conclusions

Agricultural pests have a significant negative impact on global crop yields, causing enormous economic losses worldwide each year. This study is based on YOLOv8s and proposes a faster and more accurate object detector, MACNet. By focusing on feature sampling and convolution calculation, the network is trained to extract richer semantic information during the training process, leading to more precise object detection in the final prediction layer. Additionally, the detector can better distinguish visually similar or less frequent instances, which is particularly beneficial for real-world agricultural pest detection in the field. Moreover, the reduced computational load increases the algorithm’s deployment convenience, contributing to improved efficiency in agricultural pest management. In the future, we will continue to research this model, striving to achieve higher detection accuracy and speed in complex field detection scenarios. We also hope that these methods can provide more powerful tools for effective agricultural pest management, mitigating the threat of pests to agricultural production and contributing to global agricultural health and food security.

Author Contributions

Conceptualization, Q.W.; methodology, Y.H., C.W. and Y.Q.; writing—original draft preparation, Y.H.; writing—review and editing, Q.W. and H.W.; data collection, Y.H., Y.Q. and Y.X.; analysis and interpretation of results, Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (No. 61773360, 61973295), the Anhui Provincial Quality Engineering Project (No. 2018jyxm1087) and the Academic funding project for top talents of disciplines in Colleges and universities of Anhui Province (No. gxbjZD2020096).

Data Availability Statement

The Pest24 dataset used in this study can be obtained directly from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mateos Fernández, R.; Petek, M.; Gerasymenko, I.; Juteršek, M.; Baebler, Š.; Kallam, K.; Moreno Giménez, E.; Gondolf, J.; Nordmann, A.; Gruden, K. Insect pest management in the age of synthetic biology. Plant Biotechnol. J. 2022, 20, 25–36. [Google Scholar] [CrossRef]
  2. Jiao, L.; Chen, M.; Wang, X.; Du, X.; Dong, D. Monitoring the number and size of pests based on modulated infrared beam sensing technology. Precis. Agric. 2018, 19, 1100–1112. [Google Scholar] [CrossRef]
  3. Dai, M.; Dorjoy, M.M.H.; Miao, H.; Zhang, S. A New Pest Detection Method Based on Improved YOLOv5m. Insects 2023, 14, 54. [Google Scholar] [CrossRef] [PubMed]
  4. Deepan, P.; Akila, M. Detection and Classification of Plant Leaf Diseases by using Deep Learning Algorithm. Int. J. Eng. Res. Technol. 2018, 6, 1–5. [Google Scholar]
  5. Hasan, M.J.; Mahbub, S.; Alom, M.S.; Nasim, M.A. Rice disease identification and classification by integrating support vector machine with deep convolutional neural network. In Proceedings of the 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), Dhaka, Bangladesh, 3–5 May 2019; pp. 1–6. [Google Scholar]
  6. Yalcin, H.; Razavi, S. Plant classification using convolutional neural networks. In Proceedings of the 2016 Fifth International Conference on Agro-Geoinformatics (Agro-Geoinformatics), Tianjin, China, 18–20 July 2016; pp. 1–5. [Google Scholar]
  7. Rong, M.; Wang, Z.; Ban, B.; Guo, X. Pest identification and counting of yellow plate in field based on improved mask r-cnn. Discret. Dyn. Nat. Soc. 2022, 2022, 1–9. [Google Scholar] [CrossRef]
  8. Wang, Z.; Qiao, L.; Wang, M. Agricultural pest detection algorithm based on improved faster RCNN. In Proceedings of the International Conference on Computer Vision and Pattern Analysis (ICCPA 2021), Guangzhou, China, 19–21 November 2021; pp. 104–109. [Google Scholar]
  9. Lu, T.; Ji, S.; Jin, W.; Yang, Q.; Luo, Q.; Ren, T.-L. Biocompatible and Long-Term Monitoring Strategies of Wearable, Ingestible and Implantable Biosensors: Reform the Next Generation Healthcare. Sensors 2023, 23, 2991. [Google Scholar] [CrossRef] [PubMed]
  10. Mirmozaffari, M.; Shadkam, E.; Khalili, S.M.; Yazdani, M. Developing a Novel Integrated Generalised Data Envelopment Analysis (DEA) to Evaluate Hospitals Providing Stroke Care Services. Bioengineering 2021, 8, 207. [Google Scholar] [CrossRef] [PubMed]
  11. Mirmozaffari, M.; Yazdani, M.; Boskabadi, A.; Ahady Dolatsara, H.; Kabirifar, K.; Amiri Golilarz, N. A Novel Machine Learning Approach Combined with Optimization Models for Eco-efficiency Evaluation. Appl. Sci. 2020, 10, 5210. [Google Scholar] [CrossRef]
  12. Bui, T.H.; Thangavel, B.; Sharipov, M.; Chen, K.; Shin, J.H. Smartphone-Based Portable Bio-Chemical Sensors: Exploring Recent Advancements. Chemosensors 2023, 11, 468. [Google Scholar] [CrossRef]
  13. Niranjan, D.; VinayKarthik, B. Deep learning based object detection model for autonomous driving research using carla simulator. In Proceedings of the 2021 2nd International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 7–9 October 2021; pp. 1251–1258. [Google Scholar]
  14. Han, R.; Liu, X.; Chen, T. Yolo-SG: Salience-Guided Detection of Small Objects in Medical Images. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 4218–4222. [Google Scholar]
  15. Huang, Q.; Yang, K.; Zhu, Y.; Chen, L.; Cao, L. Knowledge Distillation for Enhancing a Lightweight Magnet Tile Target Detection Model: Leveraging Spatial Attention and Multi-Scale Output Features. Electronics 2023, 12, 4589. [Google Scholar] [CrossRef]
  16. Benjumea, A.; Teeti, I.; Cuzzolin, F.; Bradley, A. YOLO-Z: Improving small object detection in YOLOv5 for autonomous vehicles. arXiv 2021, arXiv:2112.11798. [Google Scholar]
  17. Pacal, I.; Karaman, A.; Karaboga, D.; Akay, B.; Basturk, A.; Nalbantoglu, U.; Coskun, S. An efficient real-time colonic polyp detection with YOLO algorithms trained by using negative samples and large datasets. Comput. Biol. Med. 2022, 141, 105031. [Google Scholar] [CrossRef] [PubMed]
  18. Li, Z.; Xu, B.; Wu, D.; Zhao, K.; Chen, S.; Lu, M.; Cong, J. A YOLO-GGCNN based grasping framework for mobile robots in unknown environments. Expert Syst. Appl. 2023, 225, 119993. [Google Scholar] [CrossRef]
  19. Cheng, Z.; Huang, R.; Qian, R.; Dong, W.; Zhu, J.; Liu, M. A lightweight crop pest detection method based on convolutional neural networks. Appl. Sci. 2022, 12, 7378. [Google Scholar] [CrossRef]
  20. Chu, J.; Li, Y.; Feng, H.; Weng, X.; Ruan, Y. Research on Multi-Scale Pest Detection and Identification Method in Granary Based on Improved YOLOv5. Agriculture 2023, 13, 364. [Google Scholar] [CrossRef]
  21. Tian, Y.; Wang, S.; Li, E.; Yang, G.; Liang, Z.; Tan, M. MD-YOLO: Multi-scale Dense YOLO for small target pest detection. Comput. Electron. Agric. 2023, 213, 108233. [Google Scholar] [CrossRef]
  22. Akhtar, S.; Hanif, M.; Malih, H. Automatic Urine Sediment Detection and Classification Based on YoloV8. In Proceedings of the International Conference on Computational Science and Its Applications, Athens, Greece, 3–6 July 2023; pp. 269–279. [Google Scholar]
  23. Wei, Z.; Chang, M.; Zhong, Y. Fruit Freshness Detection Based on YOLOv8 and SE attention Mechanism. Acad. J. Sci. Technol. 2023, 6, 195–197. [Google Scholar] [CrossRef]
  24. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A Small-Object-Detection Model Based on Improved YOLOv8 for UAV Aerial Photography Scenarios. Sensors 2023, 23, 7190. [Google Scholar] [CrossRef] [PubMed]
  25. Lou, H.; Duan, X.; Guo, J.; Liu, H.; Gu, J.; Bi, L.; Chen, H. DC-YOLOv8: Small-Size Object Detection Algorithm Based on Camera Sensor. Electronics 2023, 12, 2323. [Google Scholar] [CrossRef]
  26. Li, Y.; Fan, Q.; Huang, H.; Han, Z.; Gu, Q. A Modified YOLOv8 Detection Network for UAV Aerial Image Recognition. Drones 2023, 7, 304. [Google Scholar] [CrossRef]
  27. Li, P.; Zheng, J.; Li, P.; Long, H.; Li, M.; Gao, L. Tomato Maturity Detection and Counting Model Based on MHSA-YOLOv8. Sensors 2023, 23, 6701. [Google Scholar] [CrossRef] [PubMed]
  28. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  29. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VIII 14. pp. 483–499. [Google Scholar]
  30. Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1520–1528. [Google Scholar]
  31. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  32. Tian, Z.; He, T.; Shen, C.; Yan, Y. Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 3126–3135. [Google Scholar]
  33. Hu, X.; Mu, H.; Zhang, X.; Wang, Z.; Tan, T.; Sun, J. Meta-SR: A magnification-arbitrary network for super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 1575–1584. [Google Scholar]
  34. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar]
  35. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  36. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  37. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  38. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11030–11039. [Google Scholar]
  39. Nascimento, M.G.d.; Fawcett, R.; Prisacariu, V.A. Dsconv: Efficient convolution operator. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5148–5157. [Google Scholar]
  40. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 14454–14463. [Google Scholar]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  42. Wang, Q.-J.; Zhang, S.-Y.; Dong, S.-F.; Zhang, G.-C.; Yang, J.; Li, R.; Wang, H.-Q. Pest24: A large-scale very small object data set of agricultural pests for multi-target detection. Comput. Electron. Agric. 2020, 175, 105585. [Google Scholar] [CrossRef]
  43. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. pp. 740–755. [Google Scholar]
Figure 1. Overall architecture of MACNet. Among them, the CARMF module, the DSConv module, and the DS_C2f module are proposed in this paper.
Figure 2. The overall framework of the CARMF upsampling operator, where α is the upsampling factor and C_m is set to 64.
Figure 3. DS_Bottleneck based DS_C2f module.
Figure 4. The general idea of DSConv. VQK stands for Variable Quantized Kernel, KDS stands for Kernel Distribution Shifter, CDS stands for Channel Distribution Shifter, and ⊙ stands for Hadamard Operator.
Figure 5. Pest24 dataset classification. The ordinal number within each square corresponds to the category index of each pest in the data file.
Figure 6. Pest24 image examples of various types of pests. The numbers in the upper left corner correspond to each pest category in Figure 5.
Figure 7. Proportions of Pest24, Pascal VOC, and MS COCO objects at different scales.
Figure 8. The detection accuracy of categories with a small number of instances has been improved. The blue color represents the YOLOv8s default method result, and the orange color represents the result using the CARMF method.
Figure 9. Comparison of heatmaps for different sampling methods. The three heatmaps, from left to right, represent the linear interpolation originally used in YOLOv8, the CARAFE method on which this paper is based, and the CARMF method we proposed.
Table 1. Ablation experiments on the Pest24 validation dataset.

| Method  | #Params | GFLOPs | AP   | AP50 | AP75 | APs  | APm  | APl  |
|---------|---------|--------|------|------|------|------|------|------|
| YOLOv8s | 11.1 M  | 28.7   | 42.6 | 70.9 | 46.8 | 28.5 | 47.8 | 30.7 |
| +DSConv | 10.3 M  | 20.0   | 42.6 | 70.6 | 47.4 | 29.3 | 47.9 | 31.6 |
| +CARMF  | 11.3 M  | 29.0   | 43.2 | 71.2 | 48.1 | 29.3 | 48.3 | 32.5 |
| MACNet  | 10.5 M  | 20.3   | 43.1 | 71.0 | 48.1 | 29.3 | 48.0 | 32.1 |
Table 2. Comparison of the detection accuracy (AP) of each pest category before and after adding the CARMF module.

| Category | 1    | 2    | 3    | 4    | 5    | 6    | 7   | 8    | 9    | 10   | 11   | 12   |
|----------|------|------|------|------|------|------|-----|------|------|------|------|------|
| YOLOv8s  | 23.9 | 39.2 | 55.1 | 43.5 | 53.4 | 28.7 | 1.7 | 64.1 | 60.1 | 47.4 | 58.7 | 55.9 |
| +CARMF   | 25.8 | 41.8 | 55.0 | 43.9 | 53.5 | 29.1 | 1.5 | 64.6 | 60.4 | 48.3 | 59.0 | 56.4 |

| Category | 13   | 14   | 15   | 16   | 17   | 18   | 19   | 20  | 21   | 22   | 23   | 24   |
|----------|------|------|------|------|------|------|------|-----|------|------|------|------|
| YOLOv8s  | 39.7 | 53.1 | 41.2 | 51.5 | 45.4 | 64.0 | 47.3 | 3.2 | 36.4 | 19.6 | 33.7 | 54.8 |
| +CARMF   | 38.6 | 55.5 | 41.5 | 51.9 | 46.6 | 63.6 | 48.7 | 3.6 | 36.8 | 22.5 | 34.6 | 54.0 |

Note: The first data row of each block is the result of the default method, and the second row is the result after using the CARMF method. Category numbers correspond to the indices in Figure 5.
Table 3. Comparative tests of k_encoder and k_up in different sampling methods.

| Method | [k_encoder, k_up] | #Params | GFLOPs | AP       | AP50     | AP75     | APs      | APm      | APl  |
|--------|-------------------|---------|--------|----------|----------|----------|----------|----------|------|
| CARAFE | [3, 3]            | 11.2 M  | 28.9   | 42.8     | 70.6     | 47.6     | 28.5     | 47.8     | 32.1 |
| CARAFE | [3, 5]            | 11.3 M  | 29.0   | 42.8     | 70.6     | 47.6     | 28.5     | 47.8     | 32.1 |
| CARAFE | [3, 7]            | 11.4 M  | 29.2   | 42.7     | 70.6     | 47.0     | 28.1     | 47.5     | 34.1 |
| CARMF  | [3, 3]            | 11.2 M  | 28.9   | 42.7     | 70.2     | 47.4     | 28.8     | 48.1     | 33.4 |
| CARMF  | [3, 5]            | 11.3 M  | 29.0   | **43.2** | **71.2** | **48.1** | **29.3** | **48.3** | 32.5 |
| CARMF  | [3, 7]            | 11.4 M  | 29.2   | 42.6     | 70.1     | 47.4     | 28.6     | 47.7     | 33.4 |

Note: The best experimental results of the CARMF method are shown in bold.
Table 4. Experiments with different convolutional optimization positions.

| Method            | #Params | GFLOPs | AP       | AP50     | AP75     | APs      | APm      | APl  |
|-------------------|---------|--------|----------|----------|----------|----------|----------|------|
| B1 + B2           | 10.4 M  | 19.9   | 41.7     | 69.3     | 46.0     | 27.3     | 46.9     | 30.0 |
| N1 + N2           | 10.3 M  | 24.0   | 42.3     | 69.8     | 46.8     | 29.0     | 47.1     | 30.7 |
| B1 + N1           | 11.1 M  | 23.8   | 42.4     | 70.0     | 47.0     | 28.0     | 47.7     | 31.6 |
| B2 + N2           | 9.6 M   | 15.2   | 41.8     | 69.3     | 46.2     | 27.5     | 47.0     | 29.1 |
| B1 + N1 + N2      | 10.3 M  | 20.0   | **42.6** | **70.6** | **47.4** | **29.3** | **47.9** | 31.6 |
| B1 + B2 + N1      | 10.4 M  | 19.0   | 41.6     | 69.2     | 45.7     | 27.0     | 46.9     | 32.5 |
| B1 + B2 + N1 + N2 | 9.6 M   | 15.2   | 41.5     | 68.7     | 46.2     | 27.8     | 46.7     | 33.0 |

Note: The best method is shown in bold. B denotes the backbone network, N denotes the neck network, 1 indicates that DSConv is used in the convolution module, and 2 indicates that DSConv is used in the C2f module.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
