1. Introduction
As one of the three major staple food crops worldwide, rice serves as a fundamental pillar of global food security [1]. During its growth cycle, infestation by pests such as rice flies seriously threatens rice production: it not only causes significant yield losses [2,3] but also degrades grain quality through increased chalkiness and decreased protein content, directly affecting national food security and farmers’ economic returns [4,5]. According to 2023 monitoring data from the Agricultural Technology Extension Service Center, the cumulative area affected by rice pests in China reached 5.33 × 10⁸ hm², with the resulting economic losses accounting for more than 55% of total crop pest losses [6]. The State of Food and Agriculture 2024 report of the Food and Agriculture Organization of the United Nations emphasizes that rice cultivation is an important component of the agrifood system [7]. Therefore, improving the efficiency and accuracy of crop pest and disease identification is of great significance for reducing agricultural production losses and promoting high-quality agricultural development [8].
The current rice pest monitoring technology system faces a dual dilemma: traditional manual surveys are limited by low sampling density, high subjective bias, and delayed response [9], making it difficult to meet the demand for precision plant protection [4]. Over the past two decades, many scholars have studied automatic image-based identification of crop pests and diseases [10,11,12]. Traditional machine learning methods classify crop pests and diseases through manually designed feature extraction and classification strategies, and have achieved some success in specific scenarios. For example, Thenmozhi et al. employed digital image processing techniques for preprocessing, segmentation, and geometric shape extraction to identify insect species in sugarcane crops, achieving high accuracy across nine shape categories [13]. Wang et al. designed an automatic insect identification system based on support vector machines (SVM), attaining a recognition accuracy of 93% [14]. Larios et al. proposed a classification method combining Haar features with random forests (RF), which improved recognition performance for aquatic stonefly species [15]. Li et al. applied spectral regression linear discriminant analysis (SR-LDA) for dimensionality reduction, followed by K-nearest neighbor (KNN) classification, achieving 90% accuracy in recognizing unclassified insect images [16]. However, these approaches rely heavily on expert-driven feature engineering and therefore generalize poorly in complex field environments [17].
With the continuous improvement of computing power and the rapid development of deep learning, research on deep-learning-based intelligent detection of agricultural pests has grown explosively [18,19]. Current object detection algorithms fall into two main categories of architecture. Two-stage models, such as the R-CNN series [20,21], generate candidate regions through a region proposal network; they offer high detection accuracy but suffer from slow inference and a large memory footprint, which makes deployment on resource-constrained field equipment difficult [22,23]. Single-stage models, such as YOLO [24] and SSD [25], adopt an end-to-end detection strategy; among them, the YOLO series balances accuracy and speed through a grid-based prediction mechanism and has become the preferred solution for agricultural scenarios [26]. For example, the PestLite crop pest detection model proposed by Dong et al. compresses the YOLOv5 parameter count to 1.2 M through multilevel spatial pyramid pooling, reducing computational cost by 32% while maintaining 85.7% mAP [27]. The rice pest detection model proposed by Zhou et al. uses GhostNet to reconstruct the YOLOv4 backbone, reducing model size by 41% and increasing inference speed to 67 FPS [28]. Liao introduced a lightweight GsConv module with dilated convolution into YOLOv7 to enhance feature extraction for small target spots, reducing the missed detection rate to 6.3% [29]. Li et al. suppressed complex background interference by integrating a channel–spatial dual-attention mechanism and the EfficientIoU loss function, improving YOLOv5’s recognition accuracy by 11.2 percentage points in pest occlusion scenes [30]. Di et al. proposed TP-YOLO, a lightweight attention-based network that introduces a context converter and a full-dimensional dynamic convolution module for enhanced feature extraction [31]. Sun et al. implemented three core improvements on the YOLOv8l architecture: adopting an asymptotic feature pyramid network to optimize multi-scale feature fusion, reconfiguring the C2f module to achieve 55.26% parameter compression, and integrating an attention mechanism to enhance feature discrimination; experiments show that this scheme improves mAP by 1% while maintaining detection efficiency [32]. Hu et al. introduced a global contextual attention module to enhance feature characterization and optimized cross-layer feature fusion with a bidirectional feature pyramid network, improving mAP by 5.4% over the YOLOv5 baseline [33].
Deep learning offers an effective solution for intelligent pest detection, significantly enhancing both accuracy and processing speed in modern agriculture. To address the time-consuming, labor-intensive nature of traditional manual detection and the low accuracy of existing machine learning models in complex farmland scenarios, this study proposes MTD-YOLO, a high-precision rice pest detection model based on the YOLOv8 framework, with the following core improvements and contributions:
The original YOLOv8 backbone network (5.08 M parameters) is replaced by the lightweight MobileNetV3 (2.97 M parameters), which relies on depthwise separable convolution (illustrated in the code sketch following this list of contributions) to remove about 2.11 M parameters (a 41.5% reduction) and to shrink the model size from 21.5 MB to 11.1 MB (a 48.4% reduction), while preserving a strong representation of fine-grained pest features.
The C2f module is fused with the Triplet Attention module to construct the C2f-T structure, which alleviates the confusion between leaf texture and pest region features by capturing spatial location relationships, channel dependencies, and cross-scale contextual information in parallel (a compact sketch of the attention block also follows this list).
Dynamic Head is introduced to replace the original detection head, using its scale-aware, spatial-aware, and task-aware triple-attention mechanism to dynamically enhance the semantic clarity of pest targets and the spatial focus on them.
A diversified data augmentation strategy was adopted, specifically including geometric transformations (horizontal/vertical flipping, random rotation); lighting adjustments (dynamic adjustment of brightness and exposure); noise interference (Gaussian noise); and weather simulation (raindrop degradation effect). This approach systematically covers the main types of interference that may be encountered during field testing. The experimental dataset covers 12 types of typical agricultural pests and is characterized by significant biodiversity and scene complexity.
The synergistic effect of the module combinations was demonstrated by ablation experiments. Using MobileNetV3 alone improved mAP@0.5 from 85.8% to 88.3%, indicating its strength in lightweight feature extraction. Using C2f-T alone increased recall by 2.5% but reduced mAP@0.5 by 0.8%, indicating that without feature compression support, background noise is easily amplified. Combining MobileNetV3 with C2f-T raised mAP@0.5 to 89.9%, demonstrating the structural synergy between the two. The best performance was achieved when all three modules were used together, further validating the complementary nature of the overall architecture.
The model performance was quantitatively and qualitatively analyzed through comparative experiments and visualization of test results.
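As a minimal illustration of why replacing standard convolutions with depthwise separable ones reduces the parameter count, the following PyTorch sketch compares the two operations; the channel widths are arbitrary examples rather than the actual MTD-YOLO configuration.

```python
import torch
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

in_ch, out_ch = 128, 256  # example channel widths, not the MTD-YOLO values

# Standard 3x3 convolution.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

# Depthwise separable convolution: per-channel 3x3 depthwise conv + 1x1 pointwise conv.
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),  # depthwise
    nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),                          # pointwise
)

x = torch.randn(1, in_ch, 80, 80)
assert standard(x).shape == separable(x).shape          # identical output shape
print(count_params(standard), count_params(separable))  # 294912 vs. 33920 parameters (~8.7x fewer)
```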
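The attention block inside C2f-T can be summarized by the compact, self-contained PyTorch sketch below, which follows the publicly described Triplet Attention design (Z-Pool followed by a 7 × 7 convolution in three permuted branches); how the block is wired into the C2f bottlenecks is omitted here and detailed in Section 3.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and mean-pooling along the channel dimension -> 2 channels."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """Z-Pool -> 7x7 conv -> sigmoid gate, applied to one permutation of the tensor."""
    def __init__(self):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False),
            nn.BatchNorm2d(1),
        )
    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Three parallel branches capture (C,H), (C,W), and (H,W) interactions; outputs are averaged."""
    def __init__(self):
        super().__init__()
        self.cw = AttentionGate()   # channel-width branch
        self.hc = AttentionGate()   # height-channel branch
        self.hw = AttentionGate()   # plain spatial branch
    def forward(self, x):                                            # x: (B, C, H, W)
        x_cw = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)    # rotate C <-> H
        x_hc = self.hc(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)    # rotate C <-> W
        x_hw = self.hw(x)
        return (x_cw + x_hc + x_hw) / 3.0

feat = torch.randn(2, 64, 40, 40)
print(TripletAttention()(feat).shape)   # torch.Size([2, 64, 40, 40])
```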
The structure of the paper is organized as follows:
Section 2 describes the rice pest datasets and the data augmentation strategy;
Section 3 analyzes the original architecture of YOLOv8 and details the three improvements of MTD-YOLO: MobileNetV3 backbone, C2f-T cross-dimensional feature fusion, and Dynamic Head;
Section 4 presents a detailed evaluation of the model through comparative experiments, ablation studies, and visualization of test results;
Section 5 summarizes the research results and outlines future research directions.
2. Datasets
To improve the generalization performance and recognition accuracy of the pest detection model, this study adopts a two-stage dataset validation strategy based on two rice pest image datasets, Rice Pest1 and Rice Pest2, obtained from the Roboflow platform (available online: https://roboflow.com, accessed on 9 February 2025). The distribution of pest objects per category in the different sets is shown in Figure 1; both datasets maintain balanced category distributions internally. The images are divided into training, validation, and test sets in a 7:2:1 ratio, and each image has a resolution of 640 × 640 pixels and contains one to four target objects. Rice Pest1, the core training set, focuses on two pests that are prominent in rice production areas, the stem borer and the brown planthopper, and contains 2639 high-quality labeled images; sample training images are shown in Figure 2. Rice Pest2 contains 5564 multi-category samples covering 10 species of rice pests, specifically lepidopteran pests (stem borer, stickleback, rice leaf borer), hemipteran pests (brown planthopper, rice leafhopper, white-backed fly), coleopteran pests (bean scabbard fly, rice water weevil), arachnid pests (red spider mite), and nematode species (wheat root-knot nematode). This diversity makes it well suited to validating the model’s adaptability in cross-species recognition tasks; sample training images are shown in Figure 3.
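The 7:2:1 split is provided directly by the Roboflow export; for readers reproducing it from raw images, a minimal sketch is shown below (the directory name is a placeholder).

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, seed: int = 0):
    """Shuffle image paths and split them 7:2:1 into train/val/test lists."""
    paths = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(paths)
    n_train, n_val = int(0.7 * len(paths)), int(0.2 * len(paths))
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]

train, val, test = split_dataset("rice_pest1/images")  # placeholder directory name
```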
To account for the variable insect postures and complex lighting conditions of agricultural scenes, the datasets were processed with multiple data augmentation techniques: horizontal and vertical flips applied with 50% probability, random rotations within ±15°, brightness adjustments of ±66%, and exposure adjustments of ±25%. Because Rice Pest2 covers more species and is suited to evaluating model robustness in complex environments, a richer set of augmentation strategies was applied to it, whereas Rice Pest1 serves as the benchmark set and retains only the base augmentations to keep variables controlled and to facilitate comparison of the model improvements. The additional augmentations applied to Rice Pest2 include adding 4.22% noise to each image, applying random Gaussian blur of up to 25 px, and artificially generating raindrops to mimic adverse weather and thus stress detection under degraded imaging conditions. The raindrop generation parameters follow the China Meteorological Administration precipitation intensity rating standard (GB/T 28592-2012) [34]. Specifically, the intensity factor is uniformly sampled from the interval [0.3, 0.8], corresponding to moderate to torrential rainfall; the raindrop density ranges from 900 to 1900 drops/m² (with 1000 ± 100 drops/m² as a reference for moderate rain); and raindrop lengths are randomly generated between 19 and 34 pixels, optically equivalent to a physical diameter of 1.9 to 3.4 mm. This parameterized perturbation strategy is intended to enhance the model’s robustness to environmental variability and improve generalization in complex real-world scenarios.
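For readers who augment locally rather than on the Roboflow platform used here, a pipeline of this kind can be approximated with the Albumentations library. The sketch below is only an illustrative mapping of the listed settings: parameter names follow Albumentations 1.x and may differ in other versions, the exposure adjustment is approximated by a contrast term, and the raindrop transform only loosely corresponds to the GB/T 28592-2012-based generator described above.

```python
import albumentations as A

# Approximate augmentation pipeline for the Rice Pest2 training images (illustrative values).
train_aug = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.Rotate(limit=15, p=0.5),                     # random rotation within ±15°
        A.RandomBrightnessContrast(brightness_limit=0.66, contrast_limit=0.25, p=0.5),
        A.GaussNoise(p=0.3),                           # additive Gaussian noise
        A.GaussianBlur(blur_limit=(3, 7), p=0.2),      # mild random blur
        A.RandomRain(drop_length=25, blur_value=3,     # synthetic raindrops
                     brightness_coefficient=0.8, p=0.2),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# usage: augmented = train_aug(image=img, bboxes=boxes, class_labels=labels)
```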
4. Experimental Design and Analysis of Results
4.1. Experimental Environment and Parameters
Hardware Configuration: The processor is an Intel(R) Core(TM) i7-14650HX (Intel Corporation, Santa Clara, CA, USA) at 2.20 GHz with 32 GB of RAM, and the graphics card is an NVIDIA RTX 4060 (NVIDIA Corporation, Santa Clara, CA, USA) with 8 GB of video memory. Software Configuration: The operating system is Windows 11, the deep learning framework is PyTorch 3.10.1, the CUDA version is 12.4, and the programming language is Python 3.8.10.
Parameter settings: The input image size is 640 × 640, the number of training epochs is 100, the batch size is 8, and the initial learning rate is 0.01.
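Assuming the standard Ultralytics training interface, these hyperparameters correspond to a call of the following form; "mtd-yolo.yaml" and "rice_pest1.yaml" are placeholder names standing in for the modified model definition and the dataset configuration file.

```python
from ultralytics import YOLO

# Hypothetical training invocation mirroring the hyperparameters listed above.
model = YOLO("mtd-yolo.yaml")   # placeholder for the MTD-YOLO model definition
model.train(
    data="rice_pest1.yaml",     # placeholder dataset configuration
    imgsz=640,                  # input image size
    epochs=100,                 # training epochs
    batch=8,                    # batch size
    lr0=0.01,                   # initial learning rate
    device=0,                   # single RTX 4060 GPU
)
```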
4.2. Evaluation Metrics
The evaluation is based on four widely used metrics: Precision (P), Recall (R), Average Precision (AP), and mean Average Precision (mAP). Precision measures the ratio of true positive predictions to the total number of predicted positives, while Recall represents the ratio of true positives to the total number of actual positive samples. Average Precision is defined as the area under the Precision–Recall curve for a single class, reflecting the model’s performance in that specific category. Meanwhile, mean Average Precision is the mean value of AP across all categories, providing an overall performance metric for multi-class detection tasks. The mAP@0.5 metric refers to the mean AP computed using an Intersection over Union (IoU) threshold of 0.5. This metric is widely used in object detection tasks and serves as a reliable indicator of a model’s ability to accurately locate targets. In addition to mAP@0.5, this study also reports mAP@[0.5:0.95], which represents the average mAP calculated at multiple IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. This stricter and more comprehensive evaluation metric allows for a more rigorous assessment of the model’s performance across varying levels of localization precision. The definitions of the formulas used are as follows:
P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad AP = \int_{0}^{1} P(R)\,dR, \quad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_{i}
where TP (True Positives) denotes the number of samples correctly predicted as positive by the model, FP (False Positives) denotes the number of negative samples incorrectly predicted as positive, FN (False Negatives) denotes the number of positive samples incorrectly predicted as negative, and N denotes the number of categories.
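As a minimal, library-free illustration of these definitions (not the evaluation code used in the experiments), the following NumPy sketch computes precision, recall, and the all-point-interpolated AP from a precision–recall curve; mAP@0.5 is then the mean of the per-class APs computed at an IoU threshold of 0.5.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]      # make precision monotonically decreasing
    idx = np.where(r[1:] != r[:-1])[0]            # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP = np.mean([average_precision(r_c, p_c) for r_c, p_c in per_class_curves])
```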
4.3. Experimental Demonstration
All model training and evaluation were conducted on an NVIDIA RTX 4060 GPU under consistent hyperparameter settings. As shown in Table 3, on the Rice Pest1 dataset MTD-YOLO improves on the baseline YOLOv8 across all performance metrics: Precision rises from 90.9% to 92.5%, Recall from 87.2% to 90.1%, mAP@0.5 from 85.8% to 90.0%, and mAP@[0.5:0.95] from 66.5% to 67.8%, gains of 1.6, 2.9, 4.2, and 1.3 percentage points, respectively. In addition, Table 4 compares the average precision for the two pest categories: the AP of Penggerek Batang padi kuning improved from 95.2% to 96.8% and that of Wereng Coklat from 76.4% to 83.2%, indicating that the model’s ability to detect the harder-to-identify category has been significantly enhanced.
On the Rice Pest2 dataset, MTD-YOLO also shows clear performance advantages. As shown in Table 5, precision improves from 93.1% to 95.6%, recall from 90.6% to 92.8%, mAP@0.5 from 94.2% to 96.6%, and mAP@[0.5:0.95] from 80.7% to 82.5%, gains of 2.5, 2.2, 2.4, and 1.8 percentage points, respectively. Moreover, Table 6 presents the average precision for the ten individual pest categories, all of which show consistent improvements. Among them, the already high APs of red spider and rice gall midge were further raised to 99.5%, while the relatively weak baseline performances on yellow rice borer and rice leaf roller were also steadily improved. This verifies the robustness and generalization ability of the method under multi-category and complex background conditions.
To assess the stability of the model’s performance, we conducted five independent training runs on the Rice Pest1 dataset, as shown in Figure 10. The mean, standard deviation, and 95% confidence interval of the key metrics are reported in Table 7. Specifically, the model achieved a precision of 92.46% ± 0.34 (95% CI: [92.04%, 92.88%]) and a recall of 90.12% ± 0.29 (95% CI: [89.76%, 90.48%]). The mAP@0.5 was 90.02% ± 0.36 (95% CI: [89.58%, 90.46%]), while mAP@[0.5:0.95] reached 67.80% ± 0.32 (95% CI: [67.42%, 68.18%]). These results demonstrate that the proposed MTD-YOLO model exhibits strong consistency and robustness across repeated experiments.
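The reported intervals appear consistent with a t-distribution-based confidence interval over the five runs; a sketch of that computation is given below, with hypothetical metric values in place of the actual per-run results.

```python
import numpy as np
from scipy import stats

def mean_std_ci(values, confidence=0.95):
    """Mean, sample standard deviation, and a t-based confidence interval for a small sample."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std(ddof=1)                        # sample standard deviation
    sem = std / np.sqrt(len(values))                # standard error of the mean
    half = stats.t.ppf((1 + confidence) / 2, df=len(values) - 1) * sem
    return mean, std, (mean - half, mean + half)

# hypothetical mAP@0.5 values from five runs (illustrative, not the actual run results)
print(mean_std_ci([89.7, 90.3, 90.0, 89.9, 90.2]))
```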
4.4. Ablation Experiments
To assess the individual contributions of the MobileNetV3, DyHead, and C2f-T modules within the YOLOv8 framework, ablation experiments were conducted on the Rice Pest1 dataset, as shown in Table 8. Replacing the original backbone with MobileNetV3 alone significantly reduces the parameter count while achieving an mAP@0.5 of 88.3%, demonstrating the effectiveness of its inverted residual structure in compressing features without sacrificing accuracy. Introducing the DyHead module increases recall by 2.0%, although it considerably increases the parameter count, suggesting that the enhanced multiscale detection capability comes at the cost of greater computational complexity. Applying the C2f-T module in isolation results in an unexpected 0.8% drop in mAP@0.5, indicating that its cross-stage fusion design is most effective when supported by the multi-level features extracted by MobileNetV3. When all three modules are integrated, the model achieves peak performance with 92.5% precision, 90.1% recall, and 90.0% mAP@0.5. The experiments show that the performance improvement comes from the complementary design of the three components: MobileNetV3 achieves efficient feature compression, DyHead enhances the multi-scale target response, and C2f-T optimizes cross-level semantic fusion.
To visually validate the effectiveness of the key modules introduced in this study, Figure 11 shows how the detection confidence for a specific target (Penggerek Batang padi kuning) changes as each core improvement module is progressively added to the model. (a) Base model: the confidence score is 89%, indicating that the unmodified YOLOv8 model exhibits a baseline level of detection capability. (b) +MobileNetV3: confidence increases to 92%, demonstrating that the lightweight backbone improves feature extraction efficiency for the target object. (c) +C2f-T: adding the C2f-T module further increases confidence to 93%; the integrated Triplet Attention mechanism rotates tensor dimensions and applies Z-Pool to model spatial–channel dependencies, sharpening the model’s focus on discriminative features while suppressing background noise. (d) +Dynamic Head: with the Dynamic Head module, confidence rises further to 95%; this module enables adaptive multiscale feature learning via scale-aware, spatial-aware, and task-aware attention, resulting in more confident and precise detection.
This visualization result strongly validates the lightweight feature extraction capability of the MobileNetV3 backbone network, the effectiveness of Triplet Attention in enhancing feature discriminability, and the role of Dynamic Head in optimizing prediction accuracy. Working in tandem, these three components collectively form an efficient and precise object detection framework, significantly enhancing the model’s detection confidence and reliability in complex backgrounds.
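Of the three attentions in Dynamic Head, the scale-aware component is the easiest to illustrate in isolation. The sketch below is a simplified stand-in written for this description, not the actual DyHead implementation: it resizes the pyramid levels to a common resolution and learns a per-level gating weight, while the spatial-aware (deformable convolution) and task-aware branches are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareAttention(nn.Module):
    """Simplified scale-aware attention: one learned weight per pyramid level,
    derived from globally pooled features, rescales that level's responses."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats):                       # list of (B, C, Hi, Wi) pyramid levels
        target = feats[len(feats) // 2].shape[-2:]  # resize all levels to the middle resolution
        stacked = torch.stack(
            [F.interpolate(f, size=target, mode="bilinear", align_corners=False) for f in feats],
            dim=1,
        )                                           # (B, L, C, H, W)
        B, L, C, H, W = stacked.shape
        pooled = stacked.mean(dim=(3, 4))           # (B, L, C): global average over space
        weight = torch.sigmoid(self.fc(pooled.view(B * L, C, 1, 1))).view(B, L, 1, 1, 1)
        return stacked * weight                     # per-level reweighted features

levels = [torch.randn(1, 64, s, s) for s in (80, 40, 20)]
print(ScaleAwareAttention(64)(levels).shape)        # torch.Size([1, 3, 64, 40, 40])
```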
4.5. Model Comparison Experiments
To demonstrate the effectiveness of the improved algorithm, comparisons were made with mainstream object detection algorithms, including Faster R-CNN, SSD, YOLOv7 [50], YOLOv5 [51], YOLOv3 [52], YOLOv6 [53], YOLOv9, YOLOv10, and YOLOv11, under the same experimental environment. As shown in Table 9, MTD-YOLO leads all comparison models with 92.1% precision and 90.1% recall, significantly better than YOLOv5 through YOLOv11, while its mAP@0.5 is 27.6 percentage points higher than that of the two-stage Faster R-CNN and 1.7 percentage points higher than that of the next best model, YOLOv3-tiny. This demonstrates that MTD-YOLO has clear advantages in the rice pest detection task.
4.6. Pest Detection
In this study, the superiority of the improved MTD-YOLO model over the baseline YOLOv8 in the rice pest detection task is verified through four sets of comparative test images. As shown in Figure 12, MTD-YOLO achieves a 1–5% confidence improvement in complex scenarios involving multiple viewing angles and varying numbers of targets. Furthermore, in densely grouped target scenarios, the improved model exhibits stronger discriminative capability. These results highlight the potential of MTD-YOLO for practical deployment in real-world agricultural environments.
To further explore differences in attention regions between models during pest identification, this study employed Grad-CAM to visualize and compare the focus areas of the original YOLOv8 and the proposed MTD-YOLO. The visualization was configured as follows: the detection layer was set to [10] and the confidence threshold to 0.2. Regions highlighted in red indicate higher model attention or activation.
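Grad-CAM itself is model-agnostic; the sketch below illustrates the underlying computation on a generic torchvision classifier rather than the exact tooling used here, since applying it to a detector additionally requires reducing the detection output to a scalar score (for example, the summed confidence of detections above the 0.2 threshold mentioned above).

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

# Generic classifier as a stand-in; for a detector the output must first be reduced to a scalar.
model = resnet18(weights=None).eval()
target_layer = model.layer4[-1]                 # analogous to selecting detection layer [10]

feats = {}
target_layer.register_forward_hook(lambda module, inputs, output: feats.update(a=output))

x = torch.randn(1, 3, 224, 224)                 # stand-in for a preprocessed field image
score = model(x).max()                          # scalar whose evidence we want to localize
grads = torch.autograd.grad(score, feats["a"])[0]

weights = grads.mean(dim=(2, 3), keepdim=True)                   # channel weights = averaged gradients
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))    # weighted sum of activation maps
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)         # normalize; red overlay = high values
```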
As shown in Figure 13, the MTD-YOLO model produces more localized and focused activation regions than the baseline YOLOv8, accurately attending to pest targets while minimizing activation over background noise. This indicates improved attention precision and enhanced model robustness in complex visual scenes.
5. Conclusions
This study addresses the inefficiency of traditional manual inspection and the low accuracy of existing detection methods in complex farmland environments by proposing a novel rice pest detection model, MTD-YOLO. The model incorporates MobileNetV3 as the backbone network, integrates the Triplet Attention mechanism into the C2f module, and replaces the original detection head with Dynamic Head. These improvements collectively construct an efficient and accurate detection framework. Two high-quality datasets covering 12 major rice pests were used, and multiple augmentation strategies (including Gaussian blurring, noise injection, raindrop simulation, and lighting adjustment) were employed to improve adaptability to complex farmland scenes. Experimental results demonstrate that MTD-YOLO significantly improves detection accuracy under complex agricultural conditions, effectively overcoming the limitations of existing approaches. Experiments on the Rice Pest1 and Rice Pest2 datasets further validate the model’s effectiveness, with mAP@0.5 improvements of 4.2 and 2.4 percentage points, respectively, over the baseline model.
Future work will focus on enhancing hardware-aware algorithm co-design, expanding the range of pest categories to improve model generalization, and exploring multimodal data fusion to strengthen feature representation. These efforts aim to develop a more robust and deployable pest detection system, thereby supporting intelligent agricultural management.