Article

YOLOv-MA: A High-Precision Foreign Object Detection Algorithm for Rice

Jiahui Wang, Mengdie Jiang, Tauseef Abbas, Hao Chen and Yuying Jiang
1 School of Artificial Intelligence and Big Data, Henan University of Technology, Zhengzhou 450001, China
2 School of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(13), 1354; https://doi.org/10.3390/agriculture15131354
Submission received: 27 March 2025 / Revised: 30 May 2025 / Accepted: 23 June 2025 / Published: 25 June 2025
(This article belongs to the Section Digital Agriculture)

Abstract

Rice plays a crucial role in global agricultural production, but foreign objects are often introduced during its processing. To efficiently and accurately detect small foreign objects in the rice processing pipeline and thereby ensure food quality and consumer safety, this study proposes YOLOv-MA, a deep learning-based foreign object detection algorithm for rice. The proposed algorithm adaptively enhances multi-scale feature representation across the small, medium, and large object detection layers by incorporating the multi-scale dilated attention (MSDA) mechanism. Additionally, the adaptive spatial feature fusion (ASFF) module is employed to improve multi-scale feature fusion in rice foreign object detection, significantly boosting YOLOv8's object detection capability in complex scenarios. Compared with the original YOLOv8 model, the improved YOLOv-MA model achieves gains of 3%, 3.5%, 2%, 3.9%, and 4.2% in mean average precision (mAP@[0.5:0.95]) for clods, corn, screws, stones, and wheat, respectively. The overall mAP@[0.5:0.95] reaches 90.8%, an improvement of 3.3%. Furthermore, the proposed model outperforms SSD, FCOS, EfficientDet, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9, YOLOv11, and YOLOv12 in overall performance. The model therefore not only reduces the burden of manual inspection but also provides an efficient, high-precision solution for rice foreign object detection.

1. Introduction

According to the latest data released by the Food and Agriculture Organization of the United Nations (FAO), as of 2024, global rice production ranks second among food crops, after wheat. This highlights the importance of rice in global agricultural production, especially in Asia, where rice is a major part of people’s daily diet. However, during processing, transportation, and storage of rice, foreign objects such as stones, metal fragments, other grains, and clods are often introduced. This not only affects the appearance and texture of rice but also poses a threat to consumer health [1]. In addition, the varietal purity of rice seeds—which guarantees the genetic quality of the seed, ensuring that it belongs to a single rice variety—is also a critical concern in agricultural production. Accurate foreign object detection can contribute to identifying and preserving such purity during the seed selection and processing stages. Therefore, ensuring rice is free from harmful foreign objects is essential for safeguarding food safety and enhancing product quality [2], and it plays an important role in identifying and maintaining the varietal purity of rice seeds during seed production. Traditional manual screening methods are inefficient and prone to human error, making them inadequate for modern large-scale production. With rising consumer concerns about food quality and safety, and increasingly stringent food safety requirements in international markets, producers are under pressure to ensure that their products meet quality standards and regulations. In this context, efficiently and accurately detecting and rejecting foreign matter in rice has become a key challenge in food safety.
In recent years, foreign object detection in agricultural products has become a critical research area in food safety and quality control. Various methods have been proposed to improve detection efficiency and accuracy, including hyperspectral imaging, terahertz spectroscopy, and image processing techniques. For instance, Saeidan et al. [3] identified wood chips, plastics, stones, and plant debris in cocoa beans using hyperspectral imaging combined with principal component analysis (PCA) and support vector machines (SVM), achieving a test accuracy of 81.28%. Wang et al. [4] applied terahertz spectral imaging for nondestructive detection of shell contaminants in walnuts, with classification accuracy exceeding 95%. Yang et al. [5] enhanced the detection of foreign fibers in cotton using an improved image processing pipeline that integrates image enhancement, Otsu segmentation, and morphological post-processing.
Despite promising results in specific scenarios, foreign object detection in agricultural contexts still faces several challenges. First, foreign materials often closely resemble crops in color and shape, making traditional feature-based algorithms less robust and limiting their generalization capability. Second, small-sized foreign objects occupy minimal image area and are susceptible to noise, leading to frequent missed or false detections. Additionally, variations in illumination and background clutter in complex environments further degrade detection performance. Therefore, enhancing detection accuracy and robustness—particularly for small objects under challenging conditions—remains a key issue in the field.
With the rapid development of deep learning, convolutional neural networks (CNNs) have become the dominant technology in object detection, particularly because of their ability to extract features automatically from raw data. Using multiple convolutional and pooling layers, CNNs capture local image features and abstract complex patterns layer by layer. In addition to the YOLO series [6] (including YOLO, YOLO9000, YOLOv3, YOLOv4, YOLOv5 [7], YOLOX, YOLOv6 [8], YOLOv7 [9], etc.), deep learning detectors such as Faster R-CNN [10], Mask R-CNN [11], and SSD have been widely used in real-time object detection and image classification tasks. Through end-to-end learning, these methods automatically learn and identify key features in images without manual feature design, providing fast and accurate target localization and recognition.
In particular, in the detection of foreign objects in rice, such as stones and metal fragments, traditional detection methods often struggle to meet the requirements for precise identification, thereby affecting quality control. Therefore, improving the detection accuracy of targets in complex backgrounds has become a hot topic in current research. Many researchers have significantly improved target detection accuracy by employing data augmentation, anchor box optimization, multi-scale feature fusion, and enhanced attention mechanisms. For example, Yang et al. [12] proposed KPE-YOLOv5, which optimizes anchor box distribution using K-means++ clustering and integrates the scSE attention module with small target detection layers, significantly improving target localization and classification accuracy. Wang et al. [13] proposed UAV-YOLOv8, which incorporates Wise-IoU v3 bounding box regression loss and the BiFormer attention mechanism to optimize localization accuracy and feature focus in object detection. Alhawsawi et al. [14] proposed an improved YOLOv8-based framework that improves multi-scale contextual information capture by incorporating a context enrichment module (CEM). Meng et al. [15] proposed the YOLOv7-MA model, which incorporates micro-scale detection layers and the convolutional block attention module (CBAM) to improve the detection accuracy of wheat spikes in complex backgrounds. Although these methods have achieved success in many applications, they still face challenges in detecting foreign objects in rice under complex backgrounds.
To address the aforementioned challenges, this paper proposes an innovative deep learning-based algorithm, YOLOv-MA, specifically designed for the efficient detection of foreign objects in rice. By incorporating the multi-scale dilated attention (MSDA) mechanism, the model enhances feature representation across different object scales, enabling adaptive handling of small, medium, and large foreign objects. Additionally, the adaptive spatial feature fusion (ASFF) module is employed to further optimize the multi-scale feature fusion process, significantly enhancing the model’s detection accuracy and robustness in complex backgrounds [16]. The main contributions of this paper are listed as follows:
  • A YOLOv8-based model is proposed, in which the multi-scale dilated attention (MSDA) and adaptive spatial feature fusion (ASFF) modules are integrated. This represents the first attempt to combine both modules for rice foreign object detection, and the model’s ability to identify small and irregular contaminants in complex scenes is effectively enhanced.
  • A high-quality rice foreign object detection dataset was constructed with detailed annotations. A diverse range of contaminants, such as stones, metal fragments, and clods, is included, providing a reliable benchmark for future research.
  • Extensive experiments were conducted to evaluate the proposed method. The results demonstrate that the model is able to outperform several mainstream detection algorithms in terms of accuracy and robustness, proving its effectiveness in practical applications.

2. Materials and Methods

2.1. Data Acquisition and Preprocessing

The successful training of the rice foreign object detection model relies on high-quality, representative, and diverse image data. To construct a reliable dataset, the rice and foreign object samples in this study were obtained from the Key Laboratory of Grain Information Processing and Control at the Ministry of Education, Henan University of Technology. Based on the Chinese national standard GB/T 1354-2018 [17] (Rice), this study selected common types of foreign objects found in rice, including stones, clods, metal fragments, screws, corn kernels, and wheat.
In this study, approximately 5000 images for rice foreign object detection were collected, with the dataset split into a training set and a validation set in a ratio of 8:2. This ratio was chosen as a commonly accepted practice in deep learning to provide a sufficient amount of data for training while reserving enough data for reliable performance evaluation. The dataset was randomly split to ensure that both the training and validation sets are representative of the overall data distribution. All images were captured using a high-resolution camera against a uniform background and under varying lighting conditions. Manual annotation was performed using Labelme software (version 4.5.13) to ensure data accuracy and annotation quality. To enhance training efficiency and reduce computational costs, all images underwent standardized preprocessing, including resizing to 384 × 384 pixels to ensure consistency of input data. Furthermore, size normalization helps mitigate feature scale variation caused by differences in image resolution, thereby improving the model’s robustness and detection performance [18]. On this basis, we also applied data augmentation techniques, such as converting images to grayscale, increasing brightness, adjusting image contrast, and adding Gaussian noise, to further improve the model’s generalization ability and robustness. This dataset serves as a stable and high-quality input source for training subsequent deep learning models. To demonstrate the applied preprocessing techniques, Figure 1 presents several representative data augmentation results derived from a single image in the rice foreign object dataset.
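As an illustration of the preprocessing described above, the following minimal Python sketch resizes an image to 384 × 384 pixels and produces the four augmentation variants shown in Figure 1 (grayscale, brightness, contrast, Gaussian noise). It assumes OpenCV and NumPy; the file name and the specific brightness/contrast/noise magnitudes are placeholders rather than the exact values used to build the dataset.

```python
# Minimal preprocessing/augmentation sketch (OpenCV + NumPy); parameter values are illustrative.
import cv2
import numpy as np

def preprocess(path, size=384):
    """Resize an image to the fixed 384x384 input resolution."""
    img = cv2.imread(path)  # BGR image
    return cv2.resize(img, (size, size), interpolation=cv2.INTER_LINEAR)

def augment(img, seed=0):
    """Return the four augmented variants used for the rice dataset (Figure 1)."""
    rng = np.random.default_rng(seed)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)            # grayscale conversion
    gray = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)           # back to 3 channels for the detector input
    bright = cv2.convertScaleAbs(img, alpha=1.0, beta=40)   # brightness increase
    contrast = cv2.convertScaleAbs(img, alpha=1.4, beta=0)  # contrast adjustment
    noise = rng.normal(0, 15, img.shape)                    # additive Gaussian noise
    noisy = np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    return gray, bright, contrast, noisy

# Usage: variants = augment(preprocess("rice_sample.jpg"))
```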

2.2. Evaluation Index

In the task of rice foreign object detection, evaluating model performance requires a comprehensive consideration of detection accuracy, inference speed, and model complexity. Therefore, this study employs precision, recall, average precision (AP), and mean average precision (mAP) as evaluation metrics to assess the accuracy of the detection model. In addition, to measure the computational complexity and inference efficiency of the model, this study adopts GFLOPs and the number of parameters as metrics for model complexity, ensuring that the model maintains high computational efficiency and deployability while achieving accurate detection [19].
Precision is the proportion of correctly predicted positive samples among all predicted positive samples, calculated using Equation (1):
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$ (1)
Recall is the proportion of all true foreign object samples that are correctly detected and is calculated using Equation (2):
$\mathrm{Recall} = \dfrac{TP}{TP + FN}$ (2)
where TP represents the number of foreign object samples correctly detected, FP represents the number of normal rice samples incorrectly detected as foreign objects, and FN represents the number of foreign object samples not correctly detected.
Average precision (AP) is the area under the precision–recall curve, which measures the overall detection capability of the model across different recall levels. It is calculated using Equation (3):
$AP = \int_{0}^{1} \mathrm{Precision} \; d(\mathrm{Recall})$ (3)
Mean Average Precision (mAP) is the average of the AP values across all categories. In this study, it represents the mean AP of five categories: stones, clods, screws, corn kernels, and wheat. It is calculated using Equation (4):
$mAP = \dfrac{1}{N} \sum_{i=1}^{N} AP_i$ (4)
where $N$ represents the total number of categories, which in this study correspond to the different types of foreign objects in rice, and $AP_i$ is the AP of the $i$-th category. Additionally, this study adopts mAP@[0.5:0.95] as the core evaluation metric. This metric averages the mAP over different IoU thresholds (ranging from 0.5 to 0.95 with a step size of 0.05), providing a more comprehensive assessment of the model's detection capability under varying matching criteria [20].
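To make the metric definitions concrete, the sketch below computes per-class AP by step-wise integration of the precision–recall curve and averages the per-class values into mAP. It assumes detections have already been matched to ground truth at a fixed IoU threshold, so it is an illustration of Equations (1)–(4) rather than the exact evaluation code used in this study.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP (Equation (3)) for one class at one IoU threshold, given per-detection
    confidence scores, true/false-positive flags, and the ground-truth count."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    flags = np.asarray(is_tp, dtype=float)[order]
    tp, fp = np.cumsum(flags), np.cumsum(1.0 - flags)
    recall = np.concatenate(([0.0], tp / max(num_gt, 1)))                 # Equation (2)
    precision = np.concatenate(([1.0], tp / np.maximum(tp + fp, 1e-12)))  # Equation (1)
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))      # step integration of the PR curve

def mean_average_precision(ap_per_class):
    """mAP (Equation (4)): the mean of the per-class AP values."""
    return float(np.mean(ap_per_class))

# mAP@[0.5:0.95] averages mean_average_precision over IoU thresholds 0.50, 0.55, ..., 0.95.
```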
To further evaluate the computational complexity of the model, this study introduces the concepts of GFLOPs and the number of parameters as evaluation metrics. GFLOPs represent the amount of computation required during inference; a lower GFLOPs value indicates lower computational cost, making the model more suitable for deployment on embedded devices or in low-computing-power environments. The number of parameters represents the total number of trainable parameters within the model. A model with fewer parameters typically achieves faster inference speed and reduced storage requirements but may compromise detection accuracy. In this study, the network structure is optimized to balance the parameter count and detection performance, ensuring that the model maintains high accuracy while minimizing computational cost [21].
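The following sketch shows one way these complexity metrics can be obtained for a PyTorch model. The thop profiler is an assumed optional dependency used only for illustration, not something reported as part of the original work.

```python
# Parameter count and rough GFLOPs estimate for a PyTorch model (thop is an assumed helper).
import torch

def count_parameters(model):
    """Trainable parameters in millions (as reported in Tables 2, 4 and 5)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

def estimate_gflops(model, imgsz=384):
    """Rough GFLOPs for one 384x384 input; returns None if thop is not installed."""
    try:
        from thop import profile
        dummy = torch.zeros(1, 3, imgsz, imgsz)
        macs, _ = profile(model, inputs=(dummy,), verbose=False)
        return 2 * macs / 1e9  # 1 multiply-accumulate ~ 2 FLOPs
    except ImportError:
        return None
```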

2.3. YOLOv8

Since its introduction, the YOLO model has been a pioneering representative of object detection technology in the field of computer vision. Its outstanding performance and efficiency have led to its widespread adoption in both academia and industry. With the continuous advancement of technology, the YOLO series has undergone multiple optimizations and upgrades, with each generation of models introducing improvements in accuracy, speed, and application scenarios. In 2023, Ultralytics released YOLOv8, marking another significant milestone in the YOLO series. Compared to its predecessors, such as YOLOv5 and YOLOv7, YOLOv8 introduced breakthroughs in multiple aspects, particularly excelling in both accuracy and efficiency. As a result, it has become a leading choice in the field of object detection. YOLOv8 offers five different network architectures—YOLOv8-n, YOLOv8-s, YOLOv8-m, YOLOv8-l, and YOLOv8-x—as illustrated in Figure 2. These architectures provide a unified and efficient solution for various computer vision tasks, including object detection, instance segmentation, and image classification [22].
The architecture of the YOLOv8 model consists of four core components: the input layer, backbone network layer, neck structure layer, and output layer. The input layer is optimized through various data augmentation techniques, such as mosaic augmentation, dynamic anchor box calculation, and grayscale padding, enhancing the model’s robustness across diverse environments and data variations. Next, the backbone network extracts high-level feature information from the input image. It incorporates techniques such as convolutional modules (Conv), the C2f module, and Spatial Pyramid Pooling Faster (SPPF). These components enable the capture of both global context and local details, providing a comprehensive feature representation for subsequent processing. The neck structure layer serves as a bridge between the backbone network and the output layer. It effectively integrates multi-scale information through structures such as the feature pyramid network (FPN) and the path aggregation network (PAN), enhancing the model’s adaptability to multi-scale objects and thereby improving detection accuracy. Finally, the output layer generates the final detection results and applies the non-maximum suppression (NMS) algorithm to eliminate redundant bounding boxes, ensuring precise and reliable outputs. The deep learning framework of YOLOv8 leverages the powerful capabilities of convolutional neural networks (CNNs) to comprehensively analyze input images and accurately generate object locations, classifications, and confidence scores.
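As a concrete illustration of the redundant-box suppression step mentioned above, the sketch below implements plain IoU-based NMS; it is a generic textbook version for boxes in (x1, y1, x2, y2) format, not the exact routine inside YOLOv8.

```python
# Minimal IoU-based non-maximum suppression sketch.
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    order = scores.argsort()[::-1]      # process highest-confidence boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # intersection of box i with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-12)
        order = rest[iou < iou_thresh]  # drop boxes that overlap too much with box i
    return keep
```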

2.4. Improved YOLOv8 Network

Although YOLOv8 has demonstrated outstanding accuracy and efficiency in object detection tasks, making it one of the leading detection models, it still faces challenges in specific scenarios. Taking rice foreign object detection as an example, the wide variety of foreign objects, their significant size differences, and the complex background create challenges. YOLOv8 occasionally fails to detect small targets or misclassifies low-contrast objects. These issues may affect the accuracy and reliability of detection results in real-world applications. To address these limitations, this study proposes an improved method based on YOLOv8 by introducing the multi-scale dilated attention (MSDA) mechanism and the adaptive spatial feature fusion (ASFF) module. These enhancements aim to improve object detection accuracy and strengthen feature fusion in complex backgrounds. The structure of the improved YOLOv8 network is shown in Figure 3.
This study primarily focuses on optimizing the detection head of the YOLOv8 model. The original YOLOv8 model relies on conventional concatenation operations (Concat) and a standard detection module (Detect); however, it often exhibits insufficient detection accuracy when handling multi-scale objects, particularly in the detection of small and low-contrast targets. To address this issue, this study introduces two key enhancements to the model’s Neck and Head sections: the multi-scale dilated attention (MSDA) mechanism and the adaptive spatial feature fusion (ASFF) module.
First, the MSDA mechanism is incorporated into the Neck section. This mechanism combines dilated convolutions with an attention mechanism to enhance the feature representations of multi-scale targets across small, medium, and large object detection layers adaptively. By applying the MSDA mechanism at various detection scales, the model is able to capture the details of different targets more precisely, significantly improving its ability to detect objects, particularly under low-contrast conditions and complex backgrounds.
Secondly, in the Head section, the conventional Detect module is replaced with the Detect_ASFF module. The ASFF module adaptively learns the fusion weights of feature maps at different scales, addressing the issue of inconsistent feature scales inherent in traditional methods. This process significantly enhances detection accuracy for multi-scale objects. This improvement is particularly notable in scenarios with large variations in object sizes and complex backgrounds.

2.4.1. Multi-Scale Dilated Attention (MSDA)

Traditional convolutional neural networks are constrained by a fixed receptive field in object detection, making it challenging to capture multi-scale targets effectively, particularly for small and low-contrast objects [23]. The multi-scale dilated attention (MSDA) mechanism is a hybrid module that combines dilated convolution with multi-head self-attention. It is designed to enhance feature representation in complex scenarios through multi-scale receptive field modeling and a sparse attention mechanism [24]. Its core improvement mechanisms are as follows:
1. Multi-Scale Receptive Field Modeling with Dilated Convolution
Traditional convolution operations are limited by a fixed-size local receptive field (e.g., 3 × 3 or 5 × 5), which hinders the simultaneous capture of local details and global semantics. MSDA addresses this limitation by introducing differentiated dilation rates in different attention heads to construct multi-scale feature extraction branches.
For an input feature map $X \in \mathbb{R}^{H \times W \times C}$, query (Q), key (K), and value (V) matrices are first generated by linear projection. The channel dimension is divided into $n$ heads, and each head independently processes a subset of features. A different dilation rate $r_i$ is set for the convolution operation in each head. The mathematical form of the dilated convolution is shown in Equation (5).
$\mathrm{DConv}(X, r)_i = \sum_{k} G_k \, X_{i + r \cdot k}$ (5)
where $G_k$ is the convolutional kernel weight and $r$ is the dilation rate. For example, three dilation rates ($r$ = 1, 2, 3) correspond to receptive fields of 3 × 3, 5 × 5, and 7 × 7. A 3 × 3 kernel at $r$ = 2 covers the receptive field of a standard 5 × 5 convolution but requires only 9 parameters. This approach expands the receptive field while significantly reducing computation, allowing the model to capture details while enhancing global semantic expression. The model principle of MSDA is shown in Figure 4.
2. Sparse Attention Mechanism and Sliding Window
To reduce the $O(N^2)$ computational complexity of global self-attention, MSDA employs sliding window dilated attention (SWDA), which computes attention weights only within local windows. For a position $(i, j)$, a sliding window $W_{i,j}$ centered at that position is defined, and attention scores are computed only over the keys $K_{W_{i,j}}$ and values $V_{W_{i,j}}$ inside the window, as shown in Equation (6).
$\mathrm{Attention}(Q_{i,j}, K_{W_{i,j}}) = \mathrm{Softmax}\!\left( \dfrac{Q_{i,j} K_{W_{i,j}}^{T}}{\sqrt{d_k}} \right)$ (6)
In this way, MSDA reduces the computational complexity to $O(N k^2)$ (where $k$ is the window size) while retaining the semantic relevance of local regions.
3. Channel–Spatial Dual-Path Attention Enhancement
MSDA introduces dual-path attention (DPA), which integrates channel attention and spatial attention mechanisms to further enhance feature representation capability.
Channel attention: channel weights are generated using global average pooling (GAP) to suppress irrelevant channels.
$\omega_c = \sigma(\mathrm{MLP}(\mathrm{GAP}(X)))$
Spatial attention: spatial dependencies are extracted using dilated convolutions to focus on the target region.
$\omega_s = \sigma(\mathrm{DConv}(X, r))$
The final output is as follows:
$X_{\mathrm{out}} = (\omega_c \otimes \omega_s) \otimes X$
Here, $\sigma$ denotes the sigmoid function, and $\otimes$ denotes element-wise multiplication. By simultaneously incorporating channel and spatial attention mechanisms, MSDA optimizes feature representation across multiple dimensions, enhancing the model's focus on important feature regions. This approach demonstrates significant advantages, particularly in object detection tasks involving complex backgrounds.
4. Head Feature Aggregation and Pruning Optimization
In multi-head feature aggregation, the outputs from each head, $h_1, h_2, \ldots, h_n$, are integrated through concatenation and linear projection.
$X_{\mathrm{fused}} = \mathrm{Linear}(\mathrm{Concat}(h_1, h_2, \ldots, h_n))$
Meanwhile, MSDA applies channel pruning, performing dynamic channel pruning on the features after dilated convolution (with a pruning rate of 15%). This significantly reduces computational cost while preserving model accuracy.
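The simplified PyTorch sketch below combines the main ingredients discussed in this subsection: per-head 3 × 3 dilated convolutions with dilation rates 1, 2, and 3, channel and spatial attention gating, and a final 1 × 1 projection for head aggregation. It is an illustrative approximation under these assumptions, not the authors' exact MSDA implementation; the sliding-window attention and channel pruning steps are omitted for brevity.

```python
# Simplified MSDA-style block: multi-scale dilated heads + dual-path attention + head fusion.
import torch
import torch.nn as nn

class SimplifiedMSDA(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        assert channels % len(dilations) == 0
        self.head_dim = channels // len(dilations)
        # One 3x3 depthwise conv per head; dilation r gives an effective receptive
        # field of 3, 5, 7 for r = 1, 2, 3 (Equation (5)).
        self.branches = nn.ModuleList([
            nn.Conv2d(self.head_dim, self.head_dim, 3, padding=r, dilation=r,
                      groups=self.head_dim)
            for r in dilations
        ])
        # Channel attention: GAP -> MLP -> sigmoid.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )
        # Spatial attention: dilated conv -> sigmoid.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, 1, 3, padding=2, dilation=2), nn.Sigmoid()
        )
        # Head aggregation via 1x1 projection.
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        heads = torch.split(x, self.head_dim, dim=1)
        x = torch.cat([branch(h) for branch, h in zip(self.branches, heads)], dim=1)
        x = x * self.channel_mlp(x) * self.spatial(x)  # channel and spatial gating
        return self.proj(x)

# Usage: SimplifiedMSDA(96)(torch.randn(1, 96, 48, 48)).shape -> (1, 96, 48, 48)
```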

2.4.2. Adaptive Spatial Feature Fusion (ASFF)

In the YOLOv8 model, the path aggregation network (PANet) typically serves as the feature fusion module, employing both bottom-up and top-down pathways to facilitate information exchange and integration. This enables the effective utilization of semantic information from feature maps at different levels to enhance object detection accuracy. However, this method only fuses the feature maps by making them uniform in size, which limits the full utilization of feature information across different scales and imposes certain constraints on multi-scale object detection. To solve this problem, this paper introduces the adaptive spatial feature fusion (ASFF) mechanism [25].
Unlike conventional multi-scale feature fusion methods that rely on element-wise operations or cascaded approaches, the core concept of ASFF is to adaptively learn the spatial fusion weights for feature maps at different scales, thereby dynamically adjusting the fusion strategy for features at various levels. Specifically, ASFF can spatially filter out conflicting information and suppress inconsistencies between features at different scales, thereby enhancing scale invariance. This significantly enhances YOLOv8’s object detection capabilities in complex scenarios, particularly in multi-scale and low-contrast object detection. In addition, ASFF not only significantly improves detection accuracy and robustness but also maintains the model’s computational efficiency, enhancing performance with minimal additional inference overhead. This makes ASFF an exceptionally effective feature fusion strategy, especially in application scenarios that require real-time inference, showing its strong advantages. The structure of ASFF is shown in Figure 5.
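A minimal sketch of the adaptive weighting idea behind ASFF is given below for a single output level, assuming the three input feature maps have already been resized to a common resolution and channel width; the real Detect_ASFF head additionally handles rescaling and channel alignment across levels.

```python
# ASFF-style fusion sketch: per-pixel softmax weights over three feature levels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleASFF(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 convs predict one spatial weight map per input level.
        self.weight_convs = nn.ModuleList([nn.Conv2d(channels, 1, 1) for _ in range(3)])
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feats):  # feats: list of 3 tensors with identical (B, C, H, W)
        logits = torch.cat([conv(f) for conv, f in zip(self.weight_convs, feats)], dim=1)
        weights = F.softmax(logits, dim=1)                       # weights sum to 1 at every pixel
        fused = sum(weights[:, i:i + 1] * feats[i] for i in range(3))
        return self.fuse(fused)

# Usage:
# f = [torch.randn(1, 128, 48, 48) for _ in range(3)]
# SimpleASFF(128)(f).shape  -> (1, 128, 48, 48)
```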

2.5. Experimental Environment

This study is implemented with the open-source machine learning framework PyTorch 1.13.1 on Windows 11, using an 11th Gen Intel(R) Core(TM) i7-11800H CPU @ 2.30 GHz, 16 GB of RAM, an NVIDIA RTX 3060 GPU with 6 GB of VRAM, and CUDA 11.8 for GPU acceleration. The Python version used is 3.8. The training parameters are set as shown in Table 1.
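For reference, the sketch below shows how a baseline YOLOv8 run could be launched with the Ultralytics API using the settings in Table 1. The dataset YAML path is a placeholder, and the MSDA/ASFF modifications are not part of the stock package, so this reproduces only the baseline configuration.

```python
# Baseline YOLOv8 training sketch with the Table 1 hyperparameters (dataset path is hypothetical).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # baseline weights; the improved model would use a modified model YAML
model.train(
    data="rice_foreign_objects.yaml",  # hypothetical dataset config
    epochs=500,
    patience=100,
    batch=16,
    imgsz=384,
    optimizer="auto",
    momentum=0.937,
    weight_decay=0.0005,
    warmup_momentum=0.8,
    close_mosaic=10,
    iou=0.7,
    lrf=0.01,
)
```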

3. Results and Discussion

3.1. Comparison Experiments

To determine the most suitable algorithm as the baseline model, this study first conducted a comprehensive comparative experiment on mainstream object detection models. The selected networks include SSD, FCOS [26], EfficientDet [27], YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9 [28], YOLOv11, and YOLOv12. The selection was based on three key criteria: (1) these models are representative and widely adopted, covering both classical and state-of-the-art object detection architectures; (2) they represent diverse detection paradigms, including anchor-based (e.g., SSD, YOLO series), anchor-free (FCOS), and compound-scaled structures (EfficientDet); (3) they have proven practical value in industrial applications, particularly in food quality inspection and machine vision, which aligns with the real-world relevance of this study. All models were trained on the same dataset with uniform training parameters. In the experiments, the evaluation metrics mainly comprised precision, recall, mAP@[0.5:0.95], GFLOPs, and the number of parameters. mAP@[0.5:0.95] was specifically selected because it provides a more stringent standard, enabling a comprehensive assessment of model performance across different IoU thresholds, and it is particularly suitable for measuring the accuracy of multi-scale object detection tasks. The results are presented in Table 2.
Experimental results show that YOLOv8 achieves an mAP@[0.5:0.95] of 87.5%, demonstrating outstanding performance among the compared models and maintaining high accuracy across different IoU thresholds. Additionally, YOLOv8's computational cost (8.1 GFLOPs) and parameter count (3.007 M) are relatively moderate, and its demand for computational resources is lower than that of other high-accuracy models, achieving a good balance between accuracy and efficiency, as shown in Figure 6. Therefore, considering YOLOv8's performance in terms of accuracy, efficiency, and resource consumption, it was selected as the baseline model for this study.
Additionally, to demonstrate the enhanced detection performance of the improved model, this study conducted comparative experiments between the improved model and the baseline YOLOv8. The comparative experimental results are presented in Table 3.
The comparative experimental results indicate that the improved YOLOv-MA model achieves a clear improvement in mAP across all categories, with increases of 3.0% and 3.5% for clods and corn, respectively. This demonstrates that the enhanced model effectively improves detection accuracy. Figure 7 presents a comparison of detection results between the improved YOLOv-MA and the original YOLOv8, clearly showing that YOLOv-MA yields higher confidence scores.
As shown in Figure 8, the improved YOLOv-MA model outperforms the original YOLOv8 in both the training and validation phases. The training losses, including bounding box loss, classification loss, and DFL loss, decrease more smoothly and rapidly in YOLOv-MA, indicating greater efficiency in object localization, classification, and label learning. During validation, the losses for the improved model are significantly lower than those of the original YOLOv8, particularly in terms of bounding box localization and classification accuracy, demonstrating enhanced generalization ability. In summary, the integration of MSDA and ASFF effectively improves the detection accuracy and generalization performance of YOLOv8.

3.2. Ablation Experiments

3.2.1. Comparative Experiments on Attention Mechanisms

To verify the effectiveness of the MSDA attention module in the improved algorithm, MSDA was compared with the iRMB [29], DLKA [30], EMA, ACmix [31], and MLCA [32] attention modules on YOLOv8, while keeping all other training conditions consistent. The experimental results are shown in Table 4.
In the comparative experiments, the MSDA attention mechanism demonstrated outstanding performance in the YOLOv8 model, achieving an mAP@[0.5:0.95] of 88.0%, a 0.5% improvement over the baseline YOLOv8 (87.5%). Compared with other attention mechanisms, MSDA maintains lower computational complexity and parameter count (8.8 GFLOPs, 3.357 M parameters) while still improving performance. For example, the iRMB attention mechanism achieved an mAP@[0.5:0.95] of 86.6% with 18.5 GFLOPs and 3.361 M parameters, whereas MSDA performed better at a similar parameter count and far lower computational cost. Likewise, the DLKA attention mechanism achieved an mAP@[0.5:0.95] of 87.9% with 17.3 GFLOPs and 6.015 M parameters, whereas MSDA delivered superior accuracy with lower GFLOPs and fewer parameters. These results demonstrate that the MSDA attention mechanism not only enhances model performance but also maintains computational efficiency, proving its effectiveness in the YOLOv8 model.

3.2.2. Comparison of Different Feature Fusion Mechanism Modules

To verify the effectiveness of different feature fusion mechanisms, this study incorporated the ASFF feature fusion mechanism into the Neck layer of YOLOv8 and conducted ablation experiments on various YOLOv8 variants that include combinations of modules such as ASFF, iRMB, DLKA, EMA, ACmix, and MLCA. The detailed experimental results are presented in Table 5.
The experiments indicate that incorporating only the ASFF module increased the mAP@[0.5:0.95] of YOLOv8-ASFF from 87.5% to 89.9%, validating the effectiveness of cross-scale feature interaction. Furthermore, by employing dilated convolutions and a multi-head attention mechanism, the MSDA module refined multi-scale feature modeling, yielding a 0.5% improvement in mAP@[0.5:0.95] (to 88.0%) for YOLOv8-MSDA with only a 0.7 GFLOPs increase in computational complexity. This demonstrates the advantage of MSDA in enhancing feature modeling precision.
Notably, the YOLOv-MA model, which integrates both the ASFF and MSDA modules, exhibited the most significant improvement. Building on YOLOv8-ASFF, it further increased mAP@[0.5:0.95] by 0.9%, reaching a final value of 90.8%. This demonstrates the synergistic effect of the ASFF and MSDA modules, markedly enhancing detection accuracy while keeping the additional computational overhead low (only a 0.7 GFLOPs increase over YOLOv8-ASFF).
Compared with other modules, YOLOv-MA demonstrates superior control over computational complexity while delivering higher detection accuracy, further validating the benefit of combining these two modules. For example, although the DLKA module introduces large-kernel convolutions to enlarge the receptive field (resulting in a 25.2% increase in the number of parameters), its feature fusion process lacks a dynamic weight adjustment mechanism, yielding an mAP@[0.5:0.95] improvement of only 1.2% (88.7%), which is less pronounced than that achieved by YOLOv-MA. The MLCA module, which employs lightweight channel attention and local context aggregation, achieved an mAP@[0.5:0.95] of 90.3% with only a 0.377 M increase in parameters; however, it still does not match the accuracy of YOLOv-MA. This indicates that the local–global feature complementary strategy offers a good balance between accuracy and computational efficiency, although there is still room for improvement. The comparative analysis in Figure 9 reinforces the advantage of the YOLOv-MA model, demonstrating its ability to achieve higher detection accuracy with minimal additional computational cost relative to other feature fusion strategies.
In summary, the YOLOv-MA model, by integrating both “multi-scale feature enhancement” and “adaptive feature fusion” mechanisms, effectively overcomes the accuracy limitations of traditional models in object detection. It demonstrates outstanding real-time detection capabilities in complex scenarios, offering a more efficient and precise solution for future complex object detection tasks, especially in real-time applications that demand high precision.

3.2.3. Model Interpretability and Feature Visualization

Deep neural networks are often highly complex, and to improve their interpretability, feature visualization has emerged as an effective tool. Grad-CAM is a widely used method that generates heatmaps by computing the gradients of each layer’s feature maps in classification or regression tasks, thereby revealing the regions that the model focuses on. The gradient values are obtained by backpropagating the confidence score of the output class through Grad-CAM. In the resulting heatmap, pixels in the feature map with higher gradients are represented by darker red shades, while those with lower gradients are depicted in darker blue shades.
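A minimal hook-based Grad-CAM sketch for one convolutional layer of a PyTorch detector is shown below; the choice of target layer and the function that extracts a scalar confidence score from the model output are assumptions made for illustration, not part of the original visualization pipeline.

```python
# Minimal Grad-CAM sketch using forward/backward hooks on a chosen conv layer.
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, score_fn):
    feats, grads = {}, {}
    fh = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    bh = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    try:
        model.zero_grad()
        score = score_fn(model(image))                        # scalar confidence to explain
        score.backward()
        weights = grads["a"].mean(dim=(2, 3), keepdim=True)   # channel-wise gradient average
        cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True)).detach()
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1] for the heatmap
        return cam                                            # high values -> red, low values -> blue
    finally:
        fh.remove()
        bh.remove()
```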
In this study, Grad-CAM was utilized to perform an interpretability analysis on both the original YOLOv8 model and the improved YOLOv-MA. By visualizing heatmaps for various input images, the areas of focus of the models during object detection were investigated. As shown in Figure 10, the YOLOv8 model is prone to interference from the image background, with the activation regions displaying a scattered attention distribution. Some high-response areas are concentrated on the texture features of the rice, and there are instances of missed targets, indicating that the original YOLOv8 struggles to effectively focus on low-resolution targets in complex backgrounds. In contrast, the improved model proposed in this study performs better, effectively filtering out background information and highlighting target foreign objects during the detection process.

3.3. Discussion

The YOLOv-MA model proposed in this study achieves a significant improvement in rice foreign object detection accuracy by integrating the multi-scale dilated attention (MSDA) mechanism and the adaptive spatial feature fusion (ASFF) module, reaching an mAP@[0.5:0.95] of 90.8%. Its innovation lies in dynamically enhancing multi-scale features and enabling adaptive cross-level semantic interactions, thereby effectively mitigating the core issues of missed and false detections in complex backgrounds. Although rice foreign object detection is of great significance for food safety, related research remains relatively limited: most existing methods target general object detection or other agricultural products and lack task-specific optimization. To fill this gap, this study integrates MSDA and ASFF into the YOLOv8 framework for the first time for this task, builds a high-quality rice foreign object detection dataset, and tailors the optimization to this application, significantly improving the model's ability to identify small, irregular foreign objects in complex scenes.
Beyond providing reliable technical support for rice quality control, the proposed YOLOv-MA model also shows strong applicability to other agricultural scenarios. Visual detection tasks in agriculture often involve multi-scale object recognition and complex environmental interference, placing high demands on feature representation and semantic modeling capabilities. By combining multi-scale dilated attention with adaptive spatial feature fusion, YOLOv-MA shows promise for enhancing small-object detection performance and holds potential for future applications in pest monitoring, fruit ripeness evaluation, and crop health assessment, offering precise and efficient visual solutions for smart agriculture. Moreover, YOLOv-MA exhibits cross-domain transferability. For instance, in power system fault diagnosis, the Hypertuned-YOLO model combined with EigenCAM significantly improves interpretability and localization accuracy [33], and in industrial defect detection, deep ensemble models incorporating weighted boxes fusion enhance the robustness of insulator fault recognition [34]. With further lightweight optimization and deployment adaptation, YOLOv-MA could be extended to such complex detection scenarios, showing broad potential for cross-industry intelligent visual applications.
However, unlike terahertz imaging-based approaches such as AHA-RetinaNet-X, which can penetrate and recognize occluded or low-contrast impurities with a reported mAP of 92.1% [35], the RGB-based YOLOv-MA still faces challenges in detecting visually similar or partially covered foreign objects under extreme conditions. In addition, although YOLOv-MA improves detection accuracy, its computational cost remains higher than that of lightweight models such as the MobileNetV3-enhanced YOLOv5 used for tomato detection, which achieved a 78% parameter reduction and real-time CPU performance [36]. These comparisons highlight the limitations in both generalizability under diverse sensory conditions and real-time deployment on low-power devices. Future research can focus on the collaborative optimization of multimodal sensing and lightweight architectures. On the one hand, hyperspectral or X-ray imaging data can be incorporated to enhance the representation of transparent and low-reflectivity foreign objects, while domain adaptation techniques can be employed to develop dynamically generalized models for varying environments. On the other hand, neural architecture search (NAS) and mixed-precision quantization compression can be explored to achieve real-time inference on edge devices while maintaining high detection accuracy.
In addition, constructing an industrial-grade, standardized dataset covering a wide range of foreign object types (such as microplastics and organic fibres) and designing an end-to-end collaborative sorting system with hardware and software will facilitate the transition of algorithms from theoretical verification to large-scale application. Future work should further expand the dataset to enhance its performance in different environments and improve the generality and practicality of the model.

4. Conclusions

This study proposes YOLOv-MA, an improved object detection model tailored for the detection of foreign objects in rice. By integrating the MSDA attention mechanism and the ASFF feature fusion module into the YOLOv8 architecture, the model significantly enhances its ability to extract and fuse multi-scale features in complex environments. The proposed method achieves an mAP@[0.5:0.95] of 90.8%, surpassing several state-of-the-art object detection models in both accuracy and computational efficiency. Comparative experiments and ablation studies confirm the effectiveness of each module in boosting detection performance. These findings demonstrate the potential of YOLOv-MA for practical applications in intelligent agricultural product screening and provide a solid foundation for further exploration in this field.

Author Contributions

Conceptualization, J.W. and Y.J.; methodology, J.W.; software, H.C.; validation, J.W., Y.J. and M.J.; formal analysis, J.W.; investigation, J.W.; resources, Y.J.; data curation, J.W.; writing—original draft preparation, J.W.; writing—review and editing, J.W., M.J., T.A. and Y.J.; visualization, J.W.; supervision, Y.J.; project administration, Y.J.; funding acquisition, Y.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 61975053, No. 62271191), the Natural Science Foundation of Henan (No. 222300420040), the Program for Science and Technology Innovation Talents in Universities of Henan Province (No. 22HASTIT017, No. 23HASTIT024), the Open Fund Project of the Key Laboratory of Grain Information Processing and Control, Ministry of Education, Henan University of Technology (No. KFJ2021102), the major public welfare projects of Henan Province (No. 201300210100), and the Innovative Funds Plan of Henan University of Technology (No. 2021ZKCJ04).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Payne, K.; O’Bryan, C.A.; Marcy, J.A.; Crandall, P.G. Detection and prevention of foreign material in food: A review. Heliyon 2023, 9, e19574.
  2. Edwards, M.C.; Stringer, M.F. Observations on patterns in foreign material investigations. Food Control 2006, 18, 773–782.
  3. Saeidan, A.; Khojastehpour, M.; Golzarian, M.R.; Mooenfard, M.; Khan, H.A. Detection of foreign materials in cocoa beans by hyperspectral imaging technology. Food Control 2021, 129, 108242.
  4. Wang, Q.; Hameed, S.; Xie, L.; Zhang, Y.; Liu, Y.; Chen, Q. Non-destructive quality control detection of endogenous contaminations in walnuts using terahertz spectroscopic imaging. Food Meas. 2020, 14, 2453–2460.
  5. Yang, W.; Li, D.; Zhu, L.; Kang, Y.; Li, F. A new approach for image processing in foreign fiber detection. Comput. Electron. Agric. 2009, 68, 68–77.
  6. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
  7. Yao, J.; Qi, J.; Zhang, J.; Shao, H.; Yang, J.; Li, X. A real-time detection algorithm for kiwifruit defects based on YOLOv5. Electronics 2021, 10, 1711.
  8. Saydirasulovich, N.; Abdusalomov, A.; Jamil, M.K.; Nasimov, R.; Kozhamzharova, D.; Cho, Y.I. A YOLOv6-based improved fire detection approach for smart city environments. Sensors 2023, 23, 3161.
  9. Wang, Y.; Wang, H.; An, Z. Efficient detection model of steel strip surface defects based on YOLO-V7. IEEE Access 2022, 10, 133936–133944.
  10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  11. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397.
  12. Yang, R.; Li, W.; Shang, X.; Zhu, D.; Man, X. KPE-YOLOv5: An improved small target detection algorithm based on YOLOv5. Electronics 2023, 12, 817.
  13. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors 2023, 23, 7190.
  14. Alhawsawi, A.N.; Khan, S.D.; Rehman, F.U. Enhanced YOLOv8-based model with context enrichment module for crowd counting in complex drone imagery. Remote Sens. 2024, 16, 4175.
  15. Meng, X.; Li, C.; Li, J.; Li, X.; Guo, F.; Xiao, Z. YOLOv7-MA: Improved YOLOv7-based wheat head detection and counting. Remote Sens. 2023, 15, 3770.
  16. Qiu, M.; Huang, L.; Tang, B.H. ASFF-YOLOv5: Multielement detection method for road traffic in UAV images based on multiscale feature fusion. Remote Sens. 2022, 14, 3498.
  17. GB/T 1354-2018; Rice. State Administration for Market Regulation, Standardization Administration of China: Beijing, China, 2018.
  18. Pei, X.; Zhao, Y.; Chen, L.; Guo, Q.; Duan, Z.; Pan, Y.; Hou, H. Robustness of machine learning to color, size change, normalization, and image enhancement on micrograph datasets with large sample differences. Mater. Des. 2023, 232, 112086.
  19. Sun, X. Enhanced tomato detection in greenhouse environments: A lightweight model based on S-YOLO with high accuracy. Front. Plant Sci. 2024, 15, 1451018.
  20. Song, C.; Zhang, F.; Li, J.; Xie, J.; Yang, C.; Zhou, H.; Zhang, J. Detection of maize tassels for UAV remote sensing image with an improved YOLOX model. J. Integr. Agric. 2023, 22, 1671–1683.
  21. Jia, P.; Sheng, H.; Jia, S. LPCF-YOLO: A YOLO-based lightweight algorithm for pedestrian anomaly detection with parallel cross-fusion. Sensors 2025, 25, 2752.
  22. Talaat, F.M.; ZainEldin, H. An improved fire detection approach based on YOLO-v8 for smart cities. Neural Comput. Appl. 2023, 35, 20939–20954.
  23. Zhao, X.; Wang, L.; Zhang, Y.; Han, X.; Deveci, M.; Parmar, M. A review of convolutional neural networks in computer vision. Artif. Intell. Rev. 2024, 57, 99.
  24. Jiao, J.; Tang, Y.-M.; Lin, K.-Y.; Gao, Y.; Ma, A.J.; Wang, Y.; Zheng, W.-S. Dilateformer: Multi-scale dilated transformer for visual recognition. IEEE Trans. Multimed. 2023, 25, 8906–8919.
  25. Li, Y.; Xue, J.; Zhang, M.; Yin, J.; Liu, Y.; Qiao, X.; Zheng, D.; Li, Z. YOLOv5-ASFF: A multistage strawberry detection algorithm based on improved YOLOv5. Agronomy 2023, 13, 1901.
  26. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1922–1933.
  27. Midigudla, R.S.; Dichpally, T.; Vallabhaneni, U.; Wutla, Y.; Sundaram, D.M.; Jayachandran, S. A comparative analysis of deep learning models for waste segregation: YOLOv8, EfficientDet, and Detectron 2. Multimed. Tools Appl. 2025, 1–24.
  28. Sharma, A.; Kumar, V.; Longchamps, L. Comparative performance of YOLOv8, YOLOv9, YOLOv10, YOLOv11 and Faster R-CNN models for detection of multiple weed species. Smart Agric. Technol. 2024, 9, 100648.
  29. Xie, X.; Xu, B.; Chen, Z. Real-time fall attitude detection algorithm based on iRMB. Signal Image Video Process. 2024, 19, 156.
  30. Han, Z.; Cai, Y.; Liu, A.; Zhao, Y.; Lin, C. MS-YOLOv8-based object detection method for pavement diseases. Sensors 2024, 24, 4569.
  31. Li, S.; Wang, S.; Wang, P. A small object detection algorithm for traffic signs based on improved YOLOv7. Sensors 2023, 23, 7145.
  32. Wan, D.; Lu, R.; Shen, S.; Xu, T.; Lang, X.; Ren, Z. Mixed local channel attention for object detection. Eng. Appl. Artif. Intell. 2023, 123, 106442.
  33. Stefenon, S.F.; Seman, L.O.; Klaar, A.C.R.; Ovejero, R.G.; Leithardt, V.R.Q. Hypertuned-YOLO for interpretable distribution power grid fault location based on EigenCAM. Ain Shams Eng. J. 2024, 15, 102722.
  34. Stefenon, S.F.; Seman, L.O.; Singh, G.; Yow, K.C. Enhanced insulator fault detection using optimized ensemble of deep learning models based on weighted boxes fusion. Int. J. Electr. Power Energy Syst. 2025, 168, 110682.
  35. Li, G.; Ge, H.; Jiang, Y.; Zhang, Y.; Jiang, M.; Wen, X.; Sun, Q. Research on wheat impurity identification method based on terahertz imaging technology. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2025, 326, 125205.
  36. Zeng, T.; Li, S.; Song, Q.; Zhong, F.; Wei, X. Lightweight tomato real-time detection method based on improved YOLO and mobile deployment. Comput. Electron. Agric. 2023, 205, 107625.
Figure 1. Representative data augmentation results derived from a single image in the rice foreign object dataset: (a) grayscale, (b) brightness enhanced, (c) contrast adjusted, and (d) Gaussian noise added.
Figure 2. YOLOv8 network architecture diagram.
Figure 3. Architecture of the improved YOLOv8 network.
Figure 4. The model principle of MSDA.
Figure 5. ASFF architecture diagram.
Figure 6. Comparison of metrics among different network architectures.
Figure 7. Detection results of the improved model and YOLOv8 on three different images (a–c).
Figure 8. Comparison of training processes between the improved model and YOLOv8.
Figure 9. Performance comparison of different feature fusion mechanisms.
Figure 10. Visualization comparison between the improved model and YOLOv8 on different images (a–d). Blue indicates regions with zero features, while red represents areas with the maximum number of extracted features.
Table 1. Training parameter settings.

Parameter          Value
epochs             500
patience           100
batch              16
optimizer          Auto
weight_decay       0.0005
momentum           0.937
warmup_momentum    0.8
close_mosaic       10
iou                0.7
imgsz              384
lrf                0.01
Table 2. Comparison of experimental results for different network architectures.

Network Model   Precision [%]   Recall [%]   mAP@[0.5:0.95] [%]   GFLOPs   Params [M]
SSD             96.7            95.8         67.9                 62.8     26.285
FCOS            99.3            99.8         74.1                 161.9    32.155
EfficientDet    99.8            99.6         68.2                 5.2      3.874
YOLOv5          99.9            99.9         84.9                 7.1      2.504
YOLOv6          99.9            99.9         86.6                 11.8     4.234
YOLOv7          99.9            99.8         70.8                 105.2    37.218
YOLOv8          99.9            99.9         87.5                 8.1      3.007
YOLOv9          99.9            99.9         85.4                 7.6      1.972
YOLOv11         99.9            99.9         85.1                 6.3      2.583
YOLOv12         99.9            99.8         83.1                 6.3      2.557
Table 3. Comparison results between the improved model and YOLOv8.

Network Model   Clod    Corn    Screw   Stone   Wheat
YOLOv8          0.899   0.876   0.901   0.869   0.831
YOLOv-MA        0.929   0.911   0.921   0.908   0.873
Table 4. Ablation experimental results of attention mechanisms.

Attention   GFLOPs   Params [M]   mAP@[0.5:0.95] [%]
+iRMB       18.5     3.361        86.6
+DLKA       17.3     6.015        87.9
+EMA        8.2      3.012        87.6
+ACmix      8.8      3.307        87.2
+MLCA       8.2      3.011        87.8
+MSDA       8.8      3.357        88.0
YOLOv8      8.1      3.007        87.5
Table 5. Ablation experimental results of feature fusion modules.

No.   ASFF   MSDA   iRMB   DLKA   EMA   ACmix   MLCA   GFLOPs   Params [M]   mAP@[0.5:0.95] [%]
1     –      –      –      –      –     –       –      8.1      3.007        87.5
2     ✓      –      –      –      –     –       –      10.3     4.380        89.9
3     –      ✓      –      –      –     –       –      8.8      3.357        88.0
4     ✓      ✓      –      –      –     –       –      11.0     4.730        90.8
5     ✓      –      ✓      –      –     –       –      13.4     4.490        90.1
6     ✓      –      –      ✓      –     –       –      14.4     5.628        88.7
7     ✓      –      –      –      ✓     –       –      10.4     4.385        88.7
8     ✓      –      –      –      –     ✓       –      11.0     4.680        89.2
9     ✓      –      –      –      –     –       ✓      10.4     4.384        90.3
