Article

Multi-Scale Attention-Augmented YOLOv8 for Real-Time Surface Defect Detection in Fresh Soybeans

1 China National Packaging and Food Machinery Corporation, Beijing 100083, China
2 Chinese Academy of Agricultural Mechanization Sciences Group Co., Ltd., Beijing 100083, China
3 State Key Laboratory of Agricultural Equipment Technology, Beijing 100083, China
* Authors to whom correspondence should be addressed.
Processes 2025, 13(10), 3040; https://doi.org/10.3390/pr13103040
Submission received: 14 August 2025 / Revised: 27 August 2025 / Accepted: 4 September 2025 / Published: 23 September 2025
(This article belongs to the Special Issue Processes in Agri-Food Technology)

Abstract

Ensuring the surface quality of fresh soybeans is critical for maintaining their commercial value and consumer confidence. However, traditional manual inspection remains labor-intensive, subjective, and inadequate for real-time, high-throughput sorting. In this study, we present a multi-scale attention-augmented You Only Look Once version 8 (YOLOv8) framework tailored for real-time surface defect detection in fresh soybeans. The proposed model integrates two complementary attention mechanisms—Squeeze-and-Excitation (SE) and Multi-Scale Dilated Attention (MSDA)—to enhance the detection of small, irregular, and low-contrast defects under complex backgrounds. Rather than relying on cross-model comparisons, we perform systematic ablation studies to evaluate the individual and combined contributions of SE and MSDA across diverse defect categories. Experimental results from a custom-labeled soybean dataset demonstrate that the integrated SE+MSDA model achieves superior performance in terms of precision, recall, and Mean Average Precision (mAP), particularly for challenging categories such as wormholes and speckles. The proposed framework provides a lightweight, interpretable, and deployment-ready solution for intelligent agricultural inspection, with potential applicability to broader food quality control tasks.

1. Introduction

Food safety and external appearance quality are critical concerns in the modern food industry. The visual features of external surfaces—such as discoloration, deformation, or blemishes—can signal early signs of quality degradation, including microbial contamination or spoilage, thereby making visual inspection an effective frontline method for identifying unsafe products before they enter the supply chain [1,2]. As a nutritionally dense and widely consumed legume, fresh soybeans (also known as edamame) are increasingly embraced by health-conscious consumers and regarded as a symbol of healthy diets [3,4]. However, surface defects such as mechanical breakage, color fading, mold growth, and pest infestation not only diminish their visual appeal but also raise concerns about quality and safety, reinforcing the need for robust surface inspection mechanisms during processing [5,6]. Currently, the quality assessment of agricultural products like fresh soybeans still relies heavily on manual inspection [6,7]. Such approaches are labor-intensive, time-consuming, and prone to subjectivity, often resulting in inconsistent outcomes across different personnel and production lines. Moreover, traditional computer vision methods based on handcrafted features are limited in their ability to cope with the complex visual diversity and unpredictable environmental variations characteristic of real-world soybean defects [8]. These limitations underscore the urgency of developing intelligent, real-time detection systems that combine accuracy with adaptability for deployment in automated sorting and quality control pipelines.
With the rapid progress of deep learning, automated visual inspection powered by convolutional neural networks (CNNs) has shown promise in addressing complex defect detection tasks in food quality control. Among them, object detection algorithms have evolved from two-stage detectors like R-CNN (Region-based Convolutional Neural Networks) to one-stage architectures, with the YOLO (You Only Look Once) family emerging as a dominant real-time solution [9,10]. Successive improvements in YOLOv2 to YOLOv7 have progressively enhanced accuracy and speed through innovations such as multi-scale feature extraction, CSPNet (Cross-Stage Partial Network) backbones, and Mosaic augmentation [11,12]. YOLOv8 further integrates anchor-free design, segmentation tasks, and a more compact backbone, offering state-of-the-art performance across various datasets and applications [13,14]. Despite the strength of YOLO-based detectors, their direct application to fresh soybean surface inspection faces several hurdles. The defects in soybean pods are diverse in type and subtle in visual presentation, ranging from small specks and tiny cracks to barely visible mold, often exhibiting low contrast against natural backgrounds. Additionally, real-world scenes with non-uniform lighting, overlapping pods, and cluttered conveyors pose further difficulties for accurate localization and classification. These challenges necessitate models that are both lightweight and robust to visual variability.
Recent advances have shown that incorporating attention mechanisms and multi-scale feature fusion can significantly improve the precision of CNNs in defect detection, particularly for small or ambiguous targets. The Squeeze-and-Excitation (SE) block, a widely adopted channel attention mechanism, enhances network focus on discriminative features while suppressing irrelevant channels [15]. It has proven effective in various agricultural applications, such as identifying minor defects on apples and passion fruit [16,17]. Similarly, spatial attention modules like CBAM [18] enable the model to identify not just “what” but also “where” to focus, improving localization under noisy conditions. More recently, Multi-Scale Dilated Attention (MSDA) structures have been introduced to enhance receptive fields without additional downsampling, effectively capturing both global context and fine-grained texture in complex backgrounds [19,20].
To address the challenges of real-time defect detection in fresh soybeans—characterized by small, irregular, and low-contrast surface anomalies—we propose an attention-augmented object detection framework based on YOLOv8. Specifically, the model integrates a SE module to enhance channel-wise feature calibration and improve sensitivity to subtle defects, as well as a novel MSDA mechanism that leverages dilated convolutions and cross-scale fusion to capture heterogeneous features in visually complex environments. To evaluate the model’s performance under realistic conditions, we manually constructed a domain-specific dataset encompassing seven common defect categories, including normal pods, discoloration, insect holes, rust spots, and foreign objects (Figure 1), all annotated from real production scenarios. We further built a high-speed conveyor test platform (Figure 2) that replicates actual sorting-line conditions—such as overlapping pods, motion blur, and variable lighting—and includes a pneumatic rejection system, simulating the demands of multimodal coordination and real-time defect removal. This combined setup enables comprehensive validation of both model accuracy and deployment feasibility in practical agricultural environments.
The main contributions of this study are as follows: (1) we design an enhanced YOLOv8-based model tailored for fresh soybean surface quality inspection, with improved detection of small and ambiguous defects; (2) propose a novel Multi-Scale Dilated Attention (MSDA) module to enhance global-local feature fusion under noisy and variable backgrounds; and (3) construct a real-world annotated dataset of fresh soybean defects and conduct extensive experiments, demonstrating superior performance compared to baseline detectors. This research fills a critical gap in the domain of intelligent visual inspection for agricultural products and provides a reference architecture for broader multimodal quality control applications [21,22]. Future directions may include integration with hyperspectral imaging or lightweight edge deployment to further improve robustness and applicability under diverse industrial settings.
The rest of the paper is structured as follows. Section 2 introduces the dataset collection, preprocessing procedures, and details of the proposed SE and MSDA modules integrated into YOLOv8n. Section 3 reports the experimental design and results, while Section 4 discusses the findings, limitations, and practical implications. Section 5 concludes the paper and suggests directions for future research.

2. Materials and Methods

2.1. Image Acquisition and Dataset

The experimental dataset was collected at the Engineering Laboratory of the Chinese Academy of Agricultural Mechanization Sciences, located in Chaoyang District, Beijing. The target samples were freshly harvested edamame (fresh soybeans) obtained from Jingyuan Commune in Changping District, Beijing. After harvest, the edamame pods were imaged using two customized Hikvision line-scan cameras (Hangzhou Hikvision Digital Technology Co., Ltd., Hangzhou, China) [8]. The cameras were positioned at a distance of 0.3–5 m with a field of view width set to 60 cm. To capture the dynamic state of the falling pods, the cameras were placed on opposite sides of the soybean trajectory, collecting both top-view and bottom-view images. The initial falling speed of the soybeans was approximately 2.5 m/s [6]. In total, about 50 kg of fresh edamame was collected, from which 12,125 raw images were initially acquired. Since most harvested pods were normal, constructing a balanced dataset was challenging. To mitigate class imbalance, a subset of 1441 images was carefully selected to ensure sufficient representation of defective categories while maintaining diversity. Each image had a resolution of 1024 × 192 pixels. The dataset includes five categories of external surface defects (see Figure 1)—aged pods, broken pods, single-seed pods, rust-spotted pods, and insect-damaged pods—as well as various types of foreign objects, such as twigs, leaves, and stones. The overall structure of the real-time image acquisition and sorting system is illustrated in Figure 3, where the placement of the cameras, auxiliary lighting, and pneumatic ejection modules is shown within the integrated hardware platform. This setup simulates realistic production environments with high-speed motion, variable lighting, and occlusion, making the dataset a representative benchmark for robust agricultural inspection.

2.2. Data Preprocessing

To construct the soybean surface defect and foreign object dataset, all images were manually annotated using the LabelImg tool (https://github.com/HumanSignal/labelImg, accessed on 1 May 2025), generating annotations (see Figure 4a) suitable for training with the YOLOv8n framework. In total, 2715 annotated instances were labeled across 1441 images. The dataset covers seven categories, indexed as follows for training: 0 = normal, 1 = foreign object, 2 = damage, 3 = insect hole, 4 = spot, 5 = single pod, and 6 = color deterioration. The corresponding distributions are: 414 normal pods (17.2%), 505 foreign objects (20.9%), 431 damaged pods (17.9%), 186 insect holes (7.7%), 408 spots (16.9%), 378 single pods (15.7%), and 393 color-deteriorated pods (16.3%). To enhance the model’s generalization ability, various data augmentation techniques were applied, including image rotation, flipping, brightness and contrast adjustment, and random cropping [23], as sketched below. To address the class imbalance issue in the dataset, oversampling and undersampling strategies were employed where appropriate [24].
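A minimal torchvision sketch of such an augmentation pipeline; the parameter ranges are illustrative assumptions (the paper does not state them), and for detection training the bounding boxes must be transformed together with the image (e.g., via torchvision.transforms.v2 or Albumentations):

```python
import torchvision.transforms as T

# Illustrative image-level augmentations; the ranges below are assumptions.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                            # flipping
    T.RandomVerticalFlip(p=0.5),
    T.RandomRotation(degrees=15),                             # rotation
    T.ColorJitter(brightness=0.2, contrast=0.2),              # brightness/contrast
    T.RandomResizedCrop(size=(640, 640), scale=(0.8, 1.0)),   # random cropping
])
```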
All objects in the images were annotated with bounding boxes and corresponding class labels. Initially, annotation files were generated in XML format. Since YOLOv8n does not directly support XML annotations, the files were converted into YOLO-compatible TXT format. Each TXT file includes the following information: (1) Class ID (as a numeric label); (2) x-coordinate of the bounding box center (normalized by image width); (3) y-coordinate of the bounding box center (normalized by image height); (4) width of the bounding box (normalized), and (5) height of the bounding box (normalized). To intuitively demonstrate the annotation results, representative examples of the labeled images are shown in Figure 4b, where each defect type is assigned a distinct color and numeric identifier, corresponding to the categories defined in the dataset.
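A minimal sketch of this conversion, assuming LabelImg’s Pascal VOC-style XML output; the class-name-to-ID mapping is hypothetical and must match the annotation names actually used:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical mapping onto the seven category indices defined in Section 2.2.
CLASS_IDS = {"normal": 0, "foreign object": 1, "damage": 2, "insect hole": 3,
             "spot": 4, "single pod": 5, "color deterioration": 6}

def voc_to_yolo(xml_path: str, out_dir: str) -> None:
    """Convert one Pascal VOC XML annotation file to a YOLO-format TXT file."""
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls = CLASS_IDS[obj.find("name").text]
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO line: class_id x_center y_center width height (all normalized).
        cx, cy = (xmin + xmax) / 2 / w, (ymin + ymax) / 2 / h
        bw, bh = (xmax - xmin) / w, (ymax - ymin) / h
        lines.append(f"{cls} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    (Path(out_dir) / (Path(xml_path).stem + ".txt")).write_text("\n".join(lines))
```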

2.3. YOLOv8 Architecture and Improvements

2.3.1. YOLOv8n Baseline

YOLOv8 is a one-stage object detection framework that balances detection accuracy and computational efficiency. The YOLOv8 family includes five model variants—YOLOv8n (Nano), YOLOv8s (Small), YOLOv8m (Medium), YOLOv8l (Large), and YOLOv8x (Extra Large) [25]—which progressively increase in complexity, computational load, and accuracy. However, higher model capacity typically comes at the cost of reduced detection speed. To meet the real-time and high-precision requirements of agricultural applications, we selected the YOLOv8n model as the base detector for identifying surface defects and foreign objects in fresh soybeans. YOLOv8n offers excellent real-time performance and computational efficiency, making it suitable for deployment in resource-constrained environments such as embedded systems and mobile platforms, while maintaining sufficient detection accuracy [26].
The YOLOv8n architecture comprises four main components: the input module, backbone, neck, and prediction head. Compared to earlier YOLO variants, YOLOv8n features a more streamlined and efficient architecture optimized for speed and real-time responsiveness. At the input stage, YOLOv8n employs multi-scale data augmentation techniques [27], including random scaling, cropping, flipping, brightness adjustment, and contrast variation. These strategies enhance the model’s robustness to object scale variability and improve generalization. Additionally, YOLOv8n supports an anchor-free detection mechanism [25,28], allowing the model to directly predict object centers and bounding boxes during training without requiring predefined anchor boxes. This design simplifies configuration and reduces computational complexity.
For feature extraction, the backbone incorporates a lightweight architecture based on Cross-Stage Partial Network (CSPNet) and Depthwise Separable Convolutions [25]. CSPNet divides feature maps into two parts for parallel convolution and fusion operations, effectively enhancing feature representation while reducing redundant computation. The prediction head uses a fully convolutional anchor-free design that directly performs classification and bounding box regression at each pixel location. It outputs class confidence scores and bounding box coordinates for each detection. This design simplifies training, reduces computational overhead, and improves detection performance in complex scenes involving overlapping and densely packed targets [29]. Through the integration of multi-scale augmentation and anchor-free detection, YOLOv8n significantly accelerates inference while maintaining high accuracy, making it well-suited for the real-time surface quality inspection of fresh soybeans in practical settings.
To visually illustrate the proposed enhancements, Figure 5 presents the architecture of the improved YOLOv8n model used for the real-time detection of soybean surface defects and foreign objects. The model consists of three main components: the backbone, neck, and detection head. The backbone adopts a streamlined CSPNet-based structure, while the neck integrates a Multi-Scale Dilated Attention (MSDA) mechanism to enhance feature aggregation across scales. Additionally, a lightweight SE attention module is inserted after the SPPF (Spatial Pyramid Pooling—Fast) block to strengthen the channel-wise feature representation. These architectural modifications aim to improve the model’s ability to detect small and overlapping objects under complex backgrounds with minimal computational overhead.

2.3.2. Squeeze-and-Excitation (SE) Attention Module

Although YOLOv8n is a lightweight variant of the YOLOv8 family with high detection speed and efficiency, its reduced model complexity may limit its ability to capture fine-grained features, especially in complex scenes with small or subtle defects. In the context of fresh soybean defect and foreign object detection, accurate identification often depends on subtle visual cues and fine spatial variations. To address this challenge, we integrated the Squeeze-and-Excitation (SE) attention mechanism into the backbone of the YOLOv8n network to enhance channel-wise feature discrimination and improve detection precision.
The SE block is a lightweight channel attention module that strengthens informative features while suppressing less relevant or redundant ones by adaptively recalibrating channel-wise responses [15]. It has been widely adopted in image classification and object detection tasks, showing its effectiveness in improving feature representation with minimal computational overhead [30]. In agricultural contexts, similar attention mechanisms have also been applied to enhance fine-grained detection, especially when the visual difference between healthy and defective produce is subtle [31]. It consists of three key steps:
Squeeze: Global average pooling is applied to each channel to capture global spatial information, resulting in a channel descriptor vector. Given an input feature map $X \in \mathbb{R}^{H \times W \times C}$, the squeezed descriptor $z_c$ for channel $c$ is computed as:

$$ z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j), $$

where $H$ and $W$ denote the height and width of the input feature map, respectively.
Excitation: The channel descriptor vector $z$ is passed through two fully connected (FC) layers with a ReLU activation followed by a Sigmoid function to generate channel-wise weights:

$$ s = \sigma\big( W_2 \cdot \delta( W_1 \cdot z ) \big), $$

where $W_1$ and $W_2$ are learnable weight matrices, $\delta$ denotes the ReLU activation, and $\sigma$ is the Sigmoid function.
Recalibration: The original feature map $X$ is reweighted by multiplying each channel by its corresponding scalar weight $s_c$, yielding the recalibrated output $\tilde{X}_c$:

$$ \tilde{X}_c = s_c \cdot X_c. $$
This mechanism allows the network to emphasize more informative channels while suppressing those less relevant to the task, thus improving the feature representation quality (see Figure 6 for the SE module structure) [15].
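For concreteness, here is a minimal PyTorch sketch of these three steps, following the standard formulation in [15]; the reduction ratio of 16 in the excitation bottleneck is the common default from the original paper, not a value reported here:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention (Hu et al. [15])."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
        self.fc = nn.Sequential(             # excitation: FC -> ReLU -> FC -> Sigmoid
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # channel weights s
        return x * s                          # recalibration: reweight each channel
```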
In our implementation, the SE module is inserted after the SPPF module at the end of the YOLOv8n backbone. The SPPF module aggregates multi-scale contextual information, while the SE module further refines these features by adaptively emphasizing channels with higher task relevance. This placement ensures that high-level semantic features are recalibrated before being passed to the neck and prediction head, enhancing the model’s sensitivity to small and visually similar defects. Despite its simplicity, the SE module introduces minimal additional computational cost, making it suitable for real-time agricultural applications [32]. Our experiments show that incorporating SE attention significantly improves detection performance, particularly under complex backgrounds and for small target instances. It enhances the model’s generalization and feature sensitivity, contributing to more accurate classification and localization of defective soybeans and foreign objects.

2.3.3. MSDA: Multi-Scale Dilated Attention Module

To enhance YOLOv8n’s ability to detect surface defects and foreign objects in fresh soybeans—particularly under complex backgrounds and across varied object scales—we introduce a Multi-Scale Dilated Attention (MSDA) module. Surface anomalies in soybeans often exhibit diverse shapes, sizes, and distributions, making them difficult to capture using standard convolutional layers with fixed receptive fields. While previous studies have explored multi-scale attention mechanisms and fusion strategies such as BiFPN to address these challenges [33], these approaches are often computationally intensive and designed for larger detection models. In contrast, our MSDA design offers a lightweight yet effective solution by integrating dilated convolutions and scale-adaptive attention into a compact structure. This enables the model to aggregate both global and local features while maintaining high inference speed, making it well-suited for real-time agricultural applications on resource-constrained devices.
The dynamic weighting strategy of MSDA draws from broader research on adaptive attention [30], allowing the model to prioritize the most relevant scale-aware information while suppressing noise. In agricultural applications, such fine-scale adaptivity is crucial—particularly when dealing with variable lighting, cluttered backgrounds, and highly localized defects [34]. Our design ensures these capabilities are preserved even under real-time constraints typical of edge-device deployment [32].
The MSDA module consists of three core components: multi-scale convolutions, dilated convolutions, and attention-based weighting. First, given an input feature map $X \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ are the height and width and $C$ is the number of channels, multiple convolution kernels of different scales are applied to extract features:

$$ X_i = f_{conv}(X; k_i), $$

where $f_{conv}$ denotes the convolution operation and $k_i$ is the kernel size at scale $i$, yielding feature maps $X_1, X_2, \ldots, X_n$. To enhance contextual perception, dilated convolutions are applied. Given a dilation rate $r$, the operation is defined as:

$$ y(p) = \sum_{s=1}^{S} w(s) \cdot x(p + r \cdot s), $$

where $y(p)$ is the output at position $p$, $x(\cdot)$ denotes the input feature value, and $w(s)$ is the kernel weight at location $s$ (with $S$ the kernel size). This allows the model to enlarge the receptive field without increasing computational burden, improving sensitivity to dispersed or small-scale features. Finally, MSDA employs an attention mechanism to assign adaptive weights $\alpha_i$ to each scale’s feature map based on its relevance:

$$ X_{attn} = \sum_{i=1}^{n} \alpha_i \cdot X_i, \quad \text{where} \quad \sum_{i=1}^{n} \alpha_i = 1. $$

The attention weights $\alpha_i$ are dynamically calculated using a scoring function $f_{attn}$:

$$ \alpha_i = \frac{\exp\big( f_{attn}(X_i) \big)}{\sum_{j=1}^{n} \exp\big( f_{attn}(X_j) \big)}. $$
These formulations follow standard definitions in the attention mechanism literature [30]. This adaptive weighting mechanism enables the model to focus on the most informative scale-specific features while suppressing redundant or irrelevant information. As illustrated in Figure 7, the MSDA module effectively enhances YOLOv8n’s multi-scale representation and contextual awareness, resulting in improved robustness and accuracy for detecting surface defects and foreign objects in fresh soybeans. This improvement proves especially valuable in challenging environments with cluttered backgrounds or scale variations.
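Since the paper does not include a reference implementation of MSDA, the following PyTorch sketch illustrates the three components described above under stated assumptions: fixed 3 × 3 kernels with per-branch dilation rates (1, 2, 3) stand in for the multi-scale dilated branches, and a shared pooling-based scoring head plays the role of the scoring function $f_{attn}$:

```python
import torch
import torch.nn as nn

class MSDA(nn.Module):
    """Illustrative Multi-Scale Dilated Attention: parallel dilated branches
    fused by softmax-normalized scalar weights (alpha_i in the text)."""
    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        # One 3x3 branch per dilation rate; padding = dilation keeps H x W fixed.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        # Scoring head f_attn: pool each branch map to one scalar logit.
        self.score = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, 1, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]            # X_i per scale
        logits = torch.cat([self.score(f) for f in feats], dim=1)  # (B, n, 1, 1)
        alpha = torch.softmax(logits, dim=1)                       # sum_i alpha_i = 1
        return sum(alpha[:, i:i + 1] * f for i, f in enumerate(feats))
```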

2.4. Experimental Configuration

To verify the effectiveness of the proposed improvements to the YOLOv8n model, a series of experiments were conducted using a customized hardware and software environment. All experiments were implemented on a workstation running Windows 10, equipped with an Intel Core i3-12490F CPU (Intel Corporation, Santa Clara, CA, USA), NVIDIA GeForce RTX 3080 GPU (10 GB VRAM; NVIDIA Corporation, Santa Clara, CA, USA), and 32 GB of RAM (Kingston Technology Company, Inc., Fountain Valley, CA, USA). The deep learning framework used was PyTorch 2.3.1, with Python 3.11 as the programming language. All model training and evaluation were performed using the PyCharm Professional (version 2024.1; JetBrains s.r.o., Prague, Czech Republic) Integrated Development Environment. The training process utilized the improved YOLOv8n model as the base detector. The input image size was fixed at 640 × 640 pixels, the batch size was set to 64, and the initial learning rate was 0.01. The number of training epochs was 200. Data augmentation techniques such as random flipping, rotation, brightness adjustment, and contrast variation were applied to enhance generalization. The dataset was randomly split into training (80%), validation (10%), and testing (10%) subsets. All images were manually annotated with bounding boxes and corresponding class labels. To address class imbalance, a combination of oversampling and undersampling strategies was applied to ensure sufficient training samples for minority classes. To ensure fair and comprehensive evaluation, multiple performance indicators were used, including Precision, Recall, Mean Average Precision (mAP@50 and mAP@50:95), model size (Parameters), floating-point operations (FLOPs), and inference speed (FPS). Detailed metric definitions and corresponding equations are provided in Section 3.1.
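As an illustration of this setup, training with the Ultralytics YOLOv8 API would look roughly as follows; `yolov8n-se-msda.yaml` and `soybean.yaml` are hypothetical file names for the modified model definition and the dataset configuration, and registering the custom SE/MSDA modules with the framework is assumed to have been done separately:

```python
from ultralytics import YOLO

# Hypothetical model YAML with SE inserted after SPPF and MSDA in the neck;
# hyperparameters mirror Section 2.4 (640 x 640 input, batch 64, lr 0.01, 200 epochs).
model = YOLO("yolov8n-se-msda.yaml")
results = model.train(
    data="soybean.yaml",  # hypothetical dataset config: image paths + 7 class names
    imgsz=640,
    batch=64,
    lr0=0.01,
    epochs=200,
)
metrics = model.val()     # reports precision, recall, mAP@50, and mAP@50:95
```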
The training and validation processes were visually monitored using loss and metric curves automatically generated by the YOLOv8 framework. As shown in Figure 8, all training losses (box, classification, and distribution focal loss) consistently decreased over 100 epochs, indicating a stable optimization process. Correspondingly, the evaluation metrics including precision, recall, mAP@50, and mAP@50–95 on the validation set exhibited a steady upward trend, demonstrating improved detection accuracy and generalization capability. These curves confirm the effectiveness and convergence of the proposed enhancements during training.

3. Experiments and Results

3.1. Evaluation Metrics

To quantitatively evaluate the detection performance of the improved YOLOv8n model, several widely adopted object detection metrics were employed:
(1) Precision. Precision measures the proportion of correctly identified positive samples among all samples predicted as positive [9]:
$$ \mathrm{Precision} = \frac{TP}{TP + FP}, $$
where TP (True Positive) denotes correctly detected targets, and FP (False Positive) denotes incorrectly predicted targets.
(2) Recall. Recall quantifies the proportion of correctly predicted positive samples among all actual positives [9]:
$$ \mathrm{Recall} = \frac{TP}{TP + FN}, $$
where FN (False Negative) denotes the missed targets that were not detected.
(3) Mean Average Precision (mAP). Mean Average Precision reflects the overall detection accuracy across all classes. In this study, two forms are reported: mAP@50, computed at an IoU threshold of 50%, and mAP@50:95, averaged over IoU thresholds from 50% to 95% in steps of 5% [10].
The Average Precision (AP) for a single class is computed as the area under the precision-recall curve:

$$ AP = \int_{0}^{1} p(r)\, dr, $$

where $p(r)$ is the precision as a function of recall $r$. The mean of AP across all $N$ classes is:

$$ mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i. $$

A small numerical sketch of this computation is given at the end of this subsection, after item (4).
(4) Others. Parameters refer to the total number of learnable weights in the model, reflecting model size and memory requirements. FLOPs (Floating Point Operations) indicate the model’s computational cost during inference, typically reported in GFLOPs. Frames per second (FPS) denotes how many images the model can process per second, reflecting real-time performance. Higher FPS is critical for deployment in agricultural automation scenarios.
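To make the AP definition concrete, the following is a small numerical sketch that integrates a sampled precision-recall curve with the trapezoidal rule; the toy values are assumptions for illustration only and are not results from this study (the full COCO-style protocol additionally interpolates precision at fixed recall points):

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under a sampled precision-recall curve (trapezoidal rule)."""
    order = np.argsort(recall)
    r, p = recall[order], precision[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

# Toy PR samples for one class (assumed values, for illustration only).
recall = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precision = np.array([1.0, 0.95, 0.90, 0.85, 0.70, 0.50])
ap = average_precision(recall, precision)  # 0.830 for these samples
map_value = np.mean([ap])                  # mAP averages AP over all N classes
print(f"AP = {ap:.3f}, mAP = {map_value:.3f}")
```

FPS, in turn, is commonly estimated as the reciprocal of the mean per-image inference latency. A minimal PyTorch timing sketch follows; note that CUDA kernels execute asynchronously, so the device must be synchronized before reading the clock:

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, n_images: int = 200,
                shape=(1, 3, 640, 640), device: str = "cuda") -> float:
    """Estimate FPS by averaging repeated forward passes on a dummy input."""
    model = model.eval().to(device)
    x = torch.randn(shape, device=device)
    for _ in range(10):              # warm-up passes (allocator, cuDNN autotune)
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()     # flush queued kernels before timing
    start = time.perf_counter()
    for _ in range(n_images):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return n_images / (time.perf_counter() - start)
```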

3.2. Performance of the Enhanced YOLOv8n Model

To establish the effectiveness of the proposed SE+MSDA enhancements under practical conditions, we evaluated the final trained model on a held-out test set. As shown in Figure 9, the model achieved high detection accuracy across all seven defect categories. The average mAP@50 reached 0.949, with individual class average precisions (AP) ranging from 0.910 (foreign object) to 0.970 (normal pod). These results highlight the model’s robustness in detecting both subtle and complex defects under realistic conditions. The normalized confusion matrix further reveals strong classification confidence, with most diagonal elements exceeding 0.90. Minor confusion was observed between background and low-contrast categories such as insect holes and foreign objects, indicating the challenges posed by low-visibility anomalies.
In addition, Table 1 provides a quantitative comparison with baseline YOLO variants. The proposed SE+MSDA-YOLOv8n achieves the highest overall precision (94.2%) and mAP@50 (95.1%) among all models, outperforming YOLOv5n, YOLOv8n, and YOLOv10n. Although YOLOv10n attains the highest mAP@50:95 (63.7%), our method strikes a better balance between accuracy and efficiency, with comparable or fewer parameters (2.66M) and lower FLOPs (7.4 × 10⁹) than YOLOv8n. Overall, the proposed model demonstrates consistent and well-balanced performance, laying the foundation for further investigation via ablation studies.

3.3. Ablation Study on Attention Modules for Unified Edamame Defect Detection

Given the specific focus of this study on optimizing YOLOv8 for real-time surface defect detection in fresh soybeans, we did not include direct comparisons with other mainstream object detection algorithms such as Faster R-CNN, SSD (Single Shot MultiBox Detector), and YOLOv6. These models, while widely used, typically exhibit significantly higher computational complexity and lower inference speed compared to YOLOv8n, making them less suitable for real-time applications in resource-constrained agricultural environments. Instead, we focused on evaluating the performance gains introduced by the Squeeze-and-Excitation (SE) attention module and the Multi-Scale Dilated Attention (MSDA) mechanism through rigorous ablation studies. These experiments were conducted to assess how each attention module, individually and jointly, contributes to the model’s detection accuracy, efficiency, and robustness under unified conditions.
The ablation results are summarized in Table 2. Although the integration of the SE module slightly reduced precision, it significantly improved recall by 5.8 percentage points, mAP@50 by 1.9 points, and mAP@50:95 by 1.3 points, indicating better overall detection performance. The SE module introduces negligible computational overhead while slightly increasing inference time. This improvement stems from SE’s ability to dynamically reweight channel-wise features, enhancing relevant features and suppressing irrelevant ones. This channel-wise recalibration improves sensitivity to ambiguous or edge-case edamame samples, particularly boosting recall.
Upon introducing the MSDA module, precision decreased by 0.6 percentage points, while recall increased by 1 point. Both mAP@50 and mAP@50:95 experienced a slight decline. However, the parameter count and FLOPs were reduced, and inference speed increased. These effects are attributable to MSDA’s multi-scale feature extraction and dilated convolution, which expand the receptive field without increasing resolution. Although MSDA enhances contextual awareness and activates subtle defect regions, it can also introduce spatial noise and false positives due to over-attending to ambiguous background textures.
Finally, combining SE and MSDA modules yielded the best performance across all metrics, with precision and recall both improving, and a 3-point increase in mAP@50 and a 2.7-point increase in mAP@50:95 compared to the baseline YOLOv8n. The combination demonstrated complementarity between global channel recalibration (SE) and local multi-scale spatial attention (MSDA), achieving a balance between detection accuracy and efficiency.
The visual comparisons in Figure 10 further illustrate the effect of different attention modules on detection quality. As shown in the examples, the baseline YOLOv8n (Figure 10a) tends to miss wormholes or misclassify normal beans due to insufficient focus on subtle features. The MSDA-enhanced model (Figure 10b) improves localization by expanding spatial receptive fields, but occasionally produces false positives due to over-activation of background textures. The SE-enhanced variant (Figure 10c) demonstrates improved recall but with slightly conservative bounding. In contrast, the combined SE+MSDA model (Figure 10d) shows the most balanced and robust performance, effectively identifying multiple small or overlapping defects with precise localization. These qualitative results validate the complementary benefits of SE and MSDA in enhancing the model’s sensitivity and robustness under complex detection scenarios.

3.4. Ablation Study on Defect-Specific Detection Performance of Attention Mechanisms

In real-world production scenarios, it is essential to accurately detect and classify individual categories of surface defects in fresh soybeans, such as spots, foreign objects, damage, single pods, overripe beans, and wormholes. A fine-grained evaluation of the model’s performance across defect types offers two critical benefits. First, it helps identify which specific defect categories pose challenges to the detection model. Second, it prevents the overall mAP from concealing poor performance on rare but critical defect types, which may be underrepresented in the dataset. As shown in Table 3, the baseline YOLOv8n model achieved the highest precision in detecting single pods (93.6%), followed by foreign objects and damage. In contrast, wormhole defects had the lowest precision (84.6%) and an especially low recall of 51.9%, indicating a significant number of missed detections. This is likely due to the irregular shapes and colors of wormholes, which make them hard to distinguish from the background. The mAP@50 and mAP@50:95 values further confirm that YOLOv8n performs well for well-defined defects like single pods, but struggles to localize small or obscure regions such as wormholes and foreign objects.
When incorporating the SE attention module, precision decreased in most categories—by 10 percentage points for damage, 8.4 points for single pods, and 9.3 points for normal samples—likely due to more conservative bounding. However, precision for wormholes improved significantly (+14.6 points), and recall showed major gains across all classes, particularly wormholes (+21.2 points). The SE module enhances feature sensitivity by learning channel-wise importance, making the model more capable of identifying subtle or blurred regions. As a result, although some precision is sacrificed, the overall defect coverage improves—especially for difficult categories like wormholes. The MSDA module also brought improvements to certain classes. For instance, precision rose for spot (+8.6 points) and overripe categories (+1.9 points), while wormhole recall again increased by 21.2 points. While most mAP@50 values remained stable, the mAP@50:95 for wormholes increased by 5.5 points. These results suggest that MSDA is particularly effective for detecting small or complex defects, owing to its capacity to capture multi-scale spatial features. Nevertheless, some noise was introduced for well-separated classes such as normal or single pods, resulting in minor drops in precision.
Finally, the combined SE + MSDA configuration outperformed all other variants across nearly all metrics and defect categories. Precision improved for every class (except for a slight decline of 0.7 points for single pods), while recall also improved across the board, particularly for wormholes (+11.6 points) and normal beans (+6.5 points). The mAP@50 and mAP@50:95 values showed consistent gains across all classes. These findings demonstrate the complementary nature of SE and MSDA: together, they enhance the model’s capacity to balance precision and recall, improve sensitivity to small or complex defects, and reduce false negatives, resulting in a robust and well-generalized detection model. Importantly, these results highlight that the SE and MSDA modules address distinct challenges—SE enhances sensitivity to subtle, low-contrast defects such as wormholes, while MSDA improves robustness under overlapping and cluttered scenes. Their integration therefore provides complementary benefits specific to fresh soybean defect detection, rather than a simple stacking of modules.
Figure 11 offers a comprehensive visual comparison of the detection performance across different models and defect categories. The hollow bars represent average metric values (precision, recall, mAP@50, and mAP@50:95) for each model, while the overlaid dotted lines depict defect-specific trends across seven categories: Spot, Foreign Object, Damage, Single Pod, Overripe, Wormhole, and Normal. Notably, the SE+MSDA-YOLOv8n variant consistently yields the highest average performance across all metrics, confirming its robust generalization. However, the visual patterns also reveal important nuances: for example, in Figure 11b, SE-YOLOv8n achieves the highest average recall, yet Figure 11a shows that it suffers from the lowest average precision due to over-detection in certain classes such as normal and single pods. In contrast, the SE+MSDA configuration maintains a more balanced precision-recall trade-off, especially evident in its stable defect-wise trends and superior localization performance in Figure 11d. These visual insights validate the complementary benefits of integrating channel-wise and multi-scale attention mechanisms, and highlight their effectiveness in improving both overall and category-specific detection reliability.

4. Discussion

4.1. Main Contributions

This study demonstrates that integrating lightweight attention mechanisms—Squeeze-and-Excitation (SE) and Multi-Scale Dilated Attention (MSDA)—into the YOLOv8n framework significantly improves the accuracy and robustness of fresh soybean surface defect detection in near real-time settings on GPU hardware. The SE module contributes to channel-wise feature recalibration, effectively amplifying discriminative responses for subtle, low-contrast defects such as wormholes and rust spots [15,35]. Meanwhile, MSDA leverages dilated convolutions across multiple receptive fields to extract spatial features from heterogeneous and overlapping pods, enabling precise localization under complex backgrounds [36,37]. Beyond numerical performance—where the SE+MSDA configuration achieves 95.1% mAP@50, 87.8% recall, and 1.8 ms inference latency per image on an NVIDIA RTX 3080 GPU (NVIDIA Corporation, Santa Clara, CA, USA)—our results support a broader theoretical insight: attention mechanisms not only enhance feature representation but can act as adaptive behavioral regulators in object detection systems [38]. This aligns with recent findings in agricultural AI, which suggest that attention-based models dynamically reallocate computational resources according to input saliency, balancing precision and recall under real-world uncertainty [39,40]. Although only tested on workstation hardware, the lightweight design suggests strong potential for future edge deployment under realistic production environments, which will be a key direction for future validation. In summary, the integration of SE and MSDA represents a task-oriented enhancement of YOLOv8n, effectively balancing subtle-defect sensitivity and multi-scale robustness for conveyor-belt inspection, while suggesting a practical alternative to more complex attention designs.

4.2. System-Level and Practical Implications

From a system design perspective, our work addresses the persistent trade-off between detection accuracy and computational efficiency—a central bottleneck in edge deployment for smart agriculture [12,41]. Traditional improvements in object detection often rely on model scaling or transformer-based architectures, which are ill-suited for constrained environments like conveyor belt sorting lines. Instead, the proposed attention-augmented YOLOv8n provides an interpretable, modular, and hardware-friendly alternative, achieving performance gains without increasing the backbone complexity or inference cost [13].
Moreover, the empirical generalization of the model across diverse lighting conditions and motion-induced blur illustrates the importance of robustness in deployment-oriented studies. Unlike prior works that benchmark detection models under ideal datasets (e.g., PASCAL VOC or COCO), this study contributes to a growing literature that emphasizes ecological validity and industrial applicability in food quality control [14,42]. In addition, our dataset was collected under real conveyor-belt operating conditions using an industrial camera system, rather than under laboratory settings. As a result, the images inherently contain non-ideal artifacts such as motion blur, uneven illumination, and compression effects that are unavoidable in real-world inspection lines. While such artifacts may reduce the clarity of local defect boundaries, they also make the dataset more representative of deployment scenarios. The fact that our attention-augmented YOLOv8n maintained high precision and recall under these conditions demonstrates its robustness and practical value. Future work could explicitly investigate the impact of different artifact sources (e.g., JPEG compression, sensor noise) and explore artifact-aware training strategies to further enhance reliability in industrial environments. Related studies on JPEG artifact removal [43] provide useful insights that could be leveraged in such directions.
Such robustness under non-ideal imaging conditions further distinguishes our approach from prior studies that rely on clean benchmark datasets, underscoring its readiness for practical deployment in agricultural inspection. This robustness is particularly valuable in minimizing false negatives for critical defects like contamination and mechanical damage, which bear direct implications for food safety and consumer trust [44]. Practically, accurate detection of low-frequency, high-risk defects supports automated sorting, traceability, and compliance with regulatory standards. Studies on food integrity have shown that even rare defect misclassification can lead to batch recalls or reputational damage [1]. By significantly reducing false negatives in categories like mold and foreign objects, our system adds value to digitalized agricultural supply chains and smart food processing.

4.3. Limitations and Future Directions

This research is not without limitations. The current improvements are built upon architectural refinements within the YOLO framework rather than paradigm-level innovations. Moreover, the dataset was collected from a single region and harvest batch of fresh edamame, which reflects the specific focus of this study on real-world inspection of fresh soybeans rather than on all soybean varieties. Although relatively small in scale, overfitting risks were mitigated by using pretrained YOLO backbones and independent test set validation, and the results indicate stable generalization under realistic conditions. Compared with large-scale benchmarks such as GrainSpace [45] and MANTA [46], our dataset is modest in size but tailored to realistic industrial settings, demonstrating that domain-specific data can still support robust defect detection. Future work may extend validation to other cultivars, harvest conditions, and imaging setups, but such expansion goes beyond the intended scope of this study.
Moreover, while we included YOLOv5n, YOLOv8n, and YOLOv10n as representative lightweight baselines, additional comparisons with non-YOLO detectors such as MobileNet-SSD [47] and EfficientDet-D0 [48] would further contextualize the contribution; we note this as a limitation and direction for future studies. Future research may also explore attention fusion strategies guided by task-aware priors (e.g., spectral–spatial relationships), semi-supervised training under label scarcity, and multimodal integration—particularly hyperspectral and thermal sensing—to further improve sensitivity and reduce ambiguity in defect classification [6,49]. In addition, while our current study focuses on attention-augmented CNN-based detectors, recent research demonstrates that combining detection pipelines with large language models (LLMs) can enable more flexible reasoning and open-vocabulary object recognition. For example, Jin et al. introduced DVDet, a descriptor-enhanced open-vocabulary detector that leverages LLMs as an implicit knowledge base for fine-grained region-text alignment, significantly improving open-vocabulary detection performance [50]. Exploring how attention-based YOLO variants may interact with such LLM frameworks represents a promising direction for future research.

5. Conclusions

Unlike existing approaches that focus primarily on either maximizing accuracy through structural complexity [33] or minimizing computation for edge deployment [32], this study introduces a balanced and domain-specific enhancement of YOLOv8n using SE and MSDA modules for fresh soybean defect detection. By optimizing lightweight attention without altering the model backbone, our method achieves improved accuracy with low-latency inference, offering a lightweight and deployment-friendly solution for agricultural quality inspection, with strong potential for future edge applications. This work underscores the value of task-oriented architectural refinement over generic model scaling, offering a practical reference for intelligent food grading in resource-constrained environments.

Author Contributions

Conceptualization, Z.W. and Y.H.; methodology, Z.W.; software, Z.W.; validation, Z.W., Y.Y. and Z.Z.; formal analysis, Z.W.; investigation, Z.W. and D.H.; resources, Y.Y. and Z.Z.; data curation, Z.W.; writing—original draft preparation, Z.W.; writing—review and editing, Z.W., Y.H. and Z.D.; visualization, Z.W.; supervision, Y.H. and Z.D.; project administration, Y.H. and Z.D.; funding acquisition, Y.H., Y.Y. and Z.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (Grant No. 2022YFD2002205-01).

Data Availability Statement

The data supporting this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

Zhili Wu, Yakai He, Da Huo, Zhiyou Zhu, Yanchen Yang and Zhilong Du were employed by the China National Packaging and Food Machinery Corporation and the Chinese Academy of Agricultural Mechanization Sciences Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
FPS: Frames Per Second
FLOPs: Floating Point Operations
mAP: Mean Average Precision
mAP@50: Mean Average Precision at IoU threshold 50%
mAP@50:95: Mean Average Precision averaged over IoU thresholds from 50% to 95%
MSDA: Multi-Scale Dilated Attention
SE: Squeeze-and-Excitation
YOLO: You Only Look Once
YOLOv8n: YOLO version 8 nano

References

  1. Azad, Z.R.A.A.; Ahmad, M.F.; Siddiqui, W.A. Food Spoilage and Food Contamination. In Health and Safety Aspects of Food Processing Technologies; Malik, A., Erginkaya, Z., Erten, H., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 9–28. [Google Scholar] [CrossRef]
  2. Ma, Y.; Yang, W.; Xia, Y.; Xue, W.; Wu, H.; Li, Z.; Zhang, F.; Qiu, B.; Fu, C. Properties and Applications of Intelligent Packaging Indicators for Food Spoilage. Membranes 2022, 12, 477. [Google Scholar] [CrossRef] [PubMed]
  3. Djanta, M.K.A.; Agoyi, E.E.; Agbahoungba, S.; Quenum, F.J.-B.; Chadare, F.J.; Assogbadjo, A.E.; Agbangla, C.; Sinsin, B. Vegetable soybean, edamame: Research, production, utilization and analysis of its adoption in Sub-Saharan Africa. J. Hortic. For. 2020, 12, 1–12. [Google Scholar] [CrossRef]
  4. Wszelaki, A.L.; Delwiche, J.F.; Walker, S.D.; Liggett, R.E.; Miller, S.A.; Kleinhenz, M.D. Consumer liking and descriptive analysis of six varieties of organically grown edamame-type soybean. Food Qual. Prefer. 2005, 16, 651–658. [Google Scholar] [CrossRef]
  5. Grunert, K.G. Food quality and safety: Consumer perception and demand. Eur. Rev. Agric. Econ. 2005, 32, 369–391. [Google Scholar] [CrossRef]
  6. Gao, X.; Li, S.; Qin, S.; He, Y.; Yang, Y.; Tian, Y. Hollow discrimination of edamame with pod based on hyperspectral imaging. J. Food Compos. Anal. 2025, 137, 106904. [Google Scholar] [CrossRef]
  7. Macedo, R.A.G.; Belan, P.A.; Araújo, S.A. An Embedded Computer Vision System for Beans Quality Inspection. Int. J. Comput. Appl. 2020, 175, 44–53. [Google Scholar] [CrossRef]
  8. Vithu, P.; Moses, J.A. Machine vision system for food grain quality evaluation: A review. Trends Food Sci. Technol. 2016, 56, 13–20. [Google Scholar] [CrossRef]
  9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
  10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 779–788. [Google Scholar] [CrossRef]
  11. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  12. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar] [CrossRef]
  13. Chen, D.; Lin, F.; Lu, C.; Zhuang, J.; Su, H.; Zhang, D.; He, J. YOLOv8-MDN-Tiny: A lightweight model for multi-scale disease detection of postharvest golden passion fruit. Postharvest Biol. Technol. 2025, 219, 113281. [Google Scholar] [CrossRef]
  14. Wang, Z.; Zhang, S.; Chen, Y.; Xia, Y.; Wang, H.; Jin, R.; Wang, C.; Fan, Z.; Wang, Y.; Wang, B. Detection of small foreign objects in Pu-erh sun-dried green tea: An enhanced YOLOv8 neural network model based on deep learning. Food Control 2025, 168, 110890. [Google Scholar] [CrossRef]
  15. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. Available online: https://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_paper.html (accessed on 14 July 2025).
  16. Sui, J.; Liu, L.; Wang, Z.; Yang, L. RE-YOLO: An apple picking detection algorithm fusing receptive-field attention convolution and efficient multi-scale attention. PLoS ONE 2025, 20, e0319041. [Google Scholar] [CrossRef]
  17. Zhou, Y.; Li, Z.; Xue, S.; Wu, M.; Zhu, T.; Ni, C. Lightweight SCD-YOLOv5s: The Detection of Small Defects on Passion Fruit with Improved YOLOv5s. Agriculture 2025, 15, 1111. [Google Scholar] [CrossRef]
  18. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. Available online: https://openaccess.thecvf.com/content_ECCV_2018/html/Sanghyun_Woo_Convolutional_Block_Attention_ECCV_2018_paper.html (accessed on 14 July 2025).
  19. Li, M.; Liu, M.; Zhang, W.; Guo, W.; Chen, E.; Zhang, C. A Robust Multi-Camera Vehicle Tracking Algorithm in Highway Scenarios Using Deep Learning. Appl. Sci. 2024, 14, 7071. [Google Scholar] [CrossRef]
  20. Wu, M.; Lin, H.; Shi, X.; Zhu, S.; Zheng, B. MTS-YOLO: A Multi-Task Lightweight and Efficient Model for Tomato Fruit Bunch Maturity and Stem Detection. Horticulturae 2024, 10, 1006. [Google Scholar] [CrossRef]
  21. Nguyen, X.-T.; Mac, T.-T.; Nguyen, Q.-D.; Bui, H.-A. An Industrial System for Inspecting Product Quality Based on Machine Vision and Deep Learning. Vietnam. J. Comput. Sci. 2025, 12, 193–208. [Google Scholar] [CrossRef]
  22. Kuo, C.-J.; Chen, C.-C.; Chen, T.-T.; Tsai, Z.-J.; Hung, M.-H.; Lin, Y.-C.; Chen, Y.-C.; Wang, D.-C.; Homg, G.-J.; Su, W.-T. A Labor-Efficient GAN-based Model Generation Scheme for Deep-Learning Defect Inspection among Dense Beans in Coffee Industry. In Proceedings of the 2019 IEEE 15th International Conference on Automation Science and Engineering (CASE), Vancouver, BC, Canada, 22–26 August 2019; pp. 263–270. [Google Scholar] [CrossRef]
  23. Xu, M.; Yoon, S.; Fuentes, A.; Park, D.S. A Comprehensive Survey of Image Augmentation Techniques for Deep Learning. Pattern Recognit. 2023, 137, 109347. [Google Scholar] [CrossRef]
  24. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Int. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
Figure 1. Classification of soybean surface conditions. (a) Color deterioration, (b) physical damage, (c) single-seed pod, (d) normal pod, (e) rust spots, (f) insect holes, and (g) foreign objects. Panel (h) shows the actual discharge interface of the sorting system, illustrating the physical separation of defective and normal soybean pods under real-world conditions.
Figure 2. Schematic of the real-world soybean sorting system. The setup includes high-speed image acquisition, pneumatic rejection, and multi-line sorting under complex conditions such as overlapping pods and variable lighting.
Figure 3. Overview of the real-time visual inspection system for fresh soybeans. The system consists of a vibratory feeder (1), conveyor belt (16), image acquisition system (14), upper and lower auxiliary light sources (11, 9), air blow actuator (10), and product outlets for defective (7) and acceptable soybeans (8). During operation, soybeans are transported via the conveyor, visually inspected under controlled lighting, and automatically sorted based on defect detection results.
Figure 4. Annotation process and label visualization for soybean defect detection. (a) Manual annotation interface using LabelImg (Version: 1.8.6), showing object bounding boxes and category selection. Each image has a resolution of 1024 × 192 pixels. (b) Sample visualization of annotated training data, where different colors and numerical labels indicate distinct defect classes: 0 = normal, 1 = foreign_object, 2 = damage, 3 = insect_hole, 4 = spot, 6 = color deterioration.
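For reference, LabelImg in YOLO mode saves one plain-text line per bounding box in the form class x_center y_center width height, with all coordinates normalized to [0, 1]. The following minimal Python sketch converts such a label file back to pixel coordinates; the file path, helper name, and fallback for unlisted class ids are illustrative, and the image size is taken from the caption above:

```python
# Minimal sketch: reading a YOLO-format label file as written by LabelImg.
# Class ids follow the mapping given in Figure 4; id 5 is omitted there,
# so unknown ids fall back to their numeric string.
CLASS_NAMES = {0: "normal", 1: "foreign_object", 2: "damage",
               3: "insect_hole", 4: "spot", 6: "color_deterioration"}

def load_yolo_labels(path, img_w=1024, img_h=192):
    """Return a list of (class_name, x1, y1, x2, y2) boxes in pixels."""
    boxes = []
    with open(path) as f:
        for line in f:
            cls, xc, yc, w, h = line.split()
            cls = int(cls)
            # YOLO stores box centre and size normalized to [0, 1]
            xc, w = float(xc) * img_w, float(w) * img_w
            yc, h = float(yc) * img_h, float(h) * img_h
            boxes.append((CLASS_NAMES.get(cls, str(cls)),
                          xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2))
    return boxes
```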
Figure 5. The architecture of the improved YOLOv8n model for real-time detection of soybean surface defects and foreign objects. The backbone incorporates the SE attention module after the SPPF (Spatial Pyramid Pooling—Fast) block to enhance channel-wise feature weighting. In the neck, the original PANet structure is augmented with the Multi-Scale Dilated Attention (MSDA) modules, enabling better feature aggregation across scales. The model outputs detection results at three different resolutions in the head for small, medium, and large objects.
Figure 6. Architecture of the SE (Squeeze-and-Excitation) attention module. The module first applies global average pooling to generate a channel-wise descriptor, which is then passed through two fully connected (FC) layers with ReLU and Sigmoid activations to produce channel-wise weights. These weights recalibrate the original input feature maps, enhancing important channels and suppressing less informative ones.
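The recalibration pipeline in the caption maps directly onto a short PyTorch module. The sketch below is the standard SE block with a reduction ratio of 16; the ratio is an assumption, as the paper's exact hyperparameters are not restated here:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: GAP -> FC -> ReLU -> FC -> Sigmoid -> rescale."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # squeeze bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # restore dimension
            nn.Sigmoid(),                                # channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))          # global average pooling -> (B, C)
        w = self.fc(w).view(b, c, 1, 1)
        return x * w                    # recalibrate the input feature maps
```

In the architecture of Figure 5, such a block would be applied to the SPPF output, e.g., SEBlock(256) for a 256-channel feature map (the channel count here is illustrative).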
Figure 7. The architecture of the proposed Multi-Scale Dilated Attention (MSDA) module. Input feature maps are processed through parallel convolution layers with varying kernel sizes (e.g., 3 × 3 and 5 × 5), followed by dilated convolutions that enlarge the receptive field. The resulting multi-scale feature maps are adaptively weighted via an attention mechanism and aggregated to form the output feature map.
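The published MSDA (DilateFormer-style) applies windowed self-attention with a different dilation rate per head; the sketch below is a deliberately simplified reading of the caption — parallel dilated convolution branches fused by learned per-branch attention weights — and should be treated as illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class MultiScaleDilatedAttention(nn.Module):
    """Simplified MSDA: parallel depthwise dilated convolutions fused by
    spatially varying, softmax-normalized per-branch attention weights."""
    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        # One depthwise 3x3 branch per dilation rate; padding = dilation
        # keeps the spatial size unchanged while enlarging the receptive field.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=d, dilation=d, groups=channels)
            for d in dilations
        )
        # 1x1 conv predicts one attention map per branch
        self.attn = nn.Conv2d(channels, len(dilations), kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]   # multi-scale features
        weights = torch.softmax(self.attn(x), dim=1)      # (B, S, H, W)
        fused = sum(f * weights[:, i:i + 1] for i, f in enumerate(feats))
        return self.proj(fused)                           # aggregate and project
```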
Figure 8. Training and validation curves of the enhanced YOLOv8n model over 100 epochs. The first row shows the evolution of training losses (box loss, classification loss, and distribution focal loss) and evaluation metrics (precision and recall). The second row presents validation losses and two commonly used performance indicators: mAP@50 and mAP@50:95. Solid blue lines represent raw results, while dotted orange lines indicate smoothed trends.
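For reference, the two indicators plotted above follow the standard COCO-style definitions used by YOLO implementations: per-class average precision (AP) is the area under the precision–recall curve, mAP@50 averages AP over all classes at an IoU threshold of 0.50, and mAP@50:95 further averages over ten IoU thresholds:

\[
\mathrm{mAP@50} = \frac{1}{N}\sum_{c=1}^{N} \mathrm{AP}_c^{\mathrm{IoU}=0.50},
\qquad
\mathrm{mAP@50{:}95} = \frac{1}{10}\sum_{t \in \{0.50,\,0.55,\,\ldots,\,0.95\}} \frac{1}{N}\sum_{c=1}^{N} \mathrm{AP}_c^{\mathrm{IoU}=t},
\]

where \(N\) is the number of object classes.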
Figure 9. Precision–recall curve (left) and normalized confusion matrix (right) of the SE+MSDA-enhanced YOLOv8n model on the test set. The PR curve illustrates high class-wise detection performance, while the confusion matrix reflects strong classification consistency with few misclassifications.
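The per-class curves in the PR plot are traced by sweeping the detection confidence threshold and computing, at each operating point,

\[
\mathrm{Precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{Recall} = \frac{TP}{TP + FN},
\]

where \(TP\), \(FP\), and \(FN\) denote true positives, false positives, and false negatives at the given threshold.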
Figure 10. Detection results of different YOLOv8n variants on fresh soybean surface defects. (a) Baseline YOLOv8n, (b) MSDA-YOLOv8n, (c) SE-YOLOv8n, and (d) SE+MSDA-YOLOv8n. The combined model (d) shows more accurate and robust detection, especially for small and ambiguous defects.
Figure 11. Comparative performance of YOLOv8n and its variants (SE, MSDA, and SE+MSDA) across four evaluation metrics. (a) Precision, (b) recall, (c) mAP@50, and (d) mAP@50:95.
Table 1. Comparison of different detection models on the soybean defect dataset.

Model     | Precision (%) | Recall (%) | mAP@50 | mAP@50:95 | Parameters | FLOPs
YOLOv5n   | 90.2          | 89.5       | 94.7   | 60.7      | 2.5 × 10⁶  | 7.20 × 10⁹
YOLOv8n   | 89.2          | 85.0       | 92.1   | 47.4      | 3.01 × 10⁶ | 8.01 × 10⁹
YOLOv10n  | 88.7          | 89.6       | 93.8   | 63.7      | 2.7 × 10⁶  | 8.5 × 10⁹
Proposed  | 94.2          | 87.8       | 95.1   | 50.1      | 2.66 × 10⁶ | 7.4 × 10⁹
Table 2. Ablation results of SE and MSDA modules on the unified dataset.

Model               | Precision (%) | Recall (%) | mAP@50 | mAP@50:95 | Parameters | FLOPs     | Inference Time (ms)
YOLOv8n             | 89.2          | 85.0       | 92.1   | 47.4      | 3.01 × 10⁶ | 8.1 × 10⁹ | 1.6
SE                  | 85.7          | 90.8       | 94.0   | 48.7      | 3.02 × 10⁶ | 8.0 × 10⁹ | 1.8
MSDA                | 88.6          | 86.0       | 91.8   | 46.5      | 2.65 × 10⁶ | 7.3 × 10⁹ | 1.9
SE+MSDA (Proposed)  | 94.2          | 87.8       | 95.1   | 50.1      | 2.66 × 10⁶ | 7.4 × 10⁹ | 1.8
Table 3. Detection performance of different attention mechanisms across fresh soybean defect categories. For each defect category, the first row reports precision/recall (%) and the second row reports mAP@50/mAP@50:95 (%).

Defect Category | Metric           | YOLOv8n   | SE-YOLOv8n | MSDA-YOLOv8n | SE+MSDA-YOLOv8n
Spot            | Precision/Recall | 86.0/96.1 | 83.0/96.9  | 94.6/90.6    | 94.8/94.5
                | mAP@50/@50:95    | 93.0/53.2 | 95.8/51.2  | 95.6/53.4    | 96.2/54.0
Foreign Object  | Precision/Recall | 92.1/85.8 | 87.3/88.2  | 89.5/85.3    | 95.1/86.2
                | mAP@50/@50:95    | 91.7/38.2 | 89.9/39.5  | 87.8/37.9    | 91.8/39.9
Damage          | Precision/Recall | 91.7/90.2 | 81.7/93.3  | 86.6/93.0    | 93.4/87.2
                | mAP@50/@50:95    | 95.5/47.6 | 93.8/46.1  | 92.8/45.0    | 95.7/48.2
Single Pod      | Precision/Recall | 93.6/92.6 | 85.2/95.8  | 88.1/92.6    | 92.9/95.8
                | mAP@50/@50:95    | 96.8/54.2 | 95.4/52.3  | 94.4/48.3    | 96.7/53.3
Overripe        | Precision/Recall | 89.5/88.6 | 85.6/90.9  | 91.4/83.0    | 92.1/90.9
                | mAP@50/@50:95    | 92.4/48.4 | 93.3/51.4  | 93.7/49.9    | 94.1/48.9
Wormhole        | Precision/Recall | 84.6/51.9 | 99.2/73.1  | 82.9/73.1    | 99.9/63.5
                | mAP@50/@50:95    | 81.7/40.6 | 94.0/49.2  | 87.6/46.1    | 94.2/54.0
Normal          | Precision/Recall | 87.2/89.7 | 77.9/97.4  | 87.1/84.6    | 91.1/96.2
                | mAP@50/@50:95    | 93.4/49.9 | 95.6/51.2  | 90.8/44.8    | 97.0/52.2