Article

MDB-YOLO: A Lightweight, Multi-Dimensional Bionic YOLO for Real-Time Detection of Incomplete Taro Peeling

1 School of Computer, Guangdong University of Science and Technology, Dongguan 523083, China
2 Institute of Data Intelligence, Guangdong University of Science and Technology, Dongguan 523083, China
3 Guangdong AIoT Application Innovation Joint Laboratory, Guangdong University of Science and Technology, Dongguan 523083, China
4 College of Physics, Changchun University of Science and Technology, Changchun 130022, China
5 School of Aviation, Beijing Institute of Technology, Zhuhai 519085, China
6 Faculty of Innovation Engineering, Macau University of Science and Technology, Macau 999078, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(1), 97; https://doi.org/10.3390/electronics15010097
Submission received: 8 November 2025 / Revised: 20 December 2025 / Accepted: 23 December 2025 / Published: 24 December 2025
(This article belongs to the Special Issue Advancements in Edge and Cloud Computing for Industrial IoT)

Abstract

The automation of quality control in agricultural food processing, particularly the detection of incomplete peeling in taro, constitutes a critical frontier for ensuring food safety and optimizing production efficiency in the Industry 4.0 era. However, this domain is fraught with significant technical challenges, primarily stemming from the inherent visual characteristics of residual peel: extremely minute scales relative to the vegetable body, highly irregular morphological variations, and the dense occlusion of objects on industrial conveyor belts. To address these persistent impediments, this study introduces a comprehensive solution comprising a specialized dataset and a novel detection architecture. We established the Taro Peel Industrial Dataset (TPID), a rigorously annotated collection of 18,341 high-density instances reflecting real-world production conditions. Building upon this foundation, we propose MDB-YOLO, a lightweight, multi-dimensional bionic detection model evolved from the YOLOv8s architecture. The MDB-YOLO framework integrates a synergistic set of innovations designed to resolve specific detection bottlenecks. To mitigate the conflict between background texture interference and tiny target detection, we integrated the C2f_EMA module with a Wise-IoU (WIoU) loss function, a combination that significantly enhances feature response to low-contrast residues while reducing the penalty on low-quality anchor boxes through a dynamic non-monotonic focusing mechanism. To effectively manage irregular peel shapes, a dynamic feature processing chain was constructed utilizing DySample for morphology-aware upsampling, BiFPN_Concat2 for weighted multi-scale fusion, and ODConv2d for geometric preservation. Furthermore, to address the issue of missed detections caused by dense occlusion in industrial stacking scenarios, Soft-NMS was implemented to replace traditional greedy suppression mechanisms. Experimental validation demonstrates the superiority of the proposed framework. MDB-YOLO achieves a mean Average Precision (mAP50-95) of 69.7% and a Recall of 88.0%, significantly outperforming the baseline YOLOv8s and advanced transformer-based models like RT-DETR-L. Crucially, the model maintains high operational efficiency, achieving an inference speed of 1.1 ms on an NVIDIA A100 and reaching 27 FPS on an NVIDIA Jetson Xavier NX using INT8 quantization. These findings confirm that MDB-YOLO provides a robust, high-precision, and cost-effective solution for real-time quality control in agricultural food processing, marking a significant advancement in the application of computer vision to complex biological targets.

1. Introduction

1.1. Industrial Context and Motivation

In the contemporary landscape of agricultural product processing, the transition towards fully automated intelligent manufacturing is not merely a trend but a strategic necessity driven by the escalating demands for scale, consistency, and hygiene standards. Taro, a staple root vegetable consumed globally for its nutritional value and versatility, undergoes a rigorous processing workflow before it is suitable for deep processing into derivative products such as taro flour, chips, or frozen cubes [1]. A crucial and notoriously challenging phase in this workflow is the peeling procedure. Although industrial mechanical peeling equipment undertakes the majority of this task, it is not without flaws. Owing to the diverse sizes and shapes of raw taro, occurrences of incomplete peeling, in which residual skin persists firmly adhered to the flesh, are common and unavoidable [2,3,4,5], as shown in Figure 1.
The presence of residual taro peel, particularly small or extremely small fragments, poses a severe threat to downstream product quality. If left undetected, these residues can contaminate the final product, leading to a palpable decline in quality, consumer complaints, and potentially irreversible economic losses and reputational damage for the factory [6]. The current industry standard for mitigating this risk relies heavily on manual inspection. In this setup, workers are stationed along conveyor belts to visually observe, identify, and manually sort taro tubers with residual peels, returning them to the peeling machine for a second processing cycle.
However, this reliance on manual labor introduces a host of systemic inefficiencies and error modalities. Firstly, the visual distinction between residual peel and the taro flesh can be exceedingly subtle. After the mechanical peeling process, the surface of the exposed taro flesh often exhibits small areas of slight oxidation or natural discoloration, which can be easily mistaken for residual peel by fatigued workers. This leads to a high rate of false positives, where perfectly good raw material is rejected and re-processed unnecessarily, incurring waste. Conversely, and perhaps more critically, varying subjective standards among different quality control personnel can lead to inconsistent sorting criteria. As visual fatigue sets in over long shifts, the likelihood of missed detections (false negatives) increases, allowing contaminated taro to pass through to the deep-processing stages. The entire manual inspection process is labor-intensive, consuming significant human and material resources while failing to guarantee the zero-defect output required by modern food safety standards.
Therefore, addressing the actual needs of taro deep-processing facilities requires a paradigm shift towards automated intelligent detection. This study explores the feasibility of employing advanced object detection methodologies [7,8,9,10,11,12,13,14,15], specifically Deep Learning-based Computer Vision, to automatically and objectively identify incompletely peeled taro [16,17]. The primary objective is to design and implement a quantitative standard for detection, thereby reducing labor costs, enhancing processing efficiency, and minimizing material waste. Given the practical constraints of factory deployment, such as hardware budget limitations and the requirement for real-time processing speeds to match conveyor throughput, the solution must be anchored in a low-cost, lightweight, yet high-precision algorithm. This necessitates a move beyond generic “black box” models towards architectures specifically tailored to the physical characteristics of the agricultural target.

1.2. Problem Analysis and Data Characteristics

To develop an effective computer vision solution, one must first rigorously define the problem space and the specific nature of the data. The target object in this study is explicitly defined as “residual taro peel adhering to the surface of peeled taro.” This definition arises from the specific production flow: raw taro is cleaned, mechanically peeled, and then transported on a conveyor belt. The detection system must operate at this transport stage, where fully peeled and incompletely peeled taro are mixed in a dynamic stream.
The visual data presents three distinct and formidable challenges that render standard detection models (such as the vanilla YOLO series) inadequate:

1.2.1. Challenge 1: Tiny Targets and Texture Interference

The residual peel fragments are often extremely small, frequently occupying a minute fraction of the total image pixels (often less than 32 × 32 pixels, qualifying them as “small objects” in the COCO definition, or even “tiny objects”). Furthermore, the visual contrast between the brownish peel and the off-white or cream-colored taro flesh can be low, especially under industrial lighting. Complicating this further is the texture of the taro itself, which may feature shadows caused by surface undulations, oxidation spots, or remaining soil particles that mimic the appearance of peel. A standard Convolutional Neural Network (CNN) can easily confuse these background textures with the target, leading to false positives or missed detections. The model needs a mechanism to attend specifically to the fine-grained textural signatures of the peel while suppressing the noise of the taro flesh.

1.2.2. Challenge 2: Irregular Morphology

Unlike manufactured objects with rigid, predictable geometries (e.g., cars, license plates, boxes), biological residues possess highly irregular and amorphous shapes. A piece of peel might manifest as a slender strip, a curved crescent, a tiny jagged dot, or an irregular blotch. Traditional convolution operations, which typically utilize fixed square kernels (e.g., 3 × 3), struggle to adapt to these fluid boundaries. When a square kernel is applied to a thin, curved strip of peel, the resulting feature map often includes a significant amount of background information (taro flesh), diluting the feature signal. Alternatively, the geometric mismatch can cause the edges of the peel to be “chopped” or blurred during downsampling, leading to a loss of morphological integrity.

1.2.3. Challenge 3: Dense Occlusion and Stacking

On a high-throughput production line, maximizing efficiency often means maximizing the density of product on the belt. Consequently, taro tubers are often densely stacked, touching, or partially occluding one another. When two adjacent taros both possess residual peel, the bounding boxes for these defects may be in extremely close proximity or significantly overlapping. Standard post-processing algorithms like Non-Maximum Suppression (NMS) are designed to suppress overlapping boxes to avoid duplicate detections of the same object. In this dense scenario, however, “hard” NMS is prone to erroneously suppressing valid detections of separate defects that happen to be close to each other, leading to a significant drop in recall. This phenomenon is analogous to the “crowd counting” problem in pedestrian detection but applied here to biological defects [18].
Addressing these challenges requires a departure from generic detection architectures. This study proposes MDB-YOLO, a model that integrates “bionic” principles—mimicking the adaptability and focus of biological vision—through dynamic convolutions and attention mechanisms. By systematically reconstructing the feature extraction and processing pipeline, MDB-YOLO aims to provide a robust solution tailored to the nuances of the taro peeling detection task.

1.3. Contributions

To overcome the limitations of generic object detection algorithms in the complex environment of root vegetable processing, this study presents a systematic solution encompassing dataset construction, bionic architecture design, and edge deployment. The primary contributions are summarized as follows:
(1)
Construction of the Taro Peel Industrial Dataset (TPID): We established the first high-resolution, expert-annotated benchmark specifically for taro peeling defects. Comprising 18,341 densely annotated instances across 1056 images, the TPID incorporates a “Human-in-the-Loop” annotation protocol to ensure ground truth fidelity. It explicitly models industrial variables—including motion blur, illumination fluctuations, and object rotation—to rigorously map the stochastic distribution of real-world production environments.
(2)
Proposal of the MDB-YOLO Bionic Architecture: We designed a novel, lightweight detection framework that integrates three “bionic” attention mechanisms to resolve specific physical bottlenecks:
  • Texture-Scale Adaptation: The integration of the C2f_EMA module with Wise-IoU (WIoU) employs efficient multi-scale attention to amplify the feature response of minute, low-contrast residues while preventing gradient contamination from low-quality examples.
  • Geometric Reconstruction: A dynamic feature processing chain utilizing DySample for morphology-aware upsampling and ODConv2d for adaptive feature extraction allows the network to dynamically adjust its sampling field to fit the amorphous boundaries of peel fragments.
  • Occlusion Management: The implementation of Soft-NMS with Gaussian decay effectively mitigates the recall drop caused by dense stacking on conveyor belts.
(3)
Validation of High-Performance Edge Deployment: We demonstrated the practical viability of the model on an NVIDIA Jetson Xavier NX (NVIDIA Corporation, Santa Clara, CA, USA) embedded platform. MDB-YOLO achieves a state-of-the-art mAP50-95 of 69.7%, significantly outperforming the baseline YOLOv8s and heavy transformer-based models (RT-DETR-L). Crucially, it maintains a real-time inference speed of 27 FPS (INT8 quantization), satisfying the strict throughput requirements of industrial manufacturing.

1.4. Organization of the Paper

The remainder of this article is structured as follows. Section 2 delineates the materials and methods, specifically detailing the acquisition and preprocessing of the Taro Peel Industrial Dataset (TPID) and the architectural innovations of the MDB-YOLO framework. Section 3 presents the experimental validation, including the experimental setup, ablation studies, comparative analysis against state-of-the-art models, and performance benchmarks on edge computing hardware. Section 4 offers a critical discussion of the results, interpreting the efficacy of the bionic mechanisms and acknowledging current limitations. Finally, Section 5 summarizes the contributions and proposes avenues for future investigation.

2. Methodology

Having defined the theoretical challenges and the architectural philosophy, we first establish the data foundation on which the model is constructed. In data-driven deep learning, the quality and representativeness of the dataset largely determine the upper limit of model performance. Therefore, before elaborating on the mechanisms of the MDB-YOLO architecture, we describe the construction of the Taro Peel Industrial Dataset (TPID) and show how physical production variables were transformed into data attributes, as depicted in Figure 2.

2.1. Taro Peel Industrial Dataset (TPID) Construction

Data is the cornerstone of any deep learning initiative. In the absence of a publicly available dataset for taro peeling defects—a niche but economically significant domain—this study constructed the Taro Peel Industrial Dataset (TPID). The construction process was meticulously designed to ensure the data reflects the complexity and variability of the actual deployment environment, adhering to the principle that “data quality determines the upper bound of model performance”. As illustrated in Figure 3, the dataset construction involved a systematic pipeline from raw acquisition to final augmentation.
Figure 3a details the rigorous workflow, emphasizing the “Human-in-the-Loop” annotation strategy employed to ensure ground truth fidelity. Figure 3b,c reveal a critical characteristic of our dataset: the high instance density. While the total number of images (1056) might appear modest, the sheer volume of annotated instances (over 18,000) provides a rich, granular supervisory signal, which is essential for training the model to recognize tiny, scattered defects.

2.1.1. Data Acquisition Infrastructure

To ensure high fidelity between training data and deployment conditions, image acquisition was conducted directly on an operating taro deep-processing production line. The imaging setup featured an industrial-grade web camera with a resolution of 1920 × 1080 pixels. This camera was mounted approximately 50 cm directly above the conveyor belt in a top-down configuration (Bird’s Eye View), a vantage point selected to maximize the visibility of surface details and minimize perspective distortion that could obscure defects on the sides of the tubers.
Lighting plays a pivotal role in machine vision. To secure consistent image quality and mitigate the adverse effects of ambient factory lighting, we employed a custom lighting system using industrial LED strips. These were configured to generate a high-intensity environment with illuminance maintained between 750–1000 lux. Crucially, high Color Rendering Index (CRI > 85) LEDs were selected to simulate a neutral daylight condition with a color temperature of 5000 K. This spectral specification is vital for ensuring accurate color reproduction, which is the primary cue for distinguishing the brownish peel from the cream-colored flesh. This setup not only provided sufficient photon flux for short exposure times (reducing motion blur) but also replicated the challenging lighting dynamics of an industrial floor, including specular reflections from moist taro surfaces and deep shadows cast by the irregular 3D shapes of the tubers.

2.1.2. Data Preprocessing and Augmentation Strategy

The initial data collection yielded 282 raw high-resolution images. Given the “specific single-class object” nature of the task, the density of information was remarkably high; each image contained an average of approximately 17 annotated instances of residual peel. While the raw image count might appear low compared to general datasets like COCO, the dataset contained 18,341 individual target instances. In the context of deep learning for object detection, the effective sample size is often better measured by the number of object instances (bounding boxes) than by the number of image frames. High-density datasets, such as the VisDrone benchmark, have demonstrated that instance density can compensate for lower frame counts by providing abundant positive samples for the regression heads.
However, to ensure robust generalization and prevent overfitting to the specific conditions of the collection day, we employed a rigorous data augmentation strategy using the Albumentations (Version 2.0.7) library. This expanded the dataset to 1056 images. The augmentation techniques were not chosen arbitrarily but were selected to physically model specific variations encountered in the factory (a configuration sketch follows this list):
  • Photometric Distortion (Addressing Lighting Variability): The factory environment is subject to lighting fluctuations throughout the day and across different seasons. To prepare the model for this, we applied random brightness and contrast adjustments (p = 0.6) and RGB channel shifting (p = 0.4). Additionally, we utilized Contrast Limited Adaptive Histogram Equalization (CLAHE) (p = 0.2). Unlike global histogram equalization, which can amplify noise in the uniform background regions, CLAHE operates on small tiles and limits the contrast amplification. This is particularly effective for enhancing the local contrast between the peel and the flesh without overexposing the white taro body, thereby highlighting minute textural differences.
  • Geometric Transformations (Addressing Pose Variability): Taro tubers are roughly spherical or ellipsoidal and roll randomly on the conveyor belt. Consequently, there is no fixed orientation for the defects; “up” and “down” are relative. To force the model to learn rotation-invariant features, we applied horizontal and vertical flips (p = 0.5), random 90-degree rotations (p = 0.7), and affine transformations including translation, scaling, and slight rotation (p = 0.5). This simulates the chaotic positioning of the taro as it moves down the line.
  • Motion Blur Simulation (Addressing Conveyor Dynamics): Despite the use of industrial cameras with fast shutters, the relative motion of the conveyor belt (often moving at speeds > 0.5 m/s) can introduce slight motion blur. We applied Gaussian blur (p = 0.1) to a subset of training images. This forces the model to learn to “find edges in the blur,” enhancing its robustness to speed variations and mechanical vibrations inherent in the machinery.
  • Mosaic Augmentation: We utilized Mosaic augmentation, which stitches four training images into a single composite. This technique is particularly valuable for small object detection as it significantly increases the number of objects per training batch and varies the background context. However, it is worth noting that while helpful for the initial training phases, this was strategically disabled in the final fine-tuning epochs to align with the real-world data distribution where images are single frames, not composites.
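To make the photometric, geometric, and blur settings above concrete, the following is a minimal Albumentations sketch. Only the per-transform probabilities come from the text; the specific parameter ranges, bounding-box settings, and file handling are illustrative assumptions rather than the authors' exact configuration, and Mosaic is applied by the training framework rather than by this pipeline.

```python
import albumentations as A

# Sketch of the described augmentation pipeline (parameter ranges are assumptions).
train_transform = A.Compose(
    [
        # Photometric distortion: lighting variability on the factory floor
        A.RandomBrightnessContrast(p=0.6),
        A.RGBShift(p=0.4),
        A.CLAHE(p=0.2),
        # Geometric transformations: random pose of taro on the belt
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.RandomRotate90(p=0.7),
        A.Affine(translate_percent=0.05, scale=(0.9, 1.1), rotate=(-10, 10), p=0.5),
        # Conveyor dynamics: slight motion blur
        A.GaussianBlur(p=0.1),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: augmented = train_transform(image=image, bboxes=bboxes, class_labels=labels)
```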

2.1.3. Annotation and Dataset Splitting

Annotation was carried out by adopting a “Human-in-the-Loop” semi-automated approach to strike a balance between efficiency and strict accuracy, utilizing the X-Anylabeling tool (Version 3.2.3). The process began with a manual annotation of a seed set (100–200 images), cross-validated by multiple experts to establish a “Gold Standard.” A preliminary YOLOv8 model was then trained on this seed set to pre-annotate the remaining raw data. Finally, every image underwent a strict manual review by trained annotators to correct false positives and identify missed detections. This iterative process ensured a high-quality ground truth for all 18,341 instances. The dataset was split into Training (739 images, 13,071 instances), Validation (212 images, 3225 instances), and Testing (105 images, 2045 instances) sets using a fixed random seed (7:2:1 ratio). This split ensures that the evaluation on the test set—containing over 2000 real-world targets—is statistically significant and unbiased, providing a reliable proxy for production performance.
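For reproducibility, the split procedure can be expressed as a short routine. This is a sketch only: the paper reports a fixed random seed and a 7:2:1 ratio, but the seed value, file extension, and on-disk layout below are hypothetical.

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, seed: int = 7, ratios=(0.7, 0.2, 0.1)):
    """Reproducible 7:2:1 train/val/test split (seed value is assumed, not reported)."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)                 # deterministic shuffle
    n = len(images)
    n_train, n_val = round(ratios[0] * n), round(ratios[1] * n)
    return (images[:n_train],
            images[n_train:n_train + n_val],
            images[n_train + n_val:])
```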

2.2. The MDB-YOLO Architecture

The proposed MDB-YOLO model is a systematic reconstruction of the YOLOv8s baseline. It is not merely a collection of add-on modules but a cohesive architecture where each modification targets a specific physical challenge identified in the problem analysis, as depicted in Figure 4. The architecture is designed to be “Multi-Dimensional” in its adaptability (spatial, channel, and scale) and “Bionic” in its mechanism (mimicking biological attention and focus).

2.2.1. Addressing Challenge 1: The C2f_EMA Module and WIoU

The first major challenge is the detection of tiny, low-contrast peel residues that are easily confused with surface shadows. Standard CNN architectures tend to apply filters uniformly across the image, which is inefficient when the target signal is weak and localized. To address this, we evaluated several mainstream attention mechanisms to identify the most effective solution for our specific data characteristics. Table 1 presents a comparative analysis of these mechanisms integrated into the baseline YOLOv8s.
As evident from Table 1, while all attention mechanisms improved performance over the baseline, the Efficient Multi-Scale Attention (EMA) module demonstrated the most significant gains, particularly in the stringent mAP50-95 metric (+7.92%). This empirical evidence guided our decision to integrate EMA into the backbone. Figure 5 visualizes the structural differences between these candidate mechanisms, highlighting why EMA’s design is superior for this task [19].
To deepen the understanding of this choice, Figure 6 details the internal architecture of the EMA module and its precise integration point within the C2f block. The C2f_EMA module operates by reshaping the feature extraction process to be more content-aware. Unlike simpler attention mechanisms like Squeeze-and-Excitation (SE), which primarily focus on global channel descriptors (often losing spatial specifics), or Coordinate Attention (CA), which encodes position but can be computationally heavier, EMA employs a “Cross-Spatial Learning” strategy. It utilizes parallel paths: a 1 × 1 convolution branch for modeling local cross-channel interactions and a 3 × 3 convolution branch for capturing broader spatial context. Crucially, it aggregates pixel-level relationships without downsampling the spatial dimensions, building global context-aware attention maps. This allows the network to amplify the feature response of the peel texture while suppressing the background noise of the taro flesh. By inserting this module immediately after the initial convolution in the C2f block (before the bottleneck layers), we ensure that the network focuses on salient regions early in the feature extraction hierarchy, preventing the loss of tiny target signals in deep layers.
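The description above follows the publicly released EMA design. The sketch below is a condensed PyTorch rendition of that reference structure, intended to illustrate the parallel 1 × 1 and 3 × 3 branches and the cross-spatial weighting; the group count and the exact placement inside the C2f block in MDB-YOLO are assumptions, not the authors' exact module.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Efficient Multi-Scale Attention (condensed sketch of the published EMA design).

    In C2f_EMA this block would sit immediately after the initial convolution of the
    C2f module; spatial resolution is never downsampled inside the block.
    """
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.g = groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average over width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average over height
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.gn = nn.GroupNorm(channels // groups, channels // groups)
        self.conv1 = nn.Conv2d(channels // groups, channels // groups, 1)
        self.conv3 = nn.Conv2d(channels // groups, channels // groups, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        xg = x.reshape(b * self.g, c // self.g, h, w)
        # 1x1 branch: coordinate-style pooling, shared 1x1 conv, sigmoid gating
        xh = self.pool_h(xg)                                   # (bg, c/g, h, 1)
        xw = self.pool_w(xg).permute(0, 1, 3, 2)               # (bg, c/g, w, 1)
        hw = self.conv1(torch.cat([xh, xw], dim=2))
        xh, xw = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(xg * xh.sigmoid() * xw.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: broader spatial context at full resolution
        x2 = self.conv3(xg)
        # cross-spatial learning: each branch's global descriptor re-weights the other
        a1 = torch.softmax(self.gap(x1).reshape(b * self.g, 1, -1), dim=-1)
        a2 = torch.softmax(self.gap(x2).reshape(b * self.g, 1, -1), dim=-1)
        w1 = torch.matmul(a1, x2.reshape(b * self.g, c // self.g, -1))   # (bg, 1, h*w)
        w2 = torch.matmul(a2, x1.reshape(b * self.g, c // self.g, -1))   # (bg, 1, h*w)
        attn = (w1 + w2).reshape(b * self.g, 1, h, w)
        return (xg * attn.sigmoid()).reshape(b, c, h, w)
```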
Complementing this feature enhancement is the Wise-IoU (WIoU) loss function [25]. In a dataset dominated by background (negative samples) and containing fuzzy-bordered targets, standard IoU losses (like CIoU) can be detrimental [26,27,28]. CIoU considers overlap, distance, and aspect ratio, but it treats all samples with equal “rigor.” In our case, some peel fragments have amorphous, blurry boundaries that make defining a “perfect” ground truth box difficult. Standard losses impose high penalties on these “low-quality” anchor boxes, causing the model to oscillate in an attempt to fit impossible boundaries. WIoU introduces a dynamic non-monotonic focusing mechanism. It calculates an “outlier degree” (β) for each anchor box. The loss is weighted by a focusing coefficient (r) derived from β:
$r = \dfrac{\beta}{\delta\,\alpha^{\beta-\delta}}$,
For boxes with a high outlier degree (indicating a poor match or a difficult/ambiguous sample), the mechanism reduces the focusing coefficient (r). This effectively down-weights the gradient contribution from these extreme outliers, preventing them from destabilizing the training. Conversely, it allows the model to focus its learning capacity on samples where it can meaningfully improve (“ordinary quality” samples), rather than wasting capacity on ambiguous edge cases. This synergy of C2f_EMA (better feature representation) and WIoU (smarter regression penalty) significantly improves the detection of tiny, indistinct targets.
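The focusing coefficient is straightforward to compute once the outlier degree β is defined as the per-anchor IoU loss relative to its running mean. The sketch below illustrates this; α and δ are hyperparameters of the non-monotonic focusing curve, and the values 1.9 and 3 are the defaults suggested in the original WIoU paper, assumed here rather than taken from this study.

```python
import torch

def wiou_focusing_coefficient(iou_loss: torch.Tensor,
                              running_mean_loss: torch.Tensor,
                              alpha: float = 1.9,
                              delta: float = 3.0) -> torch.Tensor:
    """Dynamic non-monotonic focusing coefficient r of WIoU (sketch)."""
    beta = iou_loss.detach() / running_mean_loss      # outlier degree; no gradient through beta
    r = beta / (delta * alpha ** (beta - delta))      # down-weights extreme outliers
    return r

# The weighted regression loss for each anchor box is then r multiplied by the base WIoU term.
```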

2.2.2. Addressing Challenge 2: The Dynamic Feature Processing Chain

The second challenge is the extreme irregularity of the peel shapes. To ensure the network can accurately represent these forms, we constructed a “Reconstruct-Fuse-Preserve” processing chain within the Feature Pyramid Network (Neck).
(1)
Morphology Reconstruction (DySample): Standard upsampling (Nearest Neighbor Interpolation) is static; it simply duplicates pixels to increase resolution. For a curved, thin piece of peel, this results in a jagged, blocky edge that loses the original shape information—a phenomenon known as aliasing. We replaced this with DySample, a dynamic upsampling module [29]. We experimentally validated this choice against other upsampling methods, as detailed in Table 2.
Table 2 shows that while alternatives such as CARAFE (ICCV 2019, v3 release) [30] offered only minor gains, DySample achieved the best balance between precision and recall, raising mAP50-95 by more than 2.5% over the default setting. Figure 7 visually contrasts these upsampling strategies, and Figure 8 details the internal mechanism of DySample. DySample does not rely on a fixed grid; instead, it predicts a set of offsets for each pixel position based on the content of the input feature map, allowing it to “sample” points along the semantic boundaries of the object. If the peel is curved, DySample moves its sampling points to follow that curve, so the upsampled feature map retains smooth, accurate edges. This point-sampling approach effectively reconstructs the morphology of irregular defects during the upsampling phase (a code sketch of this mechanism, and of the weighted fusion described next, follows this list).
(2)
Weighted Fusion (BiFPN_Concat2): Merging features from different scales is critical for detecting objects of varying sizes. The standard YOLOv8 uses a simplistic addition (summation) fusion [31,32]. However, this implicitly assumes that deep semantic features (from P5) and shallow textural features (from P3) are equally important. For our task, the shallow texture information is often more critical for identifying the peel than the abstract semantics. We adopted a modified BiFPN_Concat2 structure. This module utilizes learnable weights ($w_i$) to balance the contribution of different feature inputs: $O = \sum_{i}\dfrac{w_i}{\epsilon + \sum_{j} w_j}\,I_i$. It allows the network to automatically assign higher importance to the shallow layers when processing tiny texture-heavy targets. Furthermore, we use Concatenation rather than addition, preserving the full dimensionality of the features for subsequent processing. Figure 9 contrasts our weighted concatenation approach with the standard BiFPN design. This ensures that the subtle texture signals of tiny residues are not “washed out” by the stronger signals of larger objects [33,34].
(3)
Geometric Preservation (ODConv2d): Finally, in the downsampling path, we employ Omni-Dimensional Dynamic Convolution (ODConv2d) [35]. Standard static convolutions use a single kernel weight matrix for all inputs. ODConv2d, however, learns a multi-dimensional attention mechanism that dynamically modulates the convolution kernel across four dimensions: the spatial kernel size (αs), the input channels (αc), the output channels (αo), and the kernel number (αw). This means the convolution filter itself changes shape and emphasis based on the input. The superiority of ODConv over other dynamic convolution methods for our task is demonstrated in Table 3.
Table 3 confirms that ODConv outperforms other dynamic variants like DSConv and PConv, yielding the highest Recall (0.8326), which is critical for minimizing missed detections. Figure 10 visually summarizes the different convolution mechanisms compared. If the input contains a long, thin strip of peel, the kernel adapts its spatial weights to fit that strip. This adaptability prevents the geometric distortion that often occurs when rigid square kernels are applied to fluid, organic shapes, ensuring that the feature extraction process preserves the integrity of the irregular peel morphology [40]. A detailed breakdown of the ODConv2d architecture is illustrated in Figure 11.
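To make the first two links of this chain concrete, the sketch below gives simplified PyTorch renditions of a DySample-style content-aware upsampler and a BiFPN_Concat2-style weighted concatenation. The offset scaling, the absence of grouping, and the weight initialisation are illustrative assumptions rather than the authors' exact implementation, and ODConv2d is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicUpsample(nn.Module):
    """Content-aware 2x upsampling in the spirit of DySample (simplified sketch)."""
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # predict an (x, y) offset for every output position
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        offsets = 0.25 * self.offset(x)                    # keep offsets small, (b, 2*s*s, h, w)
        offsets = F.pixel_shuffle(offsets, self.scale)     # (b, 2, h*s, w*s)
        # base sampling grid in normalised [-1, 1] coordinates
        ys = torch.linspace(-1, 1, h * self.scale, device=x.device)
        xs = torch.linspace(-1, 1, w * self.scale, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        # shift the grid by the predicted, content-dependent offsets
        norm = x.new_tensor([2.0 / max(w - 1, 1), 2.0 / max(h - 1, 1)])
        grid = grid + offsets.permute(0, 2, 3, 1) * norm
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)


class WeightedConcat(nn.Module):
    """BiFPN_Concat2-style fusion: learnable non-negative weights, then channel concat."""
    def __init__(self, n_inputs: int = 2, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, feats):                              # feats: maps with equal H x W
        w = F.relu(self.w)                                 # keep weights non-negative
        w = w / (w.sum() + self.eps)                       # fast normalised fusion
        return torch.cat([wi * fi for wi, fi in zip(w, feats)], dim=1)
```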

2.2.3. Addressing Challenge 3: Soft-NMS for Dense Occlusion

The final challenge is the dense stacking of taro on the conveyor belt. Traditional “Hard” NMS is a greedy algorithm: if two boxes overlap significantly (IoU > threshold), the one with the lower score is deleted. In our case, two adjacent taros might both have valid defects that overlap in the image space. Hard NMS would incorrectly suppress one of them, leading to a missed detection (False Negative).
We implemented Soft-NMS to resolve this [41]. Instead of a binary “keep or delete” decision, Soft-NMS applies a continuous Gaussian decay function to the scores of overlapping boxes.
$s_i = s_i \, e^{-\frac{\mathrm{IoU}(M,\,b_i)^2}{\sigma}}$,
where si is the score of the box and M is the box with the maximum score. If a box overlaps heavily with M, its score is reduced, but not necessarily to zero. Since our improved front-end (C2f_EMA + ODConv2d) generates high-confidence predictions for valid targets, even the decayed score of a valid “neighbor” box often remains above the detection threshold (Conf > 0.001 during NMS, filtered later). This allows the system to retain valid detections in dense clusters, significantly boosting recall in crowded scenarios. Figure 12 illustrates the algorithmic difference between the standard NMS and our adopted Soft-NMS.
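The Gaussian-decay variant of Soft-NMS can be implemented in a few lines, as sketched below. The σ value and the final score threshold are illustrative; the paper reports keeping candidates above a confidence of 0.001 during NMS but does not specify σ.

```python
import torch
from torchvision.ops import box_iou

def soft_nms_gaussian(boxes: torch.Tensor, scores: torch.Tensor,
                      sigma: float = 0.5, score_thresh: float = 0.001):
    """Soft-NMS with Gaussian score decay (sketch). boxes: (N, 4) xyxy, scores: (N,)."""
    boxes, scores = boxes.clone(), scores.clone()
    kept_boxes, kept_scores = [], []
    while scores.numel() > 0:
        top = scores.argmax()
        m_box, m_score = boxes[top], scores[top]
        kept_boxes.append(m_box)
        kept_scores.append(m_score)
        # remove the selected box from the working set
        mask = torch.ones_like(scores, dtype=torch.bool)
        mask[top] = False
        boxes, scores = boxes[mask], scores[mask]
        if scores.numel() == 0:
            break
        ious = box_iou(m_box.unsqueeze(0), boxes).squeeze(0)
        scores = scores * torch.exp(-(ious ** 2) / sigma)   # decay scores instead of deleting boxes
        keep = scores > score_thresh
        boxes, scores = boxes[keep], scores[keep]
    return torch.stack(kept_boxes), torch.stack(kept_scores)
```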
To synthesize these architectural innovations into a cohesive framework, Figure 13 provides a complete schematic of the MDB-YOLO model, illustrating the interconnected components and the flow of information through the network.

2.2.4. Model Layer Configuration and Complexity Analysis

The specific configuration of MDB-YOLO, detailing the placement and parameters of each module, is summarized in Table 4. It confirms that despite the addition of advanced modules, the model remains lightweight (13.44 M parameters), making it suitable for deployment.

2.3. Implementation Details and Training Configuration

Table 5 presents the comprehensive hardware environment and parameter settings employed in this study. All experiments were carried out on a workstation furnished with a single NVIDIA A100 (40 GB) graphics processing unit (GPU), utilizing the Ultralytics YOLOv8 framework (v8.2.0), operating on CUDA 11.4 and PyTorch (Version 2.3.1) [42]. To guarantee training stability, the basic configuration adhered to official recommendations: the Batch Size was set to 32, the Epochs to 200, and the AdamW optimizer was utilized with an initial learning rate of 0.002.
To account for the specific characteristics of taro peel detection, key adjustments were made to the default settings in the “Strategy” section of Table 5: the confidence threshold (Conf) in the inference phase was raised from 0.25 to 0.3 to filter background noise more rigorously. In the final model, Mosaic and Mixup data augmentations were entirely disabled (set to 0). As indicated in the literature, disabling these augmentations aligns the training data “with real world data distribution,” preventing unnatural textures from interfering with the model’s learning of taro peel features. Copy-Paste augmentation was retained and set to 0.1 to increase the visibility of target instances and promote comprehensive learning of residual taro peel features. These strategy adjustments aim to enhance the model’s capacity to capture minute textures, and their effectiveness is verified in the subsequent ablation experiments [43,44,45].
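Expressed through the Ultralytics API, the training and inference settings above correspond to the following sketch. The model and dataset file names ("mdb-yolo.yaml", "tpid.yaml", "conveyor_frame.jpg") are hypothetical placeholders; the hyperparameters mirror Table 5 and the text.

```python
from ultralytics import YOLO

model = YOLO("mdb-yolo.yaml")          # custom MDB-YOLO architecture definition (assumed name)

model.train(
    data="tpid.yaml",                  # TPID dataset definition (assumed name)
    epochs=200,
    batch=32,
    imgsz=640,
    optimizer="AdamW",
    lr0=0.002,                         # initial learning rate
    mosaic=0.0,                        # Mosaic disabled in the final configuration
    mixup=0.0,                         # Mixup disabled
    copy_paste=0.1,                    # Copy-Paste retained at low probability
)

# Inference with the raised confidence threshold (0.25 -> 0.3)
results = model.predict(source="conveyor_frame.jpg", conf=0.3)
```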

3. Experimental Results and Analysis

3.1. Experimental Setup and Evaluation Metrics

The training protocol was highly rigorous, encompassing 200 epochs, a batch size of 32, and the utilization of the AdamW optimizer with an initial learning rate of 0.002. A crucial adjustment was made to the final training strategy. Specifically, Mosaic and Mixup augmentations were deactivated (set to 0) during the final fine-tuning phase. This “domain alignment” strategy was implemented to enable the model to converge on the “real” data distribution, free from the artifacts induced by image stitching. This approach was demonstrated to be advantageous in minimizing false positives. In contrast, Copy-Paste augmentation was retained at a low probability (0.1) to preserve data density [46,47,48].
Performance was evaluated using standard industry metrics: Precision (P), Recall (R), and mean Average Precision (mAP). We specifically focused on mAP50-95, a stringent metric that averages performance across a range of IoU thresholds (0.5 to 0.95). This metric is particularly telling for our task as it rewards high localization accuracy, which is essential for distinguishing the peel from the background.
$\mathrm{mAP} = \dfrac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i$
where N is the number of classes (here N = 1).
These metrics, along with the associated loss functions, are monitored throughout the training process to assess the model’s learning progress and convergence. Figure 14 illustrates the evolution of these key metrics for the MDB-YOLO model over 200 training epochs, demonstrating a stable and effective learning trajectory.

3.2. Ablation Studies: Validating the Bionic Mechanisms

To isolate the contribution of each proposed module, a comprehensive ablation study was conducted (referring to Table 6). The results provide empirical validation for the architectural design.
  • Impact of C2f_EMA and WIoU: Comparing the baseline (Exp. 1) to the enhanced model (Exp. 2), the introduction of C2f_EMA and WIoU improved Precision from 0.892 to 0.905 and mAP50 from 90.8% to 91.7%. This confirms that the attention mechanism effectively highlights the tiny targets, while the robust loss function improves the quality of the bounding boxes by preventing overfitting to ambiguous boundaries.
  • Impact of Soft-NMS: Perhaps the most dramatic single-step improvement came from replacing NMS with Soft-NMS (Exp. 1 vs. Exp. 4). With no other changes, mAP50-95 jumped from 65.7% to 68.6%. This 2.9% increase serves as powerful proof of the “dense occlusion” hypothesis—valid targets were indeed being suppressed by the baseline model, and Soft-NMS successfully recovered them.
  • Impact of the Dynamic Chain: The combination of DySample, BiFPN_Concat2, and ODConv2d (Exp. 14) yielded a balanced improvement in both precision and recall. Specifically, DySample contributed a net growth of 0.9% in mAP50-95 compared to standard upsampling (Exp. 11 vs. Exp. 14), validating its ability to reconstruct irregular edges. ODConv2d further pushed Precision to 0.916 in intermediate experiments (Exp. 6), demonstrating its capacity to fit feature extraction to target geometry.
  • Cumulative Performance: The final configuration (Exp. 20), which combined all modules with the optimized training strategy (closing Mosaic), achieved the peak performance: mAP50-95 of 69.7% and Recall of 88.0%. This represents a substantial leap over the baseline’s 65.7% mAP and 85.2% Recall, demonstrating the additive value of the proposed modifications.

3.3. Hyperparameter Experiments

To quantify the specific contribution of different parameter adjustments to the identification of “residual taro peel adhering to the surface of peeled taro,” we divided the experiments into two logical blocks (see Table 7): Block A focuses on evaluating the effect of data augmentation strategies in the training phase (with inference parameters at default); Block B further verifies the overall performance after optimizing inference parameters.
In the first stage (Block A), we focused on analyzing the impact of data augmentation methods on feature learning. The results show that switching Mosaic from its default (enabled) to off (0) increased mAP50-95 by 0.9%. The fundamental reason is that Mosaic stitches four original images into one, introducing splicing boundaries (such as visible cut lines or abrupt brightness changes) and altering the original image structure. For fine targets like residual taro peel that rely on continuous texture, these discontinuous junction areas interfere with the model’s learning of real edges and destabilize the feature representation. With Mosaic disabled, the model trains on complete, continuous real images without stitching traces, improving its ability to model texture details.
On the basis of closing Mosaic, we retained a 0.1 probability of Copy-Paste. This setting achieved the highest mAP50-95 of 0.701 in Block A. While closing Mosaic reduces artifacts generated by image stitching (such as obvious boundary lines, brightness mutations, and texture discontinuity caused by “residual taro peel” being cut at seams), it also reduces diversity in layout and scene composition. To this end, Copy-Paste moderately increases instances of “residual taro peel” on real backgrounds, enabling the model to encounter more valid target samples, thereby enhancing its ability to learn appearance variations of “residual taro peel”.
In the second stage (Block B), we further adjusted the Conf (confidence threshold) in the inference phase based on the above training strategy, raising it from the default 0.25 to 0.3. Since the contrast between residual taro peel and taro flesh is low, the target confidence given by the model often lies in the middle range of 0.3–0.5. If the threshold is too high, some real targets will be erroneously deleted; if too low, conveyor belt reflections or shadows are easily misjudged as residual peel. Experimental verification shows that setting the threshold to 0.3 achieves a reasonable balance between “retaining real targets” and “excluding background interference,” and provides sufficient candidate boxes for subsequent filtering by Soft-NMS.
Although Block A achieved the highest mAP50-95 of 0.701 and Block B’s final configuration was slightly lower at 0.697, the latter’s Precision increased from 0.905 to 0.909. In industrial sorting scenarios, higher precision means a lower false-alarm rate, reducing material waste caused by misjudging qualified taro as unqualified. Therefore, further improving precision while maintaining a relatively high recall (0.880) better matches the stability and cost-control requirements of actual production.

3.4. Comparative Analysis with Prominent Models

To rigorously validate the proposed cost-driven engineering approach, the MDB-YOLO architecture was benchmarked against its baseline and a comprehensive spectrum of prominent object detection models characterized by varying degrees of architectural complexity. This analysis extends beyond standard metric comparisons to encompass a deep technical verification of engineering dilemmas, specifically addressing the critical trade-offs between theoretical state-of-the-art (SOTA) performance and the rigid constraints of industrial hardware deployment.
All comparative experiments were conducted on a standardized high-performance workstation equipped with a single NVIDIA A100 (40 GB) graphics processing unit (GPU). The software environment utilized the Ultralytics YOLOv8 framework (v8.3.18) operating within a CUDA 11.4 and PyTorch ecosystem to ensure a unified and unbiased testing ground. The comparative cohort was meticulously selected to represent the trajectory of Convolutional Neural Network (CNN) evolution and the emerging Transformer frontier. This includes multiple generations of the YOLO series (v5s, v8s, v9s, v10s, v11s, v12s, v13s) and the transformer-based RT-DETR-L. Furthermore, in response to the rapid evolution of the field, we extended the evaluation to include the absolute frontier of real-time detection: the RT-DETRv4-S (representing Vision Foundation Model distillation) and D-FINE-S (representing fine-grained distribution refinement) [49,50,51].
The initial phase of the comparative analysis focused on determining whether the optimized lightweight MDB-YOLO architecture could achieve performance parity with, or superiority over, the standard “Small” (s) variants of the YOLO family. The results, summarized in Table 8, reveal a counter-intuitive trend that challenges the prevailing assumption that newer architectures universally yield superior performance on specialized industrial datasets. While MDB-YOLO achieves a dominant mAP50-95 of 69.7%, significantly outperforming the baseline YOLOv8s (65.7%) and the heavy transformer-based RT-DETR-L (67.6%), the most recent state-of-the-art models, specifically YOLOv12s and YOLOv13s, exhibit unexpected underperformance.
Specifically, YOLOv12s records a mAP50-95 of only 64.1%, falling below even the older YOLOv5s (64.9%). Similarly, YOLOv13s achieves 65.1%, failing to surpass the v8s baseline. This anomaly can be attributed to Domain Misalignment. These newer architectures often employ complex attention-centric or hypergraph mechanisms optimized for the COCO dataset, which prioritizes semantic object understanding (e.g., distinguishing a “dog” from a “cat”). In contrast, the taro peeling task is Texture-Centric, relying on high-frequency local features to define the boundaries of peel fragments. The global context integration in v12 and v13 likely dilutes these subtle texture signals, leading to poorer localization accuracy. Further analysis of the training dynamics in Figure 15 corroborates this. While MDB-YOLO’s curve separates early and maintains a stable asymptotic trajectory, the curves for YOLOv12s and v13s exhibit noticeable volatility. This oscillation suggests that their loss landscapes are ill-conditioned for the ambiguous boundaries of the TPID dataset, causing the models to struggle with convergence on the “tiny target” class.
To provide a comprehensive verification of MDB-YOLO’s standing against the absolute bleeding edge of detection technology, we extended our analysis to include RT-DETRv4-S and D-FINE-S. These models represent the pinnacle of current research: RT-DETRv4 utilizes knowledge distillation from massive Vision Foundation Models (VFMs), while D-FINE introduces fine-grained distribution refinement for regression tasks. We fully reproduced the training process for these models on the TPID dataset, ensuring all training configurations (Epochs, Batch Size, Optimizer) were kept consistent with MDB-YOLO to ensure fairness. The results are detailed in Table 9.
The data reveals a fascinating dichotomy in performance characteristics. The transformer-based SOTA models (RT-DETRv4 and D-FINE) demonstrate exceptional Average Recall (AR), driven by their global attention mechanisms. However, a critical observation arises concerning the “performance shift” of MDB-YOLO when transitioning from the YOLO internal validation tool to the standard pycocotools library. As shown in Table 9, although MDB-YOLO achieves a superior mAP50 of 92.1% in its native framework, the metric converges to 85.3% under the stricter COCO standard. According to the comparative study by Padilla et al. [52,53], such discrepancies are primarily attributed to the PR-curve interpolation methods: while COCO uses a rigorous 101-point interpolation, many YOLO-based evaluators employ an all-points interpolation that yields more granular but less conservative precision estimates. Furthermore, the YOLO evaluator’s use of Letterbox adaptive resizing preserves the original aspect ratio of industrial images, whereas standard resizing in pycocotools often introduces geometric distortion that degrades localization accuracy, particularly for small-scale industrial targets [54]. This explains the decrease in mAP50-95 from 69.7% to 61.3% during the transition. While the standardized COCO metrics provide a transparent baseline for academic comparison, the framework-specific results (69.7% mAP50-95) more accurately reflect MDB-YOLO’s robustness in real-world industrial deployment. This suggests that MDB-YOLO’s architecture is more adept at precise bounding box refinement in texture-dense scenes than its standardized scores might initially suggest.
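The standardized COCO-style evaluation referenced above (101-point interpolation) can be reproduced with pycocotools, as sketched below; the annotation and result file names are hypothetical placeholders for COCO-format ground truth and exported detections.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("tpid_test_annotations.json")            # ground truth in COCO format (assumed name)
coco_dt = coco_gt.loadRes("mdb_yolo_detections.json")   # detections exported from the model (assumed name)

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP@[0.50:0.95], AP@0.50, AR, etc.
```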
To demonstrate the model’s readiness for practical application, a user-friendly detection system with a graphical user interface (GUI) was developed using QT (Version 6.10.1), as shown in Figure 16. This system allows operators to easily load an image, initiate the detection process with a single click, and view the results in real-time. The interface displays the original image alongside the processed output, which clearly highlights any detected residual peel fragments with bounding boxes and confidence scores. This application serves as a tangible proof-of-concept, showcasing how the high-performance MDB-YOLO model can be integrated into an accessible tool for on-site quality control.
Beyond the quantitative inference speed benchmarks, a qualitative evaluation was conducted to assess the model’s practical detection robustness in a real-world industrial context. For this test, MDB-YOLO was visually compared against six other YOLO models (YOLOv8s, v9s, v10s, v11s, v12s, and v13s) and the RT-DETR-L model. As shown in Figure 17, the models processed images of peeled taro on a moving conveyor belt within the factory setting. The top row shows that MDB-YOLO consistently and accurately detects all instances of residual peel across multiple frames without error. In contrast, all other models exhibit significant performance deficiencies, as highlighted by the red circles. These errors fall into two categories:
  • False Negatives (Missed Detections): All competitor models frequently fail to identify actual peel fragments, particularly those that are small or have low contrast. This is a critical failure for a quality control system;
  • False Positives (Incorrect Detections): Multiple models, particularly the baseline YOLOv8s along with the YOLOv9s, v12s, and v13s, misidentify features of the conveyor belt, including the black seams, as defects. This phenomenon would result in an unacceptably high false-alarm rate in a production setting.
Although the newer small-variant models (v10s, v11s, v12s, v13s) exhibit a certain degree of reduction in false positives compared to the baseline model, they still encounter challenges in maintaining consistency and frequently fail to detect smaller fragments. This qualitative assessment conducted in a realistic deployment scenario effectively demonstrates the superior robustness and reliability of MDB-YOLO. The architectural improvements not only enhance quantitative metrics but also directly result in more reliable performance within the target industrial environment, minimizing both undetected defects and false alarms.
The Gradient-weighted Class Activation Mapping (Grad-CAM) heatmaps of MDB-YOLO display precise, “foveal” activation that is strictly concentrated on the peel residues. In sharp contrast, the heatmaps of YOLOv12s and YOLOv13s often focus on the geometric center of the taro tuber rather than surface defects or incorrectly activate on the dark seams of the conveyor belt. This phenomenon indicates a bias towards object detection rather than defect segmentation, which leads to the higher false negative rates observed in real-time comparisons.

3.5. Edge Device Deployment and Quantitative Analysis

To validate the practical applicability and efficiency of MDB-YOLO for cost-sensitive industrial applications, the trained model was deployed on a representative edge computing platform, the NVIDIA Jetson Xavier NX. This section details the hardware environment, the model optimization pipeline, and the quantitative inference performance.

3.5.1. Deployment Hardware and Software Environment

The model was deployed on an NVIDIA Jetson Xavier NX Developer Kit (NVIDIA Corporation, Santa Clara, CA, USA). This embedded platform features a 6-core ARM v8.2 64-bit CPU (1.9 GHz), a 384-core Volta-architecture GPU with 48 Tensor Cores, and 8 GB of LPDDR4 memory. The system runs an Ubuntu 20.04 LTS operating system with the JetPack 5.1.5 development kit. The core software environment includes CUDA 11.4, cuDNN 8.6, TensorRT 8.5.2.2, and Python 3.8.10. To ensure maximum computational output and stable performance for benchmarking, the device was set to the 20W 6 CORE power mode, which activates all six CPU cores and allocates the highest power budget. This configuration represents a typical, high-performance edge-computing scenario.

3.5.2. Model Export and Optimization Pipeline

The model, originally trained in PyTorch on the A100 server, underwent a multi-stage optimization pipeline to prepare it for high-speed inference on the Jetson platform.
Export to ONNX: The trained PyTorch (.pt) weight file was first converted to the ONNX (Open Neural Network Exchange) intermediate representation using the Ultralytics framework’s built-in “export” command. During this conversion, two key strategies were enabled: operator simplification, which merges redundant computational nodes to enhance model compactness, and fixed input dimensions (locking the input to 640 × 640), which improves compatibility for hardware acceleration.
Compilation to TensorRT Engine: Subsequently, on the Jetson device itself, the ONNX model was compiled into a “.trt” engine file using the TensorRT toolchain (Version 8.6.1). This critical step performs several platform-specific optimizations, including layer fusion (combining multiple layers like convolution, bias, and activation into a single kernel), precision calibration, and kernel auto-tuning to select the fastest algorithms for the target Volta GPU. This process significantly optimizes the model for low-latency inference and energy efficiency on the edge.
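The export and compilation steps described above can be summarised as follows. The weight and engine file names are hypothetical; the fixed 640 × 640 input and operator simplification mirror the strategies in the text, and the trtexec invocations are shown as comments.

```python
from ultralytics import YOLO

# Export the trained weights to ONNX with simplification and a fixed input size.
model = YOLO("mdb_yolo_best.pt")                      # assumed weight file name
model.export(format="onnx", imgsz=640, simplify=True, dynamic=False)

# On the Jetson device, the ONNX graph is then compiled into a TensorRT engine, e.g.:
#   trtexec --onnx=mdb_yolo_best.onnx --saveEngine=mdb_yolo_fp16.trt --fp16
#   trtexec --onnx=mdb_yolo_best.onnx --saveEngine=mdb_yolo_int8.trt --int8
# (INT8 additionally requires a calibration dataset or calibration cache.)
```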

3.5.3. TensorRT Quantization Standards

TensorRT supports multiple precision standards, allowing for a trade-off between computational efficiency and model accuracy. The choice of standard is critical for resource-constrained embedded scenarios. The three standards evaluated are:
  • FP32 (Single-Precision): Uses 32-bit floating-point numbers for weights and activations. This maintains the highest theoretical accuracy, identical to the training environment, but has the slowest inference speed and highest memory footprint.
  • FP16 (Half-Precision): Uses 16-bit floating-point format. It provides a significant boost in computation speed and memory bandwidth with a minimal and often negligible (typically <0.5%) loss in precision, making it ideal for real-time detection tasks.
  • INT8 (8-bit Integer Quantization): Converts weights and activations to 8-bit integers. This standard maximizes inference efficiency and achieves the lowest latency, making it perfectly suited for highly resource-constrained edge devices, provided the (typically <2%) accuracy loss is acceptable for the task.
The selection among these standards depends on the dynamic trade-off between the task’s specific accuracy requirements and the hardware’s constraints. Figure 18 presents the actual measured inference performance of MDB-YOLO on the Jetson Xavier NX under these three precision standards. The analysis provides a clear performance hierarchy.
Under the FP32 (Single-Precision) standard, which serves as the baseline by maintaining the original model weights, MDB-YOLO achieved a measured real-world throughput of 12 FPS. This performance serves as the baseline for non-optimized operation on the edge platform. A significant performance improvement was achieved by leveraging the FP16 (Half-Precision) standard. This optimization, which directly utilizes the Volta architecture’s 48 Tensor Cores, doubled the throughput to 24 FPS. This performance level is crucial, as it meets or exceeds the frame rate of many standard industrial cameras, enabling genuine real-time processing. For applications prioritizing maximum throughput, INT8 (8-bit Integer) quantization provided the peak measured performance. This standard further increased the speed to 27 FPS, representing a total 2.25× speedup over the FP32 baseline. This quantitative validation demonstrates that MDB-YOLO is not only accurate but also highly scalable, offering a spectrum of performance options. The FP16 and INT8 standards, in particular, confirm its ability to operate effectively in real-time on a cost-effective, industrially relevant edge platform, successfully balancing computational efficiency with practical application needs.

4. Discussion

This study aimed to develop a high-precision, lightweight object detection model capable of operating within the constraints of an industrial taro processing line. The results indicate that MDB-YOLO significantly outperforms existing state-of-the-art models, particularly in localization accuracy (mAP50-95), while maintaining computational efficiency suitable for edge deployment.

4.1. Interpretation of Architectural Improvements

The superior performance of MDB-YOLO can be attributed to the targeted resolution of three specific morphological and physical challenges defined in the problem analysis. Firstly, the integration of the C2f_EMA module and WIoU loss addressed the “tiny target and texture interference” challenge. Standard convolutions often fail to distinguish the subtle texture of residual peel from the taro surface. The Cross-Spatial Learning mechanism in EMA allowed the model to aggregate pixel-level relationships, effectively amplifying the feature signal of the peel residue. Simultaneously, WIoU prevented the training instability typically associated with the fuzzy boundaries of these small targets by dynamically focusing on varying quality anchor boxes. Secondly, the “irregular shape” challenge was mitigated through the “Reconstruct-Fuse-Preserve” link. The ablation studies confirmed that replacing standard upsampling with DySample and standard convolution with ODConv2d allowed the network to adaptively adjust sampling points and kernel weights. This dynamic adaptability is crucial for organic defects like taro peel, which do not conform to the rigid geometric assumptions of standard CNNs. Thirdly, the implementation of Soft-NMS proved decisive for the “dense occlusion” challenge. In crowded industrial scenarios, the greedy elimination strategy of Hard-NMS is a primary source of recall loss. By decaying rather than discarding overlapping boxes, Soft-NMS retained valid detections in high-density clusters, directly contributing to the observed 2.9% increase in mAP50-95 (Comparison of Exp. 1 and Exp. 4).

4.2. Performance vs. Efficiency Trade-Off

In industrial computer vision, model selection is governed by the Return on Compute (RoC), that is, the equilibrium between detection accuracy and computational resource consumption. Our benchmarking reveals a significant efficiency gap in generalist architectures such as YOLOv13s. Despite a moderate computational load of 21.3 GFLOPS, YOLOv13s exhibits a high inference latency of 3.2 ms, nearly triple that of MDB-YOLO. This disparity indicates that the architectural components of YOLOv13s, likely unoptimized attention mechanisms or high memory-access costs, create bottlenecks on edge hardware that the GFLOPS metric fails to capture. Consequently, YOLOv13s offers a suboptimal trade-off: lower accuracy (65.1% mAP50-95) at a significantly higher latency cost.
Conversely, the transformer-based RT-DETR-L represents the “brute force” upper bound. While it achieves a commendable mAP50-95 of 67.6%, it demands 108 GFLOPS and 3.3 ms of inference time. Deploying such a model necessitates high-wattage, expensive hardware, increasing the system’s Bill of Materials (BoM) without surpassing the performance of specialized solutions. MDB-YOLO, with 13.44 M parameters and 28.4 GFLOPS, demonstrates the superiority of domain-specific optimization. By integrating DySample (replacing expensive deconvolution with point sampling) and BiFPN_Concat2, the model achieves a peak accuracy of 69.7% mAP50-95 at a latency of only 1.1 ms. Validation on the NVIDIA Jetson Xavier NX (27 FPS at INT8) confirms that MDB-YOLO provides a practical, cost-effective solution for real-world industrial environments.
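Because on-device latency, rather than GFLOPS, ultimately determines the Return on Compute, such figures are best obtained by direct timing. The following sketch shows one way a per-frame measurement can be scripted with PyTorch and the Ultralytics API; the checkpoint name, warm-up count, and iteration count are illustrative assumptions, and the exact protocol used for Table 8 may differ (this sketch times the full predict call, including pre- and post-processing).

```python
import time
import numpy as np
import torch
from ultralytics import YOLO

def measure_latency(weights="mdb_yolo.pt", imgsz=640, warmup=50, iters=300):
    """Rough end-to-end per-frame latency (ms) on a synthetic frame."""
    model = YOLO(weights)
    frame = np.zeros((imgsz, imgsz, 3), dtype=np.uint8)   # dummy BGR frame
    for _ in range(warmup):                                # stabilize clocks and caches
        model.predict(frame, imgsz=imgsz, verbose=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model.predict(frame, imgsz=imgsz, verbose=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0 / iters

if __name__ == "__main__":
    print(f"mean latency: {measure_latency():.2f} ms/frame")
```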

4.3. Economic Impact of the Precision–Recall Trade-Off

The operational success of high-throughput agricultural sorting depends on the specific balance between Precision and Recall. While transformer-based State-of-the-Art (SOTA) models often prioritize high Recall, they do so at the expense of Precision, leading to distinct industrial disadvantages.
In our qualitative analysis, models like RT-DETRv4 and D-FINE frequently “hallucinate” defects on clean surfaces or misidentify conveyor belt seams as peel residue. In a sorting line, a False Positive results in the unnecessary rejection of healthy product. This leads to material waste through redundant re-peeling, energy inefficiency due to product recirculation, and a gradual loss of operator trust in the automated system.
MDB-YOLO utilizes a pure CNN architecture (C2f_EMA + ODConv2d) optimized for local textural sensitivity. By focusing on definitive textural features rather than broad global context, the model maintains an extremely low False Positive Rate. In food processing economics, ensuring that a triggered alarm is valid (High Precision) is often more valuable than capturing every marginal case at the cost of high waste. MDB-YOLO is thus specifically tuned for the economic constraints of the taro processing industry.

4.4. Hardware Compatibility and Deployment Feasibility

Beyond algorithmic metrics, the “Hardware Barrier” determines whether a model can be deployed in fragmented industrial environments. A technical verification of operator compatibility reveals a significant divide between SOTA models and MDB-YOLO.
Models like RT-DETRv4 and D-FINE rely on specialized operators such as Multi-scale Deformable Attention and GridSample. While these are accelerated by TensorRT kernels on flagship hardware (e.g., NVIDIA A100), they often lack support on the cost-effective Microcontroller Units (MCUs) or legacy industrial PCs typically found in agricultural facilities. On such devices, these operators fall back to CPU execution, causing latency to spike and rendering “real-time” detection impossible.
MDB-YOLO adheres to a foundational CNN framework. The network is constructed using universally supported operators: standard convolutions (Conv2d), Batch Normalization, and SiLU activations. These operators are natively optimized across all major inference frameworks (ONNX Runtime, TFLite, NCNN) and hardware tiers. This “Hardware Agnostic” approach ensures deployment certainty, allowing the model to function reliably across the entire spectrum from high-end GPUs to minimal edge devices.
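As an illustration of this hardware-agnostic property, the sketch below exports the network to ONNX and lists the operator types present in the resulting graph before loading it with the CPU execution provider of ONNX Runtime. The file names are placeholders and the opset choice is an assumption.

```python
import onnx
import onnxruntime as ort
from ultralytics import YOLO

# Export to ONNX; the returned string is the path of the generated .onnx file.
onnx_path = YOLO("mdb_yolo.pt").export(format="onnx", opset=12, simplify=True)

# Inspect the operator set: it should contain only widely supported ops
# (Conv, Sigmoid, Mul, Concat, Resize, ...), with no deformable-attention kernels.
graph = onnx.load(onnx_path)
print(sorted({node.op_type for node in graph.graph.node}))

# Sanity check that the graph loads under a plain CPU execution provider.
session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
print([inp.name for inp in session.get_inputs()], [out.name for out in session.get_outputs()])
```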

4.5. Limitations and Future Work

Despite these advancements, limitations remain. The current dataset relies on images captured under controlled industrial lighting (LED strips). While data augmentation strategies (e.g., color shifts) were employed to simulate lighting variations, the model’s robustness under uncontrolled, natural lighting conditions remains to be verified. Additionally, while the model is optimized for taro, its generalization to other root vegetables with similar skin-to-flesh contrast issues (e.g., cassava or sweet potato) requires further validation. Future research will focus on expanding the dataset to include multi-crop domains and exploring model pruning techniques to further reduce latency on ultra-low-power microcontrollers [55,56,57,58,59,60,61,62].

5. Conclusions

In this study, we proposed MDB-YOLO, a specialized object detection framework designed to address the difficulties of detecting incomplete peeling in taro deep-processing lines. By analyzing the physical characteristics of the target, specifically size, shape irregularity, and distribution density, we introduced a series of bionic and dynamic architectural optimizations. The creation of the Taro Peel Industrial Dataset (TPID) provided a standardized benchmark for this specific agricultural task.
The incorporation of the C2f_EMA attention mechanism and WIoU loss function significantly enhanced the detection of tiny, low-contrast residues. The construction of a dynamic feature processing chain, utilizing DySample, BiFPN_Concat2, and ODConv2d, ensured the accurate morphological reconstruction of irregular peel shapes. Furthermore, the adoption of Soft-NMS effectively resolved occlusion issues in dense stacking scenarios. Experimental validation shows that MDB-YOLO achieves a state-of-the-art mAP50-95 of 69.7%, surpassing the baseline YOLOv8s by a significant margin while maintaining a lower computational load. The successful deployment on the Jetson Xavier NX platform, achieving 27 FPS, demonstrates the model’s practical readiness for cost-sensitive industrial applications. This research provides a reference for the development of visual inspection systems for root vegetables and demonstrates the effectiveness of integrating dynamic, content-aware mechanisms into lightweight detection architectures.

Author Contributions

Conceptualization, H.S.; methodology, X.F.; software, L.Y. and Y.Z.; validation, X.Y.; formal analysis, L.Y.; investigation, L.Y., X.F. and W.G.; resources, W.G.; data curation, X.F.; writing—original draft preparation, L.Y.; writing—review and editing, X.Y. and H.S.; visualization, Y.Z.; supervision, Y.T., X.L. and C.S.; project administration, X.Z.; funding acquisition, C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Guangdong University of Science and Technology, Key Research Projects grant numbers GKY-2024KYZDK-13 and GKY-2025KYZDK-21, the Guangdong Province Key Area Project for General Universities, grant numbers 2025ZDZX1047 and 2023ZDZX3049, the Guangdong Provincial Key Discipline Research Capacity Enhancement Project, grant number 2022ZDJS147, the Guangdong Provincial Specialized Innovation Projects for Regular Higher Education Institutions, grant number 2024KTSCX189 and the Research Project on General Topics by the Guangdong Provincial Education Evaluation Association, grant number BDPG25064.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available at https://github.com/jackcr7900/MDB-YOLO (accessed on 10 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MDB-YOLO: Multi-Dimensional Bionic YOLO
YOLO: You Only Look Once
TPID: Taro Peel Industrial Dataset
EMA: Efficient Multi-Scale Attention
C2f: CSP 2-stage Feature fusion
DySample: Dynamic Upsampling
ODConv: Omni-Dimensional Dynamic Convolution
BiFPN: Bidirectional Feature Pyramid Network
WIoU: Wise-IoU
NMS: Non-Maximum Suppression
Soft-NMS: Soft Non-Maximum Suppression
mAP: mean Average Precision
TP: True Positives
FP: False Positives
FN: False Negatives
CLAHE: Contrast Limited Adaptive Histogram Equalization
FPS: Frames Per Second
GPU: Graphics Processing Units
ONNX: Open Neural Network eXchange
GFLOPS: Giga Floating-Point Operations per Second
TensorRT: NVIDIA TensorRT (high-performance inference runtime)
FPN: Feature Pyramid Network
CIoU: Complete IoU
CRI: Color Rendering Index
INT8: 8-bit Integer
FP16: 16-bit Floating Point (Half-Precision)
FP32: 32-bit Floating Point (Single-Precision)
AIoT: Artificial Intelligence of Things
IoU: Intersection over Union
P: Precision
R: Recall
AP: Average Precision

References

  1. Temesgen, M.; Retta, N. Nutritional potential, health and food security benefits of taro Colocasia esculenta (L.): A review. Food Sci. Qual. Manag. 2015, 36, 23–30. [Google Scholar]
  2. Kühlechner, R. Object detection survey for industrial applications with focus on quality control. Prod. Eng. 2025, 19, 1271–1291. [Google Scholar] [CrossRef]
  3. Ogidi, O.I.; Wenapere, C.M.; Chukwunonso, O.A. Enhancing Food Safety and Quality Control With Computer Vision Systems. In Computer Vision Techniques for Agricultural Advancements; IGI Global: Hershey, PA, USA, 2025; pp. 51–88. [Google Scholar]
  4. Dhal, S.B.; Kar, D. Leveraging artificial intelligence and advanced food processing techniques for enhanced food safety, quality, and security: A comprehensive review. Discov. Appl. Sci. 2025, 7, 75. [Google Scholar] [CrossRef]
  5. Mann, S.; Dixit, A.K.; Shrivastav, A. Development and performance optimization of a taro (Colocasia esculenta) peeling machine for enhanced efficiency in small-scale farming. Sci. Rep. 2025, 15, 11336. [Google Scholar] [CrossRef]
  6. Tadesse, B.; Gebeyehu, S.; Kirui, L.; Maru, J. The contribution of potato to food security, income generation, employment, and the national economy of Ethiopia. Potato Res. 2025, in press. [Google Scholar]
  7. Lin, Y.; Ma, J.; Wang, Q.; Sun, D.W. Applications of machine learning techniques for enhancing nondestructive food quality and safety detection. Crit. Rev. Food Sci. Nutr. 2023, 63, 1649–1669. [Google Scholar] [CrossRef]
  8. Che, C.; Xue, N.; Li, Z.; Zhao, Y.; Huang, X. Automatic cassava disease recognition using object segmentation and progressive learning. PeerJ Comput. Sci. 2025, 11, e2721. [Google Scholar] [CrossRef]
  9. Li, X.; Wang, F.; Guo, Y.; Liu, Y.; Lv, H.; Zeng, F.; Lv, C. Improved YOLO v5s-based detection method for external defects in potato. Front. Plant Sci. 2025, 16, 1527508. [Google Scholar] [CrossRef]
  10. Yu, K.; Zhong, M.; Zhu, W.; Rashid, A.; Han, R.; Virk, M.; Duan, K.; Zhao, Y.; Ren, X. Advances in computer vision and spectroscopy techniques for non-destructive quality assessment of citrus fruits: A comprehensive review. Foods 2025, 14, 386. [Google Scholar] [CrossRef]
  11. Ma, B.; Hua, Z.; Wen, Y.; Deng, H.; Zhao, Y.; Pu, L.; Song, H. Using an improved lightweight YOLOv8 model for real-time detection of multi-stage apple fruit in complex orchard environments. Artif. Intell. Agric. 2024, 11, 70–82. [Google Scholar] [CrossRef]
  12. Wang, H.; Yun, L.; Yang, C.; Wu, M.; Wang, Y.; Chen, Z. OW-YOLO: An improved YOLOv8s lightweight detection method for obstructed walnuts. Agriculture 2025, 15, 159. [Google Scholar] [CrossRef]
  13. Wang, X.; Gao, H.; Jia, Z.; Li, Z. BL-YOLOv8: An improved road defect detection model based on YOLOv8. Sensors 2023, 23, 8361. [Google Scholar] [CrossRef] [PubMed]
  14. Wang, Y.; Zhang, K.; Wang, L.; Wu, L. An improved YOLOv8 algorithm for rail surface defect detection. IEEE Access 2024, 12, 44984–44997. [Google Scholar] [CrossRef]
  15. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
  16. Payne, K.; O’Bryan, C.A.; Marcy, J.A.; Crandall, P.G. Detection and prevention of foreign material in food: A review. Heliyon 2023, 9, e02262. [Google Scholar] [CrossRef]
  17. Sun, D.W. Computer Vision Technology for Food Quality Evaluation, 1st ed.; Academic Press: Amsterdam, The Netherlands, 2016. [Google Scholar]
  18. Wang, X.; Xiao, T.; Jiang, Y.; Shao, S.; Sun, J.; Shen, C. Repulsion Loss: Detecting Pedestrians in a Crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7774–7783. [Google Scholar]
  19. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  20. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  21. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  22. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  23. Liu, Y.; Shao, Z.; Hoffmann, N. Global attention mechanism: Retain information to enhance channel-spatial interactions. arXiv 2021, arXiv:2112.05561. [Google Scholar] [CrossRef]
  24. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
  25. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  26. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000. [Google Scholar]
  27. Gevorgyan, Z. SIoU Loss: More than a Penalty Term. arXiv 2022, arXiv:2205.12740. [Google Scholar]
  28. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 128–139. [Google Scholar] [CrossRef]
  29. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6027–6037. [Google Scholar]
  30. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar]
  31. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  32. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  33. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  34. Doherty, J.; Gardiner, B.; Kerr, E.; Siddique, N. Bifpn-yolo: One-stage object detection integrating bi-directional feature pyramid networks. Pattern Recognit. 2025, 160, 111209. [Google Scholar] [CrossRef]
  35. Li, C.; Zhou, A.; Yao, A. Omni-dimensional dynamic convolution. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022. [Google Scholar]
  36. Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic snake convolution based on topological geometric constraints for tubular structure segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6070–6079. [Google Scholar]
  37. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  38. Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 12021–12031. [Google Scholar]
  39. Li, J.; Wen, Y.; He, L. SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 6153–6162. [Google Scholar]
  40. Chen, X.; Wu, Z.; Zhang, W.; Bi, T.; Tian, C. An Omni-Dimensional Dynamic Convolutional Network for Single-Image Super-Resolution Tasks. Mathematics 2025, 13, 2388. [Google Scholar] [CrossRef]
  41. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS: Improving Object Detection with One Line of Code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5561–5569. [Google Scholar]
  42. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 20 June 2025).
  43. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  44. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  45. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  46. Lei, M.; Li, S.; Gao, Y. YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception. arXiv 2025, arXiv:2506.17733. [Google Scholar]
  47. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar]
  48. Hu, J.; Zheng, J.; Wan, W.; Zhou, Y.; Huang, Z. RT-DETR-EVD: An Emergency Vehicle Detection Method Based on Improved RT-DETR. Sensors 2025, 25, 3327. [Google Scholar] [CrossRef]
  49. Siméoni, O.; Vo, H.V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. Dinov3. arXiv 2025, arXiv:2508.10104. [Google Scholar]
  50. Liao, Z.; Zhao, Y.; Shan, X.; Yan, Y.; Liu, C.; Lu, L.; Ji, X.; Chen, J. RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models. arXiv 2025, arXiv:2510.25257. [Google Scholar]
  51. Peng, Y.; Li, H.; Wu, P.; Zhang, Y.; Sun, X.; Wu, F. D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025. [Google Scholar]
  52. Padilla, R.; Netto, S.L.; Da Silva, E.A. A survey on performance metrics for object-detection algorithms. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niterói, Brazil, 1–3 July 2020; pp. 237–242. [Google Scholar]
  53. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  54. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  55. Khalili, B.; Smyth, A.W. SOD-YOLOv8—Enhancing YOLOv8 for Small Object Detection in Traffic Scenes. arXiv 2024, arXiv:2408.04786. [Google Scholar]
  56. Zhang, Y.; Wu, C.; Zhang, T.; Zheng, Y. Full-Scale Feature Aggregation and Grouping Feature Reconstruction-Based UAV Image Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5621411. [Google Scholar] [CrossRef]
  57. Chen, L.; Fu, Y.; Gu, L.; Yan, C.; Harada, T.; Huang, G. Frequency-Aware Feature Fusion for Dense Image Prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4756–4771. [Google Scholar] [CrossRef]
  58. Liu, Y.; Yu, C.; Cheng, J.; Wang, Z.J.; Chen, X. MM-Net: A Mixformer-Based Multi-Scale Network for Anatomical and Functional Image Fusion. IEEE Trans. Image Process. 2024, 33, 2197–2212. [Google Scholar] [CrossRef]
  59. Wang, N.; Er, M.J.; Chen, J.; Wu, J.G. Marine object detection based on improved YOLOv5. In Proceedings of the 2022 5th International Conference on Intelligent Autonomous Systems (ICoIAS), Fuzhou, China, 13–15 May 2022; pp. 43–47. [Google Scholar]
  60. Peng, J.; Zhao, H.; Zhao, K.; Wang, Z.; Yao, L. Dynamic Background Reconstruction via Masked Autoencoders for Infrared Small Target Detection. Eng. Appl. Artif. Intell. 2024, 135, 108762. [Google Scholar] [CrossRef]
  61. Trinh, C.D.; Le, T.M.D.; Do, T.H.; Bui, N.M.; Nguyen, T.H.; Ngo, Q.U.; Ngo, P.T.; Bui, D.T. Improving YOLOv8 deep learning model in rice disease detection by using Wise-IoU loss function. J. Meas. Control Autom. 2025, 29, 1–6. [Google Scholar] [CrossRef]
  62. Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More Effective Intersection over Union Loss with Auxiliary Bounding Box. arXiv 2023, arXiv:2311.02877. [Google Scholar]
Figure 1. Schematic illustration of the industrial taro processing workflow. The diagram illustrates a taro processing line where items undergo washing and peeling, followed by a quality check that either approves them for market (green arrow and check mark) or diverts them (blue arrow) to a rework loop (red cross mark): (a) Preliminary Cleaning, where raw tubers undergo high-pressure washing to remove soil and surface impurities; (b) Mechanical Peeling, utilizing abrasive rollers to strip the epidermis; and (c) the Quality Check (QC) Stage, where peeled tubers are inspected for residual skin. The proposed MDB-YOLO system is strategically deployed at stage (c) to automate the detection of incomplete peeling defects, replacing the labor-intensive manual sorting process.
Figure 2. The methodological framework and research roadmap.
Figure 3. An overview of the Taro Peel Industrial Dataset (TPID). (a) The workflow for dataset creation, showing the progression from data acquisition and annotation to data augmentation. (b) The distribution of image samples across the training, validation, and testing sets. (c) The corresponding distribution of annotated instances (residual peel fragments) for each set.
Figure 4. The architecture of the proposed MDB-YOLO model.
Figure 5. Comparison of mainstream attention mechanisms, including YOLOv8s (default: None), CBAM [20], Coordinate [21], ECA [22], GAM [23], SimAM [24], and EMA. Note that EMA (bottom right) uniquely combines parallel sub-networks to capture cross-spatial learning, distinguishing it from the simpler channel-only or spatial-only focus of other methods.
Figure 6. The architecture of the Efficient Multi-Scale Attention (EMA) module and the proposed C2f_EMA module.
Figure 7. Experimentally compared solutions for upsampling, comparing YOLOv8s (default: Upsample), Upsample + BiFPN, CARAFE, and DySample. Note how standard upsampling (top) relies on fixed pixel duplication, whereas DySample (bottom) employs a learnable field to adjust sampling content.
Figure 8. The structure of the DySample dynamic upsampling module, illustrating the dynamic point-sampling mechanism used to reconstruct feature morphology. The “Offset” generator (middle) predicts sampling coordinates based on input content, allowing the “Sample” step to pick pixels that respect the object’s curvature.
Figure 9. The simplified structure of the BiFPN_Concat2 module and the structure of a standard BiFPN.
Figure 10. Experimental comparison of Convolution variants, comparing YOLOv8s (default: Conv), DSConv, DWConv, PConv, SCConv, and ODConv. The diagram highlights how ODConv (bottom right) uniquely integrates multiple attention dimensions compared to the simpler structures of others.
Figure 11. The structural implementation of the ODConv2d module, highlighting the four complementary attention mechanisms (spatial, channel, filter, kernel) in ODConv. This multi-dimensional adaptability allows the kernel to “morph” to fit the irregular shapes of peel fragments.
Figure 12. The pseudo-code comparison between standard NMS (red) and Soft-NMS (green), showing the transition from a hard threshold (if IoU > Nt then discard) to a Gaussian decay function. This change allows valid but overlapping detections to survive suppression.
Figure 13. The architecture of the proposed MDB-YOLO model. This holistic view demonstrates how each “bionic” module contributes to the data flow.
Figure 14. Training and validation performance curves for the MDB-YOLO model over 200 epochs. The top row displays the training loss components (box, classification, and DFL) and the validation metrics for precision and recall. The bottom row shows the corresponding validation loss components and the validation metrics for mAP50 and mAP50-95. The curves demonstrate stable convergence and effective learning throughout the training process.
Figure 15. Performance comparison curves of MDB-YOLO against other prominent models across four key metrics over 200 training epochs.
Figure 16. The graphical user interface (GUI) of the MDB-YOLO-based detection system developed for practical deployment. (a) The main interface of the application. (b) The real-time detection interface, showing the input image on the left and the output with detected residual peel fragments, bounding boxes, and confidence scores on the right.
Figure 17. Visual comparison of model performance and attention mechanisms under industrial LED strip lighting. The left panel for each model shows the detection results, with red circles highlighting errors (false negatives and false positives). The right panel displays heat-map visualizations indicating the model’s focus areas. The heat maps for MDB-YOLO show precise activation on true defects, whereas others exhibit diffuse or misplaced attention, corresponding to their detection failures.
Figure 18. Actual measured performance results of the MDB-YOLO model on the Jetson Xavier NX platform under three different TensorRT precision standards (FP32, FP16, and INT8).
Table 1. Comparisons showing EMA performing best with mAP50 of 0.903 and mAP50-95 of 0.6217.
Attention Module | Epochs | mAP50 | mAP50-95 | Precision | Recall
default (-) | 100 | 0.8466 | 0.5425 | 0.8157 | 0.7755
GAM | 100 | 0.90 | 0.6184 | 0.8392 | 0.8424
SimAM | 100 | 0.887 | 0.6112 | 0.8484 | 0.8223
ECA | 100 | 0.8995 | 0.6196 | 0.8732 | 0.8099
Coordinate | 100 | 0.8891 | 0.6104 | 0.8499 | 0.8284
CBAM | 100 | 0.8904 | 0.6046 | 0.8528 | 0.8192
EMA | 100 | 0.903 | 0.6217 | 0.8909 | 0.8078
Table 2. Comparisons showing DySample performing best with mAP50 of 0.8917.
Upsampling Method | Epochs | mAP50 | mAP50-95 | Precision | Recall
default (Upsample) | 100 | 0.8466 | 0.5425 | 0.8157 | 0.7755
Upsample + BiFPN | 100 | 0.8648 | 0.501 | 0.8395 | 0.7788
CARAFE | 100 | 0.8706 | 0.5414 | 0.841 | 0.8003
DySample | 100 | 0.8917 | 0.5681 | 0.891 | 0.7896
Table 3. Comparisons showing ODConv performing best with mAP50 of 0.905.
Convolution Variant | Epochs | mAP50 | mAP50-95 | Precision | Recall
default (Conv) | 100 | 0.8466 | 0.5425 | 0.8157 | 0.7755
DSConv [36] | 100 | 0.8934 | 0.5916 | 0.87 | 0.824
DWConv [37] | 100 | 0.8688 | 0.5456 | 0.8571 | 0.7758
PConv [38] | 100 | 0.8936 | 0.5931 | 0.8673 | 0.8251
SCConv [39] | 100 | 0.8735 | 0.5533 | 0.8693 | 0.7839
ODConv | 100 | 0.905 | 0.6142 | 0.8895 | 0.8326
Table 4. Detailed Configuration & Function of Modules.
Module Type | Configuration & Function | Feature Map Size (Stride)
Conv & C2f | CSPDarknet Backbone. Standard feature extraction path (P1–P5). Note: SPPF is used at the end (Layer 9). | P3: (80 × 80) / P4: (40 × 40) / P5: (20 × 20)
DySample | Dynamic Upsampling. Replaces nearest interpolation. Upsamples P5 features (20 × 20 → 40 × 40) with point-sampling. | 40 × 40
BiFPN_Concat2 | Weighted Fusion. Fuses upsampled P5 with P4 backbone features using learnable weights. | 40 × 40
C2f | Feature processing after fusion. | 40 × 40
DySample | Dynamic Upsampling. Upsamples P4 features (40 × 40 → 80 × 80). | 80 × 80
BiFPN_Concat2 | Weighted Fusion. Fuses upsampled P4 with P3 backbone features. | 80 × 80
C2f_EMA | Small Object Refinement. Processes the high-resolution P3 feature map using EMA Attention to focus on tiny peel residues. | 80 × 80 (Stride 8)
ODConv2d | Dynamic Downsampling. Compresses P3 features (80 × 80 → 40 × 40) using Omni-Dimensional Dynamic Convolution for shape adaptability. | 40 × 40
BiFPN_Concat2 | Weighted Fusion. Fuses downsampled features with previous P4 features. | 40 × 40
C2f_EMA | Medium Object Refinement. Refines P4 features with EMA Attention. | 40 × 40 (Stride 16)
ODConv2d | Dynamic Downsampling. Compresses P4 features (40 × 40 → 20 × 20). | 20 × 20
BiFPN_Concat2 | Weighted Fusion. Fuses downsampled features with P5 features (from Layer 9). | 20 × 20
C2f_EMA | Large Object Refinement. Refines P5 features with EMA Attention. | 20 × 20 (Stride 32)
Detect | Decoupled Head. Performs final bounding box regression (using WIoU) and classification. | Output Layers
Total Parameters: 13.44 M. Total GFLOPs: 28.4 G.
Table 5. Parameters for General, Optimizer, Augmentation (Standard), Inference, and Strategy categories.
Category | Parameter | Value/Configuration
General | Input Resolution | 640 × 640
General | Batch Size | 32
General | Epochs | 200
General | Workers | 16
General | Cache Images | FALSE
Optimizer | Optimizer | AdamW
Optimizer | Initial Learning Rate (lr0) | 0.002
Optimizer | Momentum | 0.9
Optimizer | Weight Decay | 0.0005
Optimizer | Scheduler | Linear Warm-up (warmup_epochs = 3.0)
Augmentation (Standard) | HSV-H/S/V | 0.015/0.7/0.4
Augmentation (Standard) | Translate/Scale | 0.1/0.5
Augmentation (Standard) | Flip (Left-Right) | 0.5
Augmentation (Standard) | Mixup/Mosaic | 0.0/1.0
Augmentation (Standard) | Copy-Paste | 0
Inference | Conf | 0.25
Inference | IoU | 0.7
Strategy | Mixup/Mosaic | 0/0 (closed to align with real-world data distribution)
Strategy | Copy-Paste | 0.1 (retained for density maintenance)
Strategy | Conf | 0.3 (calibrated for sensitivity)
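For readers who wish to reproduce this configuration, the snippet below sketches how the Table 5 settings map onto an Ultralytics-style training call. The model definition mdb_yolo.yaml and dataset configuration tpid.yaml are placeholder names, and the call applies the Strategy values (mosaic and mixup disabled, copy-paste at 0.1) used for the final model.

```python
# Minimal sketch of launching training with the configuration listed in Table 5.
from ultralytics import YOLO

model = YOLO("mdb_yolo.yaml")          # hypothetical model definition
model.train(
    data="tpid.yaml",                  # hypothetical TPID dataset config
    imgsz=640, batch=32, epochs=200, workers=16, cache=False,
    optimizer="AdamW", lr0=0.002, momentum=0.9, weight_decay=0.0005,
    warmup_epochs=3.0,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,
    translate=0.1, scale=0.5, fliplr=0.5,
    mosaic=0.0, mixup=0.0, copy_paste=0.1,   # Strategy row: mosaic/mixup disabled
)
```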
Table 6. Results of ablation experiments.
Exp. ID | Backbone Head | Upsampling | Fusion Neck | Downsampling Conv | Loss Function | Post-Processing | P | R | mAP50 (%) | mAP50-95 (%) | Params (M) | FLOPs (G)
1 | C2f | Upsample | FPN + PAN | Conv | CIoU | NMS | 0.892 | 0.852 | 90.8 | 65.7 | 11.13 | 28.6
2 | C2f_EMA | Upsample | FPN + PAN | Conv | WIoU | NMS | 0.905 | 0.859 | 91.7 | 65.1 | 11.19 | 29.5
3 | C2f_EMA | Upsample | FPN + PAN | ODConv2d | CIoU | NMS | 0.883 | 0.856 | 91.0 | 65.0 | 13.39 | 27.7
4 | C2f | Upsample | FPN + PAN | Conv | CIoU | Soft-NMS | 0.901 | 0.846 | 90.5 | 68.6 | 11.14 | 28.6
5 | C2f_EMA | DySample | FPN + PAN | Conv | WIoU | NMS | 0.899 | 0.857 | 91.5 | 65.8 | 11.21 | 29.5
6 | C2f_EMA | DySample | FPN + PAN | ODConv2d | WIoU | NMS | 0.916 | 0.856 | 91.7 | 65.6 | 13.42 | 29.5
7 | C2f_EMA | Upsample | FPN + PAN | Conv | WIoU | Soft-NMS | 0.912 | 0.856 | 91.5 | 68.1 | 11.19 | 29.5
8 | C2f | Upsample | FPN + PAN | ODConv2d | CIoU | Soft-NMS | 0.878 | 0.858 | 90.3 | 68.3 | 13.39 | 27.7
9 | C2f_EMA | DySample | FPN + PAN | ODConv2d | WIoU | Soft-NMS | 0.882 | 0.863 | 90.7 | 67.4 | 13.42 | 29.5
10 | C2f | DySample | FPN + PAN | ODConv2d | WIoU | Soft-NMS | 0.883 | 0.866 | 91.0 | 68.3 | 13.45 | 28.6
11 | C2f_EMA | Upsample | FPN + BiFPN_Concat2 | ODConv2d | WIoU | Soft-NMS | 0.903 | 0.850 | 91.0 | 67.9 | 13.42 | 28.5
12 | C2f_EMA | Upsample | BiFPN_Concat2 | ODConv2d | WIoU | Soft-NMS | 0.890 | 0.862 | 91.2 | 67.9 | 13.42 | 28.5
13 | C2f | DySample | FPN + BiFPN_Concat2 | ODConv2d | WIoU | Soft-NMS | 0.900 | 0.855 | 91.0 | 68.5 | 13.45 | 28.6
14 | C2f_EMA | DySample | BiFPN_Concat2 | ODConv2d | WIoU | Soft-NMS | 0.908 | 0.859 | 90.8 | 68.8 | 13.45 | 28.6
15 | C2f_EMA | DySample | FPN + BiFPN_Concat2 | Conv | WIoU | NMS | 0.907 | 0.839 | 90.3 | 64.9 | 11.21 | 29.5
16 | C2f_EMA | DySample | BiFPN_Concat2 | Conv | CIoU | NMS | 0.900 | 0.842 | 91.1 | 65.4 | 11.21 | 29.5
17 | C2f_EMA | DySample | BiFPN_Concat2 | ODConv2d | CIoU | NMS | 0.894 | 0.851 | 90.9 | 65.2 | 13.45 | 28.6
18 | C2f_EMA | DySample | BiFPN_Concat2 | ODConv2d | WIoU | NMS | 0.899 | 0.855 | 91.2 | 65.2 | 13.45 | 28.6
19 | C2f_EMA | DySample | BiFPN_Concat2 | ODConv2d | CIoU | Soft-NMS | 0.880 | 0.861 | 90.6 | 68.6 | 13.45 | 28.6
20 | C2f_EMA | DySample | BiFPN_Concat2 | ODConv2d | WIoU | Soft-NMS | 0.909 | 0.880 | 92.1 | 69.7 | 13.44 | 28.4
Table 7. Logical Block A and B experiments showing parameters and results for mAP50, mAP50-95, Precision, and Recall.
Logical Block | Experiment Parameters (default: iou = 0.7; mixup = 0) | mAP50 | mAP50-95 | P | R
Block A | conf = 0.25; copy_paste = 0; mosaic = 1 | 0.907 | 0.682 | 0.899 | 0.858
Block A | conf = 0.25; copy_paste = 0; mosaic = 0 | 0.914 | 0.691 | 0.899 | 0.881
Block A | conf = 0.25; copy_paste = 0.1; mosaic = 1 | 0.908 | 0.684 | 0.905 | 0.853
Block A | conf = 0.25; copy_paste = 0.1; mosaic = 0 | 0.918 | 0.701 | 0.905 | 0.88
Block B | conf = 0.3; copy_paste = 0; mosaic = 1 | 0.906 | 0.682 | 0.891 | 0.858
Block B | conf = 0.3; copy_paste = 0; mosaic = 0 | 0.91 | 0.688 | 0.921 | 0.857
Block B | conf = 0.3; copy_paste = 0.1; mosaic = 1 | 0.907 | 0.679 | 0.893 | 0.861
Block B | conf = 0.3; copy_paste = 0.1; mosaic = 0 | 0.921 | 0.697 | 0.909 | 0.88
Table 8. Performance and Efficiency Comparison with Prominent Models on the TPID Test Set.
Model ID | Parameters (M) | GFLOPS | Inference Time (ms) | Precision (%) | Recall (%) | mAP50 (%) | mAP50-95 (%)
YOLOv5s | 9.11 | 23.8 | 4.1 | 88.9 | 84.1 | 90.8 | 64.9
YOLOv8s | 11.14 | 28.6 | 2.7 | 89.2 | 85.2 | 90.8 | 65.7
YOLOv9s | 7.17 | 26.7 | 1.4 | 88.2 | 84.8 | 90.5 | 64.4
YOLOv10s | 7.22 | 21.4 | 1.2 | 88.2 | 82.8 | 89.2 | 64.2
YOLOv11s | 9.41 | 21.3 | 1.2 | 88.6 | 84.1 | 90.6 | 65.7
YOLOv12s | 9.23 | 21.2 | 1.8 | 87.8 | 84.0 | 90.3 | 64.1
YOLOv13s | 9.53 | 21.3 | 3.2 | 89.3 | 85.9 | 91.2 | 65.1
RT-DETR-L | 32.87 | 108 | 3.3 | 90.8 | 86.4 | 92.1 | 67.6
MDB-YOLO | 13.44 | 28.4 | 1.1 | 90.9 | 88.0 | 92.1 | 69.7
Table 9. Performance Comparison: MDB-YOLO vs. Latest SOTA Models [49,50,51].
Model ID | Parameters (M) | GFLOPS | mAP50 (%) | mAP50-95 (%) | mAR50 (%) | mAR50-95 (%)
RT-DETRv4-S | 10.00 | 25.0 | 90.8 | 65.5 | 96.1 | 72.2
D-FINE-S | 10.00 | 25.0 | 92.0 | 67.8 | 96.7 | 74.6
MDB-YOLO (in standard pycocotools library) | 13.44 | 28.4 | 85.3 | 61.3 | 87.7 | 66.7
MDB-YOLO (in YOLO internal validation tool) | 13.44 | 28.4 | 92.1 | 69.7 | N/A | N/A
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
