A Multi-Task Detection Approach with Multi-Scale Attention Aggregation and Feature Enhancement

Wu, Xibao; Yang, Kexin; Zhao, Wei; Wang, Yiqun; Chen, Wenbai; Zhao, Chunjiang

doi:10.3390/agronomy16040419

by Xibao Wu ¹, Kexin Yang ¹, Wei Zhao ¹, Yiqun Wang ¹, Wenbai Chen ¹,* and Chunjiang Zhao ²

¹ School of Automation, Xiaoying Campus, Beijing Information Science and Technology University, Beijing 100192, China
² National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
* Author to whom correspondence should be addressed.

Agronomy 2026, 16(4), 419; https://doi.org/10.3390/agronomy16040419

Submission received: 14 January 2026 / Revised: 2 February 2026 / Accepted: 4 February 2026 / Published: 9 February 2026

(This article belongs to the Section Precision and Digital Agriculture)

Abstract

This study presents YOLOv8-MMD, a multi-task detection framework for intelligent white radish harvesting systems that performs species recognition and quality evaluation simultaneously. The architecture builds on a dual-branch detection system (YOLOv8-Dual) with a shared Backbone network and adds two components: the Multi-Scale Attention Aggregation (MSAA) module, which combines channel-wise and spatial attention to refine feature representation, and the Multi-scale Attention Feature Enhancement (MAFE) module, which fuses information across hierarchical levels of the network. In experiments, YOLOv8-MMD achieves a species detection precision of 0.945 and a quality assessment precision of 0.812, improvements of 1.4% and 4%, respectively, over the baseline YOLOv8-Dual model. Under the mAP@50 metric, the model reaches 0.949 for species identification and 0.859 for quality classification, with recall rates of 0.924 and 0.836 for the respective tasks. The system remains robust under challenging field conditions, including varying lighting intensities, different growth stages, and partial occlusion. The model processes 112 frames per second at 8.1 GFLOPs, meeting real-time requirements for agricultural robotic applications. Comparative studies with existing methods further indicate that the proposed approach offers a better balance between detection accuracy and computational efficiency. The integration of multi-scale attention and hierarchical feature enhancement provides a practical solution for automated harvesting in complex, unstructured environments, with potential for deployment in precision agriculture systems.

Keywords:

white radish intelligent harvesting; multi-task detection; multi-scale attention aggregation; multi-scale feature enhancement; YOLOv8

1. Introduction

1.1. Background and Motivation

As a crucial root vegetable in China, white radish holds significant importance in both national dietary patterns and agricultural economies. Its high yield, storability, and broad adaptability make it an indispensable resource in regional agricultural markets. Traditional white radish harvesting predominantly relies on manual labor, which is time-consuming, labor-intensive, and inefficient. This approach not only increases labor costs but also hinders large-scale stable supply [1]. To address these challenges, mechanized harvesting equipment has gradually been introduced to improve efficiency and reduce labor demands [2]. However, existing mechanized harvesting technologies still face critical limitations in practical applications. These include inadequate robustness to the large intra-class variation in radish size and shape, frequent occlusion by dense foliage, significant lighting fluctuations in open fields, and the difficulty in precisely grading quality based on subtle, often obscured, morphological and textural features. These issues undermine subsequent grading processes and market distribution planning.

The rapid advancement of artificial intelligence (AI) and deep learning technologies offers promising solutions by integrating advanced visual detection and classification algorithms into mechanized harvesting workflows. Such integration could enable precise identification and separation of white radishes from their foliage, as well as intelligent quality grading to meet diverse market demands. Nevertheless, practical implementation faces two core challenges: (1) achieving real-time processing while maintaining high accuracy, and (2) ensuring robustness in complex agricultural environments. Traditional detection methods often handle target detection and classification as isolated tasks, lacking mechanisms for multi-task collaboration and feature sharing, which limits their effectiveness in fine-grained multi-category recognition.

To overcome these limitations—specifically, the scale variation, occlusion, and fine-grained perception challenges outlined above—this study proposes a multi-task detection method based on Multi-Scale Attention Aggregation and Feature Enhancement. First, a quality assessment branch is integrated into the original YOLOv8 architecture, forming a dual-branch framework with a shared Backbone network to simultaneously perform species detection and quality classification. Second, a Multi-Scale Attention Aggregation (MSAA) module is introduced to enhance multi-scale feature perception. This module adaptively aggregates critical features across different scales through attention mechanisms, significantly improving detection accuracy for white radishes of varying sizes and morphologies. Furthermore, given that white radish quality grading is a fine-grained task reliant on subtle morphological and textural features often obscured by foliage and soil, a Multi-scale Attention Feature Enhancement (MAFE) module is designed to strengthen feature representation and facilitate cross-task interaction. By employing an adaptive feature fusion strategy, MAFE maintains task-specific feature independence while enabling complementary information exchange between species detection and quality assessment. This module effectively integrates hierarchical features, enhancing the model’s ability to recognize quality-related attributes (e.g., morphology, color, texture) and improving species detection precision. The proposed method not only optimizes feature extraction and representation but also enhances robustness in complex agricultural environments, providing a reliable technical foundation for intelligent harvesting systems.

The core contributions of this work are:

(1) A multi-task detection framework that unifies species detection and quality assessment within a dual-branch YOLOv8 architecture, addressing the integrated perception needs of harvesting robots.
(2) A novel Multi-Scale Attention Aggregation (MSAA) module that combines channel and spatial attention across multiple receptive fields to robustly handle large-scale variations of radishes and foliage in field images.
(3) A feature enhancement scheme via the MAFE module, which integrates Shape Attention and texture modeling to strengthen representations against field challenges like occlusion and lighting variation, enabling reliable fine-grained quality grading.
(4) Comprehensive validation demonstrating high-performance detection and classification on a white radish dataset, providing direct visual input for automated harvest-and-sort decisions.

This work presents a novel, integrated solution tailored for underground root crop harvesting. The synergistic design of the dual-branch framework, MSAA, and MAFE modules specifically addresses the distinct complexities of this domain, such as partial occlusion and morphology-based quality evaluation.

1.2. Related Work

Under the research context of intelligent agriculture and precision harvesting, the continuous advancements in object detection algorithms, multi-task learning strategies, and the YOLOv8 framework have provided the technical foundation and theoretical support for this study. The following sections review relevant studies from two perspectives: the evolution of object detection and research related to multi-task learning.

1.2.1. Development of Object Detection

Object detection, a core problem in computer vision, has evolved from traditional methods to deep learning-based approaches. Early researchers primarily relied on handcrafted features and sliding window mechanisms to identify object locations and categories. These methods were often limited by their feature representation capabilities, resulting in suboptimal detection performance and generalization [3]. With the rise of Convolutional Neural Networks (CNNs), the R-CNN series (e.g., R-CNN [4], Fast R-CNN [5], Faster R-CNN [6]) significantly improved detection accuracy and speed by introducing region proposal networks and efficient feature extraction modules. However, these two-stage detection frameworks typically suffered from complex architectures and insufficient real-time performance. To address these limitations, single-stage detectors such as YOLO (You Only Look Once) [7] and SSD (Single Shot MultiBox Detector) [8] emerged, dramatically enhancing real-time processing and usability. Subsequently, the YOLO family of algorithms has undergone continuous iterations—from YOLOv2 [9], YOLOv3 [10], YOLOv4 [11], YOLOv5 [12], and YOLOv8 to YOLOv11 [13]—achieving significant improvements in detection accuracy, speed, lightweight design, and adaptability. These advancements have laid a robust technical foundation for rapid and accurate detection of white radishes and their foliage in agricultural scenarios.

1.2.2. Research on Multi-Task Learning

Multi-task learning (MTL) aims to enhance model generalization and data utilization efficiency by simultaneously optimizing multiple related tasks, thereby exploiting shared information and feature representations across tasks [14,15]. In computer vision, MTL has been widely applied to subtasks such as object detection, semantic segmentation, instance segmentation, keypoint detection, and image classification. Several multi-task frameworks—including Cross-Stitch Networks [16] and UberNet [17]—have enabled efficient joint learning of diverse visual tasks. For agricultural applications, Du et al. proposed a multi-task CNN-based vision system for tomato-picking robots, demonstrating that running multiple tasks on a shared Backbone network does not compromise performance [18]. Tham et al. utilized MTL for joint disaster classification and victim detection, highlighting its flexibility and advantages in emergency scenarios [19]. In autonomous driving, Guo et al. introduced YOLO-ODL, a hard parameter-sharing multi-task model for joint detection of traffic objects, drivable areas, and lane lines, achieving high efficiency [20]. Chen et al. developed MTD-YOLO, a multi-task deep CNN for maturity detection of cherry tomato clusters, improving production efficiency and generalization [21]. Wang et al. further advanced this field with A-YOLOM, an adaptive, real-time, lightweight multi-task model for object detection, drivable area segmentation, and lane line segmentation [22].

These studies provide critical references for extending YOLOv8 to multi-task detection and classification in this work. In the context of white radish harvesting, the chosen tasks of species detection and quality assessment are intrinsically complementary. The species detection task (locating radishes and distinguishing them from tassels) learns robust, general-purpose features about object presence, shape, and boundary. These features provide a precise spatial prior and foundational visual understanding for the subsequent quality assessment branch. Conversely, the quality assessment task, which requires fine-grained discrimination based on morphology, color uniformity, and surface texture, encourages the shared Backbone to extract richer and more discriminative features. This mutual reinforcement—where detection aids precise localization for grading, and grading pushes the model to learn subtler visual cues—creates a synergistic effect that is difficult to achieve with separately trained models. Nevertheless, the unique challenges of underground/root crop harvesting—such as occlusion, similarity to soil, and the need for precise quality evaluation based on morphology—demand a more tailored feature-sharing and enhancement strategy. This justifies our creation of a new dual-branch system with a shared Backbone, specifically enhanced by the MSAA and MAFE modules to address these gaps.

2. Materials and Methods

2.1. YOLOv8 Base Model

The YOLOv8 model is an advanced object detection framework comprising three core components, the Backbone, Neck, and Head networks, as illustrated in Figure 1.

The Backbone employs an enhanced version of CSPNet (Cross-Stage Partial Network), composed of multiple convolutional (Conv) layers and C2f modules. It progressively downsamples features while reducing redundant computations through cross-stage feature fusion, preserving rich semantic information. The C2f module incorporates shortcut connections and balances computational efficiency with feature representation by adjusting flexible parameters (e.g., n and d). At the end of the Backbone, the Spatial Pyramid Pooling-Fast (SPPF) module enhances the global receptive field via multi-scale MaxPool2d operations, improving localization accuracy and feature representation.

The Neck integrates and enhances features extracted by the Backbone to address multi-scale detection challenges. By combining a Feature Pyramid Network (FPN) and a Path Aggregation Network (PANet), it merges feature maps of varying resolutions to detect both small and large targets. Upsampling and concatenation operations fuse low-level and high-level features, while additional C2f modules optimize feature representation, ensuring efficient and intact feature propagation.

The Head generates final predictions, including object categories, bounding boxes (bboxes), and confidence scores. The Detect module performs detection on multi-scale feature maps (P3, P4, P5), where each scale specializes in detecting targets of specific sizes. Convolutional layers process these feature maps to produce detection outputs. A center-based detection strategy is adopted to streamline training, reduce model complexity, and achieve superior real-time performance.

2.2. Feature Sharing Mechanism

Feature sharing is a key technique that enhances network performance and reduces parameter count. In the dual-branch detection model proposed in this paper, the feature sharing mechanism permeates the design of the entire network architecture. Specifically, efficient multi-task learning is achieved between species detection and quality assessment tasks by sharing low-level features.

The dual-branch detection model proposed in this paper is modified from the base architecture of YOLOv8 and is named YOLOv8-Dual. Its feature extraction backbone network adopts a shared mode, while the detection heads are independently implemented to perform species detection and quality assessment tasks separately. The network structure of the entire feature sharing mechanism is shown in Figure 2.

The input image is denoted as $X$, with dimensions $h \times w \times c_{1}$. The shared features $F_{shared}$ are extracted through the Backbone network, as formalized in Equation (1). The feature extraction module outputs $F_{shared} \in \mathbb{R}^{h' \times w' \times c_{2}}$ for both tasks. Subsequently, $F_{shared}$ is fed into the species detection head and the quality detection head to obtain the respective outputs: $Y_{species}$ (species detection branch) and $Y_{quality}$ (quality detection branch). Both branches compute their final outputs from the same shared Backbone features, which lets the model exploit latent correlations between the two tasks to improve overall performance.

$$
\begin{aligned}
F_{shared} &= \mathrm{Backbone}(X) \\
Y_{species} &= \mathrm{SpeciesHead}(F_{shared}) \\
Y_{quality} &= \mathrm{QualityHead}(F_{shared})
\end{aligned}
\tag{1}
$$
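The data flow of Equation (1) can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration: the toy `backbone` (strided subsampling followed by a 1 × 1 projection) stands in for the real YOLOv8 Backbone, the heads are single 1 × 1 projections, and the tensor sizes are arbitrary. The only point shown is that the shared features are computed once and consumed by both heads.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, weight):
    """1x1 convolution on an (H, W, C_in) tensor with a (C_in, C_out) weight."""
    return x @ weight

def backbone(x, weight, stride=8):
    """Toy stand-in for the shared Backbone: subsample, project, apply ReLU."""
    return np.maximum(conv1x1(x[::stride, ::stride, :], weight), 0.0)

# Hypothetical sizes: h = w = 64, c1 = 3 input channels, c2 = 16 shared channels.
h, w, c1, c2 = 64, 64, 3, 16
X = rng.standard_normal((h, w, c1))

W_backbone = rng.standard_normal((c1, c2))
W_species = rng.standard_normal((c2, 2))   # two species classes
W_quality = rng.standard_normal((c2, 3))   # three quality grades

F_shared = backbone(X, W_backbone)          # computed once
Y_species = conv1x1(F_shared, W_species)    # species head reads F_shared
Y_quality = conv1x1(F_shared, W_quality)    # quality head reads the same F_shared

print(F_shared.shape, Y_species.shape, Y_quality.shape)
```

Because both heads read the same `F_shared`, the cost of the Backbone is paid once per image regardless of the number of task heads.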

2.3. Multi-Task Detection Module

The multi-task detection module consists of three key components: Backbone Outputs, detection heads, and outputs. The module structure is shown in Figure 3. Backbone Outputs are the multi-scale feature maps generated by the Backbone network: P3, P4, and P5. P3 is the highest-resolution feature map and is used for detecting small targets, P4 is a medium-resolution feature map for medium-sized targets, and P5 is the lowest-resolution feature map, used for detecting large targets. Each feature map is simultaneously fed into two independent detection head modules (the species detection head and the quality detection head).
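The P3, P4, and P5 maps differ in spatial resolution. As a rough illustration, the sketch below computes their grid sizes, assuming a 640 × 640 input and the strides of 8, 16, and 32 that YOLOv8 conventionally uses for these levels. Neither the input size nor the strides are stated in the text, so both are assumptions.

```python
def pyramid_shapes(height, width, strides=(8, 16, 32)):
    """Return {level name: (rows, cols)} for the P3, P4, and P5 feature maps."""
    names = ("P3", "P4", "P5")
    return {name: (height // s, width // s) for name, s in zip(names, strides)}

# Assumed input resolution; the paper does not state it.
shapes = pyramid_shapes(640, 640)
for name, (rows, cols) in shapes.items():
    print(f"{name}: {rows} x {cols} grid")
```

Each of these three grids is passed to both detection heads, so the model produces six prediction maps per image in total.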

The detection heads comprise two task branches: the species detection head and the quality detection head. Each branch processes the input feature maps through three convolutional layers (3 Conv Layers) to extract task-relevant features. The species detection head extracts features for species classification and bounding box regression: it adjusts the dimensions and content of the input feature maps and outputs features for the subsequent classification and regression modules. Its output feature map is given in Equation (2).

$$
F_{species, i} = \mathrm{Conv}_{3}(P_{i})
\tag{2}
$$

where $F_{species, i}$ represents the output feature map of the species detection head, $\mathrm{Conv}_{3}$ denotes the three-layer convolutional network, and $P_{i}$ represents the output of the $i$-th layer of the Backbone.

The quality detection head likewise consists of three convolutional layers and extracts features for quality classification and bounding box regression. It adjusts the dimensions and content of the input feature maps and outputs features for the subsequent classification and regression modules. Its output feature map is given in Equation (3).

$$
F_{quality, i} = \mathrm{Conv}_{3}(P_{i})
\tag{3}
$$

where $F_{quality, i}$ represents the output feature map of the quality detection head, $\mathrm{Conv}_{3}$ denotes the three-layer convolutional network, and $P_{i}$ represents the output of the $i$-th layer of the Backbone.

The outputs are similarly divided into two parts: Species Outputs and Quality Outputs. The Species Outputs include Cls (Species) and Bbox (Species), where Cls (Species) generates species classification results, and Bbox (Species) generates species bounding box regression results, as specified in Equation (4).

$$
\begin{aligned}
\mathrm{Cls}_{species, i} &= \mathrm{Softmax}(\mathrm{Conv}_{cls}(F_{species, i})) \\
\mathrm{Bbox}_{species, i} &= \mathrm{Conv}_{bbox}(F_{species, i})
\end{aligned}
\tag{4}
$$

where $\mathrm{Cls}_{species, i}$ denotes the classification output of the species detection task on feature layer $i$; $\mathrm{Softmax}$ represents the normalization operation of the classification function, ensuring that the sum of probabilities for all categories equals 1; and $\mathrm{Conv}_{cls}$ indicates a convolutional operation for generating feature channels required for classification. For the regression part, $\mathrm{Bbox}_{species, i}$ denotes the bounding box regression output of the species detection task on feature layer $i$, and $\mathrm{Conv}_{bbox}$ indicates a convolutional operation for generating regression parameters of the bounding boxes.

Quality Outputs have the same structure, including Cls (Quality) and Bbox (Quality). Cls (Quality) generates quality classification results, and Bbox (Quality) generates quality bounding box regression results, as specified in Equation (5).

$$
\begin{aligned}
\mathrm{Cls}_{quality, i} &= \mathrm{Softmax}(\mathrm{Conv}_{cls}(F_{quality, i})) \\
\mathrm{Bbox}_{quality, i} &= \mathrm{Conv}_{bbox}(F_{quality, i})
\end{aligned}
\tag{5}
$$

where $\mathrm{Cls}_{quality, i}$ represents the classification output of the quality detection task on feature layer $i$, $\mathrm{Softmax}$ normalizes the classification scores to generate probability values for each category, and $\mathrm{Conv}_{cls}$ denotes a convolutional operation for generating feature channels required for classification. For the regression part, $\mathrm{Bbox}_{quality, i}$ represents the bounding box regression output of the quality detection task on feature layer $i$, and $\mathrm{Conv}_{bbox}$ denotes a convolutional operation for generating regression parameters of the bounding boxes.
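To make the classification half of Equations (4) and (5) concrete, the following sketch applies Softmax to the class scores of a single grid cell for each task. The raw scores are invented for illustration, and the class names are taken from the dataset description (two species classes, three quality grades). This is a sketch of the normalization step only, not of the authors' detection head.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

species_names = ["white radish", "white radish tassels"]
quality_names = ["good", "middle", "bad"]

# Hypothetical outputs of Conv_cls for one grid cell in each branch.
species_probs = softmax([2.0, 0.5])
quality_probs = softmax([0.3, 1.2, -0.4])

# The predicted class is the one with the highest probability.
species_label = species_names[species_probs.index(max(species_probs))]
quality_label = quality_names[quality_probs.index(max(quality_probs))]
print(species_label, quality_label)
```

The two branches are normalized independently, so the species probabilities and the quality probabilities each sum to 1 on their own.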

2.4. MSAA Module

The Multi-Scale Attention Aggregation (MSAA) module is a feature refinement module designed to effectively enhance feature representation capabilities. By integrating multi-scale feature extraction and attention mechanisms, it refines features across different dimensions to adapt to the complexity and diversity of images. The module comprises three main components, Multi-Scale Fusion (MSF), Spatial Aggregation (SA), and Channel Aggregation (CA), with the specific structure shown in Figure 4.

The MSF (Multi-Scale Fusion) extracts multi-scale receptive fields from input features using parallel convolutional kernels of different sizes (3 × 3, 5 × 5, 7 × 7). These sizes were selected to capture fine-grained textures, mid-level structures, and broader contextual information, corresponding to the scale variation of radishes and their foliage in the field of view. This approach enables the module to better capture multi-scale contextual information and enhance the network’s global perception capability, as detailed in Equation (6), where a 1 × 1 convolution first reduces channel dimensionality for efficiency, followed by parallel convolutions to extract multi-scale features, which are then summed for fusion.

$$
\begin{aligned}
F_{reduced} &= \mathrm{Conv}_{1 \times 1}(F_{input}) \\
F_{3 \times 3} &= \mathrm{Conv}_{3 \times 3}(F_{reduced}) \\
F_{5 \times 5} &= \mathrm{Conv}_{5 \times 5}(F_{reduced}) \\
F_{7 \times 7} &= \mathrm{Conv}_{7 \times 7}(F_{reduced}) \\
F_{fusion} &= F_{3 \times 3} + F_{5 \times 5} + F_{7 \times 7}
\end{aligned}
\tag{6}
$$

In this context, $F_{reduced}$ is the dimensionality-reduced feature tensor; $F_{input}$ is the input feature tensor; $\mathrm{Conv}_{1 \times 1}$ represents a 1 × 1 convolution; $F_{3 \times 3}$, $F_{5 \times 5}$, and $F_{7 \times 7}$ are features computed using 3 × 3, 5 × 5, and 7 × 7 convolutional kernels, respectively; and $F_{fusion}$ is the feature tensor after multi-scale fusion.
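Equation (6) can be traced step by step with a small NumPy sketch. It assumes stride-1 convolutions with zero padding (so all three branches keep the same spatial size and can be summed), random weights, and arbitrary channel counts. None of these details are given in the text, and the helper `conv2d_same` is a plain reference implementation rather than the authors' code.

```python
import numpy as np

def conv2d_same(x, weight):
    """Stride-1 zero-padded convolution. x: (H, W, C_in); weight: (k, k, C_in, C_out), k odd."""
    k = weight.shape[0]
    pad = k // 2
    h, w, _ = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, weight.shape[3]))
    for dy in range(k):
        for dx in range(k):
            # Each kernel tap contributes a shifted copy of the input, mixed across channels.
            out += xp[dy:dy + h, dx:dx + w, :] @ weight[dy, dx]
    return out

def multi_scale_fusion(f_input, w_reduce, w3, w5, w7):
    """Equation (6): 1x1 reduction, three parallel convolutions, element-wise sum."""
    f_reduced = conv2d_same(f_input, w_reduce)
    branches = [conv2d_same(f_reduced, w) for w in (w3, w5, w7)]
    return branches[0] + branches[1] + branches[2]

rng = np.random.default_rng(0)
c_in, c_red = 8, 4                                   # hypothetical channel counts
F_input = rng.standard_normal((16, 16, c_in))
w_reduce = rng.standard_normal((1, 1, c_in, c_red))
w3, w5, w7 = (rng.standard_normal((k, k, c_red, c_red)) for k in (3, 5, 7))

F_fusion = multi_scale_fusion(F_input, w_reduce, w3, w5, w7)
print(F_fusion.shape)
```

Summation, as opposed to concatenation, keeps the channel count of `F_fusion` equal to that of `F_reduced`, which is why the three branches must produce tensors of identical shape.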

The SA (Spatial Aggregation) further refines feature representation by focusing on spatial dimension modeling. This module first models global spatial information from the multi-scale fused features, extracting spatial context through a global pooling operation that reduces the feature map $F_{fusion} \in \mathbb{R}^{H \times W \times C}$ to a channel descriptor vector $F_{pool} \in \mathbb{R}^{1 \times 1 \times C}$. It then employs a large convolutional kernel to capture spatial correlation features. Finally, a Sigmoid activation function generates spatial attention weights to spatially weight the original features. This mechanism enables the model to concentrate on more meaningful spatial regions in the feature map, thereby suppressing redundant information, as formalized in Equation (7).

$$
\begin{aligned}
F_{pool} &= \mathrm{Pool}(F_{fusion}) \\
F_{spatial-att} &= \sigma(\mathrm{Conv}_{7 \times 7}(F_{pool})) \\
F_{spatial-out} &= F_{fusion} \odot F_{spatial-att}
\end{aligned}
\tag{7}
$$

where $F_{pool}$ is the pooled feature tensor, where $\mathrm{Pool}$ denotes the global pooling operation; $F_{spatial-att}$ is the generated spatial attention weights; $\sigma$ represents the Sigmoid function; $\mathrm{Conv}_{7 \times 7}$ denotes a 7 × 7 convolutional operation; and $F_{spatial-out}$ is the spatially weighted feature tensor. The spatial attention weights $F_{spatial-att}$ form a 2D probability map that highlights regions of importance within the feature map.
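A possible reading of Equation (7) is sketched below in NumPy. The text gives the pooled tensor a 1 × 1 × C shape but also describes the attention weights as a 2D map produced by a 7 × 7 convolution. The sketch follows the 2D-map reading by averaging over the channel axis, which is an assumption and may differ from the authors' implementation. Kernel weights and tensor sizes are likewise hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_aggregation(f_fusion, kernel):
    """f_fusion: (H, W, C); kernel: (k, k) weights applied to the pooled map, k odd."""
    h, w, _ = f_fusion.shape
    f_pool = f_fusion.mean(axis=2)            # (H, W): assumed channel-wise average pooling
    k = kernel.shape[0]
    padded = np.pad(f_pool, k // 2)           # zero padding keeps the output at H x W
    response = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            response += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    attention = sigmoid(response)             # (H, W), every value in (0, 1)
    # Broadcast the 2D map over all channels of the fused features.
    return f_fusion * attention[:, :, None], attention

rng = np.random.default_rng(0)
F_fusion = rng.standard_normal((16, 16, 4))   # hypothetical fused features
kernel = 0.1 * rng.standard_normal((7, 7))    # hypothetical 7 x 7 weights

F_spatial_out, F_spatial_att = spatial_aggregation(F_fusion, kernel)
print(F_spatial_out.shape, F_spatial_att.shape)
```

Under this reading every channel at a given pixel is scaled by the same weight, so the module changes where the network looks without changing the relative importance of channels.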

The CA (Channel Aggregation) learns inter-channel dependencies, further enhancing the discriminative power of the feature representation through attention modeling in the channel dimension. The CA module first applies global average pooling across the spatial dimensions of the input features to produce a channel-wise statistic vector $F_{channel-pool} \in \mathbb{R}^{1 \times 1 \times C}$. These statistics pass through a non-linear transformation (two 1 × 1 convolutions separated by a ReLU, which act as fully connected layers on the pooled vector) to generate channel attention weights. The weights rescale the original features along the channel dimension, highlighting critical channels and suppressing less important ones, as formalized in Equation (8). The CA module thereby captures correlations between different channel features of the input images, enabling the network to make full use of channel information.

$$
\begin{aligned}
F_{channel-pool} &= \mathrm{AvgPool}(F_{input}) \\
F_{channel-att} &= \mathrm{Conv}_{1 \times 1}(\mathrm{ReLU}(\mathrm{Conv}_{1 \times 1}(F_{channel-pool}))) \\
F_{channel-out} &= F_{input} \odot F_{channel-att}
\end{aligned}
\tag{8}
$$

where $F_{channel-pool}$ is the feature tensor after global average pooling, where $\mathrm{AvgPool}$ denotes the average pooling operation; $F_{channel-att}$ is the generated channel attention weights; $\mathrm{Conv}_{1 \times 1}$ denotes a 1 × 1 convolutional operation; $\mathrm{ReLU}$ represents the activation function; $F_{channel-out}$ is the channel-weighted feature tensor; and $\odot$ denotes the element-wise multiplication operation along the channels.
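Equation (8) reduces to a few array operations, sketched here in NumPy. On a pooled 1 × 1 × C tensor a 1 × 1 convolution is a matrix product over the channel axis, which is how it is written below. The hidden width `c_mid`, the random weights, and the tensor sizes are assumptions. The sketch follows Equation (8) literally, so no squashing function is applied to the attention weights.

```python
import numpy as np

def channel_aggregation(f_input, w1, w2):
    """f_input: (H, W, C); w1: (C, C_mid); w2: (C_mid, C)."""
    f_pool = f_input.mean(axis=(0, 1))            # (C,): global average pooling
    f_att = np.maximum(f_pool @ w1, 0.0) @ w2     # 1x1 conv, ReLU, 1x1 conv
    return f_input * f_att, f_att                 # weights broadcast over H and W

rng = np.random.default_rng(0)
c, c_mid = 8, 2                                   # hypothetical channel counts
F_input = rng.standard_normal((16, 16, c))
w1 = rng.standard_normal((c, c_mid))
w2 = rng.standard_normal((c_mid, c))

F_channel_out, F_channel_att = channel_aggregation(F_input, w1, w2)
print(F_channel_out.shape, F_channel_att.shape)
```

Each channel receives one scalar weight that is shared by all spatial positions, which is the complement of the spatial weighting in Equation (7).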

The proposed MSAA module differs from classical attention modules like Squeeze-and-Excitation Networks (SENets) and the Convolutional Block Attention Module (CBAM) in its integrated multi-scale strategy. While SENet focuses solely on channel-wise recalibration and CBAM sequentially processes channel and then spatial attention, our MSAA first performs Multi-Scale Fusion (MSF) to capture contextual features at multiple receptive fields before applying dedicated spatial (SA) and channel (CA) aggregation pathways. This design ensures that the attention weights are generated based on features that already encode multi-scale context, making it particularly effective for agricultural objects like white radishes that exhibit large size variations within a single image.

2.5. MAFE Module

The Multi-scale Attention Feature Enhancement (MAFE) module is a network module designed for efficient refinement and enhancement of input features, particularly suited for feature modeling tasks in complex scenarios. This module is strategically integrated into the Neck section of the YOLOv8-Dual framework (as detailed in Section 2.6) to process multi-scale features from the Backbone. Its structure is shown in Figure 5. This module integrates Shape Attention, Texture Branch, and Multi-scale Convolution to achieve multi-dimensional and multi-level refinement of input features, thereby enhancing the model’s receptive field and contextual information aggregation capability, and focusing on critical regions.

The Shape Attention Branch is primarily responsible for capturing the geometric and structural information of the input features. The core idea is to highlight important regions related to the shape of the input features by generating Shape Attention weights. The specific formulation is shown in Equation (9).

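$$
\begin{aligned}
F_{shape-att} &= \sigma(\mathrm{Conv}_{1 \times 1}(\mathrm{Conv}_{3 \times 3}(\mathrm{Conv}_{1 \times 1}(F_{input})))) \\
F_{shape-feat} &= F_{input} \odot F_{shape-att}
\end{aligned}
\tag{9}
$$

where $F_{input}$ denotes the input features, $F_{shape-att}$ denotes the Shape Attention weights, $F_{shape-feat}$ denotes the shape-weighted output features, $\sigma$ represents the Sigmoid function, and $\odot$ indicates element-wise multiplication.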

The Texture Feature Branch is primarily designed to capture texture details in the input features. These details are typically closely related to local patterns in the features. By incorporating depthwise separable convolution and non-linear activations, this branch reduces computational complexity while effectively extracting texture features. The specific formulation is detailed in Equation (10).

$$
F_{texture} = \mathrm{Conv}_{1 \times 1}(\mathrm{ReLU}(\mathrm{BatchNorm}(\mathrm{DepthConv}_{3 \times 3}(F_{input}))))
\tag{10}
$$

where $\mathrm{ReLU}$ is the activation function, $F_{texture}$ denotes the texture features, and $\mathrm{BatchNorm}$ represents the normalization operation.
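The saving from the depthwise separable structure in Equation (10) can be checked with simple parameter arithmetic. The sketch compares a standard k × k convolution against a depthwise k × k convolution followed by a 1 × 1 pointwise convolution. The channel count of 64 is a hypothetical example, and bias and BatchNorm parameters are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

k, c = 3, 64                      # hypothetical kernel size and channel count
standard = standard_conv_params(k, c, c)
separable = depthwise_separable_params(k, c, c)
print(standard, separable, round(standard / separable, 2))
```

For this example the separable form needs 4672 weights instead of 36864, roughly an eightfold reduction.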

The Multi-scale Convolution Branch aims to capture multi-scale contextual information from input features using convolutional kernels of varying sizes (3 × 3, 5 × 5, 7 × 7). This selection follows the same rationale as in the MSAA module, ensuring comprehensive coverage of feature scales from fine details to global context, which is critical for distinguishing radishes from complex backgrounds. The specific structure is detailed in Equation (11).

$$
F_{scale-k} = \mathrm{DepthConv}_{k \times k}(F_{input}), \quad k \in \{3, 5, 7\}
\tag{11}
$$

where $F_{scale-3}$, $F_{scale-5}$, and $F_{scale-7}$ represent the output features of convolutional kernels with different scales.

The Feature Fusion Module is responsible for the unified integration of features from different branches and generates enhanced output features. The fusion process is achieved through channel concatenation and 1 × 1 convolution. This implementation process is detailed in Equation (12).

$$
\begin{aligned}
F_{concat} &= \mathrm{Concat}([F_{shape-feat}, F_{texture}, F_{scale-3}, F_{scale-5}, F_{scale-7}], \mathrm{dim} = 1) \\
F_{enhanced} &= \mathrm{ReLU}(\mathrm{BatchNorm}(\mathrm{Conv}_{1 \times 1}(F_{concat})))
\end{aligned}
\tag{12}
$$

where $F_{concat}$ is the concatenated feature tensor, formed by channel-wise concatenation of the features from the different branches; $F_{shape-feat}$ represents features from the Shape Attention Branch; $F_{texture}$ denotes features from the Texture Branch; $F_{scale-3}$, $F_{scale-5}$, and $F_{scale-7}$ are multi-scale contextual features extracted by the 3 × 3, 5 × 5, and 7 × 7 convolutional kernels from the Multi-scale Branch; and $F_{enhanced}$ is the enhanced feature tensor after fusion.
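Equations (9) to (12) can be chained into one forward pass, sketched below in NumPy to show how the five branch outputs are concatenated and fused. Several details are assumptions: the tensors use an (H, W, C) layout, so the channel axis is 2 and not the `dim = 1` of Equation (12); every branch keeps C channels; the fusion maps 5C channels back to C; and `batch_norm` normalizes over spatial positions without learned parameters. All weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)

def shifted_windows(x, k):
    """Yield (dy, dx, window) over a zero-padded (H, W, C) tensor for a k x k kernel."""
    pad = k // 2
    h, w, _ = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    for dy in range(k):
        for dx in range(k):
            yield dy, dx, xp[dy:dy + h, dx:dx + w, :]

def conv(x, weight):
    """Standard convolution, weight: (k, k, C_in, C_out)."""
    return sum(win @ weight[dy, dx] for dy, dx, win in shifted_windows(x, weight.shape[0]))

def depth_conv(x, weight):
    """Depthwise convolution, weight: (k, k, C), one filter per channel."""
    return sum(win * weight[dy, dx] for dy, dx, win in shifted_windows(x, weight.shape[0]))

def batch_norm(x, eps=1e-5):
    """Per-channel normalization over spatial positions (no learned scale or shift)."""
    return (x - x.mean(axis=(0, 1))) / np.sqrt(x.var(axis=(0, 1)) + eps)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mafe(f_input, p):
    # Equation (9): Shape Attention weights, then shape-weighted features.
    f_shape_att = sigmoid(conv(conv(conv(f_input, p["s1"]), p["s2"]), p["s3"]))
    f_shape_feat = f_input * f_shape_att
    # Equation (10): depthwise 3x3, BatchNorm, ReLU, pointwise 1x1.
    f_texture = conv(relu(batch_norm(depth_conv(f_input, p["t_dw"]))), p["t_pw"])
    # Equation (11): depthwise convolutions at three kernel sizes.
    f_scales = [depth_conv(f_input, p[f"dw{k}"]) for k in (3, 5, 7)]
    # Equation (12): channel concatenation followed by 1x1 fusion.
    f_concat = np.concatenate([f_shape_feat, f_texture, *f_scales], axis=2)
    return relu(batch_norm(conv(f_concat, p["fuse"]))), f_concat

C, C_mid = 4, 2   # hypothetical channel counts
params = {
    "s1": rng.standard_normal((1, 1, C, C_mid)),
    "s2": rng.standard_normal((3, 3, C_mid, C_mid)),
    "s3": rng.standard_normal((1, 1, C_mid, C)),
    "t_dw": rng.standard_normal((3, 3, C)),
    "t_pw": rng.standard_normal((1, 1, C, C)),
    "dw3": rng.standard_normal((3, 3, C)),
    "dw5": rng.standard_normal((5, 5, C)),
    "dw7": rng.standard_normal((7, 7, C)),
    "fuse": rng.standard_normal((1, 1, 5 * C, C)),
}
F_input = rng.standard_normal((8, 8, C))
F_enhanced, F_concat = mafe(F_input, params)
print(F_concat.shape, F_enhanced.shape)
```

With these sizes the concatenated tensor has 5 × 4 = 20 channels, and the 1 × 1 fusion brings it back to 4, so the module can be inserted without changing the shape expected by later layers.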

Unlike standard Feature Pyramid Networks or simple skip connections that primarily fuse features across levels, the MAFE module introduces a task-aware, multi-branch refinement process. It distinctively incorporates: (1) a Shape Attention Branch to explicitly guide the model towards geometric structures; (2) a Texture Branch using depthwise separable convolutions to efficiently capture surface details; and (3) a Multi-scale Convolution Branch (with kernels of 3 × 3, 5 × 5, 7 × 7) to capture contextual information at various receptive fields. The innovation lies not in the use of Multi-scale Convolution per se, but in its synergistic integration with the dedicated shape and texture pathways within an adaptive fusion scheme. This holistic design, which explicitly decouples and enhances shape, texture, and multi-scale contextual cues, is specifically tailored for fine-grained quality assessment where these attributes are collectively critical. This offers a more targeted solution than general feature enrichment modules (e.g., transformer blocks) for agricultural product grading.

2.6. Improved YOLOv8 Network Model Architecture Diagram

The architecture of the enhanced dual-branch attention feature enhancement detection model is shown in Figure 6. In the Backbone and Neck sections of the original YOLOv8 model, Channel Adjust modules, MSAA modules, and MAFE modules are introduced, with optimized designs for the P3, P4, and P5 branches.

For the P5 branch, Channel Adjust, MSAA, and MAFE modules are added sequentially after the SPPF module, so that low-resolution features are processed by these modules before being passed into the P5 branch via Upsample and Concat operations. For the P4 branch, Channel Adjust, MSAA, and MAFE modules are introduced between the 6th-layer C2f module and the first Concat module to process medium-resolution features before transferring them to the P4 branch. For the P3 branch, Channel Adjust, MSAA, and MAFE modules are similarly added between the 4th-layer C2f module and the second Concat module to refine high-resolution features before passing them to the P3 branch. Specifically, the MAFE module serves as a feature enhancer within this pipeline: it takes the features from preceding layers (Backbone outputs or upsampled features), performs its multi-branch refinement as described in Section 2.5, and outputs the enhanced features to subsequent Concat operations and ultimately to the detection heads. The dual-branch detection model is named YOLOv8-Dual, the YOLOv8-Dual model with MSAA modules is named YOLOv8-MD, and the final improved model incorporating both MSAA and MAFE modules is named YOLOv8-MMD.

Finally, the outputs of the enhanced P3, P4, and P5 branches are fed into the dual-branch detection heads, achieving overall model optimization. The improved model supports multi-task object detection, adapts to the requirements of this study for diverse features in species recognition and quality assessment, and significantly enhances feature representation capability and model performance.

3. Results and Discussion

3.1. Dataset

The dataset used in this study comprises 2976 images collected from the white radish experimental fields at the Xiaotangshan Base of the Beijing Academy of Agriculture and Forestry Sciences. Images were captured with a camera mounted 2.0–2.5 m above ground, under varying natural lighting and across different growth stages to ensure robustness. The dataset was divided into training and validation sets in an 8:2 ratio to ensure balanced model training and evaluation. To meet multi-task detection requirements, the dataset was split into two independent sub-datasets: a species dataset and a quality dataset. The species dataset includes annotations for “white radish” and “white radish tassels”, while the quality dataset further categorizes “white radish” into three quality grades: “good”, “middle”, and “bad”. Both sub-datasets share the same images but employ distinct labeling systems. This design aims to enhance feature extraction efficiency by simultaneously addressing species detection and quality assessment tasks.

Detailed dataset statistics are provided in Table 1. The 2976 images collectively contain a total of 8542 annotated instances, each with an axis-aligned bounding box for object localization and classification. For the species detection task, these 8542 instances comprise 4321 “white radish” objects and 4221 “white radish tassel” objects. The quality assessment task is applied to the subset of 4321 radish instances, classifying them into 1851 “good”, 1580 “middle”, and 890 “bad” samples. This distribution reflects the natural prevalence of higher-quality radishes in the fields. To characterize the dataset’s environmental diversity, we note that approximately 30% of the images contain significant occlusion (e.g., radishes partially covered by soil or leaves), and the lighting conditions vary broadly across the collection to include direct sunlight, overcast shadows, and varying daylight hours.
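The instance counts quoted above can be cross-checked and turned into class shares with a few lines of arithmetic. The counts are those reported in the text; the helper name `shares` is introduced here only for illustration.

```python
# Instance counts as reported in the text.
species_counts = {"white radish": 4321, "white radish tassels": 4221}
quality_counts = {"good": 1851, "middle": 1580, "bad": 890}

def shares(counts):
    """Percentage share of each class, rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

# The species classes together account for all annotated instances,
# and the quality grades partition the radish instances.
print(sum(species_counts.values()))
print(sum(quality_counts.values()))
print(shares(quality_counts))
```

The "bad" grade makes up roughly one fifth of the radish instances, which is the imbalance later cited when discussing its lower precision.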

Sample images from the dataset are shown in Figure 7.

3.2. Preparation Work and Evaluation Metrics

The experimental configuration required for this study is shown in Table 2. The experiment utilized an RTX 4050 GPU, ran on a Windows 11 system, and employed the PyTorch 1.10.0+cu113 framework for training.

The experimental parameter settings are shown in Table 3: the number of training epochs was set to 300; the number of species categories was set to 2; the number of quality categories was set to 3; the batch size was set to 8; and the image size was set to 640 × 640.

The evaluation metrics in this study include precision, recall, mean average precision (mAP) at different thresholds (specifically mAP@0.5 and mAP@0.5:0.95), inference time, and FPS.

The calculation formula for precision is shown in Equation (13), where TP and FP denote the numbers of true positives and false positives, respectively.

\mathrm{Precision} = \frac{TP}{TP + FP}

(13)

The calculation formula for recall is shown in Equation (14), where FN denotes the number of false negatives.

\mathrm{Recall} = \frac{TP}{TP + FN}

(14)

The calculation formula for mAP is shown in Equation (15).

\mathrm{mAP} = \frac{\sum_{i=1}^{C} AP_{i}}{C}

(15)

where C represents the total number of classes, AP measures the detection performance for a single class, and AP_{i} denotes the AP of the i-th class.
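As a minimal sketch of Equations (13)–(15), the functions below compute precision, recall, and mAP directly from their definitions. The TP/FP/FN counts in the usage lines are hypothetical; the per-class AP50 values are the YOLOv8-Dual figures quoted in Section 3.3, used here only to show that the class average reproduces the reported overall scores.

```python
def precision(tp, fp):
    """Equation (13): share of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Equation (14): share of ground-truth objects that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def mean_average_precision(ap_per_class):
    """Equation (15): unweighted mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical detection counts, for illustration only.
print(precision(tp=90, fp=10), recall(tp=90, fn=30))

# Per-class AP50 values quoted for YOLOv8-Dual in Section 3.3.
species_ap50 = [0.99, 0.904]          # white radish, white radish tassels
quality_ap50 = [0.908, 0.882, 0.758]  # good, middle, bad
print(round(mean_average_precision(species_ap50), 3))
print(round(mean_average_precision(quality_ap50), 3))
```

Because mAP is an unweighted class mean, a weak minority class such as "bad" pulls the overall quality score down as much as a majority class would.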

The model is trained using the default YOLOv8 loss functions, which are well-suited for our multi-task detection framework. The total loss for each task branch (species or quality) is a weighted sum of three components: (1) bounding box loss (box_loss), based on CIoU (Complete Intersection over Union), which drives precise localization and shape alignment; (2) classification loss (cls_loss), implemented with Binary Cross-Entropy (BCE) for multi-class discrimination; and (3) Distribution Focal Loss (dfl_loss), which helps the model learn a flexible representation of the bounding box distribution. This combination is particularly effective for our dual-task setting because the box_loss and dfl_loss provide strong geometric supervision crucial for both detecting the radish body and accurately locating it for quality assessment, while the cls_loss handles the fundamental classification for both species and fine-grained quality categories. The shared Backbone learns features optimized by this composite objective, balancing localization and classification accuracy across tasks.
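The composite objective can be sketched as follows. This is an illustration under stated assumptions rather than the training code used in the study: the gains (7.5, 0.5, 1.5) are the values commonly shipped as defaults in the Ultralytics YOLOv8 configuration, the paper does not report the gains actually used, and the equal-weight sum over the two branches is a hypothetical choice. The loss values in the usage lines are likewise invented.

```python
from dataclasses import dataclass

@dataclass
class BranchLosses:
    """Per-branch loss components (unweighted)."""
    box: float  # CIoU-based bounding box loss
    cls: float  # BCE classification loss
    dfl: float  # Distribution Focal Loss

# Assumed gains; not reported in the paper.
GAINS = {"box": 7.5, "cls": 0.5, "dfl": 1.5}

def branch_total(losses, gains=GAINS):
    """Weighted sum of the three components for one task branch."""
    return (gains["box"] * losses.box
            + gains["cls"] * losses.cls
            + gains["dfl"] * losses.dfl)

def multitask_total(species, quality, w_species=1.0, w_quality=1.0):
    """Hypothetical combination of the two branches; equal weights assumed."""
    return w_species * branch_total(species) + w_quality * branch_total(quality)

# Invented loss values, for illustration only.
species = BranchLosses(box=0.5, cls=0.4, dfl=0.9)
quality = BranchLosses(box=0.5, cls=0.8, dfl=0.9)
print(branch_total(species), branch_total(quality), multitask_total(species, quality))
```

With gains of this kind the localization terms dominate the total, so differences between branches show up mainly through the classification term.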

3.3. Experimental Results and Analysis

3.3.1. YOLOv8-Dual Model Performance

This experiment includes the YOLOv8-Dual, YOLOv8-MD, and YOLOv8-MMD models. The loss curves of the YOLOv8-Dual model are shown in Figure 8. Figure 8a displays the training loss curves of YOLOv8-Dual, where train/box_loss-species represents the bounding box training loss of the species branch, train/cls_loss-species is the classification training loss of the species branch, train/dfl_loss-species is the distribution training loss of the species branch, train/box_loss-quality denotes the bounding box training loss of the quality branch, train/cls_loss-quality is the classification training loss of the quality branch, and train/dfl_loss-quality is the distribution training loss of the quality branch. In Figure 8a, all losses decrease and converge smoothly. The bounding box losses of both branches are the lowest, indicating that the predicted bounding boxes closely match the ground truth with high accuracy. The classification loss of the quality branch is the highest, suggesting some error in quality category prediction. Notably, the distribution losses of the species and quality branches are nearly identical, with only a slight initial difference in species distribution loss, demonstrating comparable bounding box prediction performance between the two branches.

Figure 8b shows the validation loss curves of YOLOv8-Dual, where val/box_loss-species represents the bounding box validation loss of the species branch, val/cls_loss-species is the classification validation loss of the species branch, val/dfl_loss-species is the distribution validation loss of the species branch, val/box_loss-quality denotes the bounding box validation loss of the quality branch, val/cls_loss-quality is the classification validation loss of the quality branch, and val/dfl_loss-quality is the distribution validation loss of the quality branch. In Figure 8b, the classification loss curve for the quality branch exhibits significant fluctuations with scattered points, indicating errors during validation that affect prediction accuracy. The distribution losses of both branches show minor fluctuations but eventually converge. The bounding box losses of both branches perform better than the training losses, suggesting that the predicted bounding boxes during validation also closely align with the ground truth, demonstrating robust performance.

However, YOLOv8-Dual exhibits slower convergence speed and moderate fluctuations during training, particularly evident in the validation curves, indicating potential stability issues. In terms of performance metrics, the species classification accuracy is slightly lower. A significant gap exists between training and validation losses in YOLOv8-Dual, especially within the species classification branch, suggesting possible overfitting issues and slightly weaker generalization capability of the model.

The performance of the dual-branch detection model is shown in Figure 9. All white radish tassels are successfully detected with high confidence scores, and all white radishes are labeled with dual tags (species label and quality label). Targets under the conveyor belt are accurately detected, demonstrating ideal detection outcomes. Experimental validation confirms that YOLOv8-Dual exhibits foundational multi-task processing capabilities. The model simultaneously achieves detection of white radishes and their tassels, along with quality grading, with species detection confidence maintained between 0.65 and 0.75. In quality assessment, the model shows relatively stable recognition of medium-grade (middle) radishes (confidence: 0.70–0.80), but exhibits fluctuations in identifying high-quality (good) and defective (bad) radishes. Straight radishes with wider diameters are classified as high-quality, moderately curved ones as medium-grade, and highly curved or forked malformed radishes as low-quality. The results indicate that YOLOv8-Dual provides a feasible framework for multi-task detection in intelligent white radish harvesting systems, though further improvements are needed in detection stability and quality assessment accuracy.

The evaluation metrics of the YOLOv8-Dual model are shown in Table 4. In the target species detection task, the model demonstrates high precision (0.931) and recall (0.918), with a corresponding AP50 of 0.947 and AP50-95 of 0.709, indicating robust detection performance even under high IoU thresholds. Further analysis of specific categories reveals optimal performance for “white radish” detection, where precision (0.976), recall (0.978), and AP50 (0.99) all approach 1, and AP50-95 reaches 0.789, confirming exceptional accuracy and robustness for “white radish” targets. For “white radish tassels,” performance is slightly lower, with precision and recall at 0.887 and 0.859, respectively, AP50 at 0.904, and AP50-95 at 0.629.

In the target quality assessment task, the overall detection performance is moderate, with precision at 0.772, recall at 0.828, AP50 at 0.849, and AP50-95 at 0.636. This indicates that the model underperforms in quality recognition compared to species detection, particularly with a more significant decline in performance under higher IoU thresholds (AP50-95), likely due to the subjective nature of quality assessment and imbalanced data distribution. For specific categories, the “good” class achieves the best performance, with precision and recall reaching 0.847 and 0.899, AP50 at 0.908, and AP50-95 at 0.681, demonstrating precise identification of high-quality targets. In contrast, the “middle” and “bad” classes show weaker performance, especially the “bad” class, with precision at 0.684, recall at 0.74, AP50 at 0.758, and AP50-95 at 0.561, reflecting the need for improved accuracy in detecting low-quality targets.

In terms of operational efficiency, the model maintains a high real-time performance of 125 frames per second (FPS) in detection tasks, demonstrating its potential for efficient processing of large-scale data in practical applications. Additionally, the floating-point operations (FLOPs) remain at 8.1 G, reflecting relatively low computational costs, making the model suitable for resource-constrained edge computing scenarios. YOLOv8-Dual achieves excellent overall performance in object detection tasks, particularly excelling in object category detection, with extremely high detection accuracy and robustness for the primary class “white radish.” However, the model slightly underperforms in object quality assessment, especially requiring further optimization in detecting low-quality targets.

3.3.2. YOLOv8-MD Model Performance

Figure 10 shows the detection loss curves of the YOLOv8-MD model, indicating significant improvements compared to the baseline YOLOv8-Dual model. In Figure 10a, the training losses of the dual-branch model decrease markedly, with faster convergence speed, rapidly declining within the first 50 epochs. The species branch loss drops to around 0.4, a 50% reduction from the baseline model, while the quality branch loss stabilizes near 0.8. All curves exhibit smoother convergence and enhanced performance. In Figure 10b, the validation curves of the quality branch show particularly notable improvement, with minimal scattered points and better convergence. Both branches achieve the lowest classification losses, confirming significantly improved prediction accuracy. The concentrated loss values indicate stronger generalization capability of the model.

Compared to the baseline model YOLOv8-Dual, the YOLOv8-MD model improves feature extraction capability through effective aggregation of multi-scale features, enhances perception of critical regions by incorporating attention mechanisms, and reduces the gap between training and validation losses, indicating mitigated overfitting. However, scattered points observed after training suggest that the model’s stability has not yet reached optimal performance.

The detection results of YOLOv8-MD are shown in Figure 11. Comparative analysis of two detection groups demonstrates that the improved YOLOv8-MD model achieves significant enhancements over the baseline YOLOv8-Dual model in multiple aspects. In the species detection task, the confidence score for white radish detection increases from 0.65–0.75 to 0.80–0.85, and the confidence for tassel detection improves from 0.70–0.80 to 0.82–0.89, reflecting stronger feature extraction and object recognition capabilities. For quality assessment, the enhanced model delivers more accurate and stable evaluations across quality grades. Notably, confidence in identifying high-quality (good) radishes rises from 0.45–0.65 to 0.70–0.85, while medium-quality (middle) evaluations stabilize within 0.75–0.82. Additionally, YOLOv8-MD exhibits improved robustness in complex agricultural environments, maintaining high detection accuracy and reliable quality assessment even under occlusion and lighting variations. These advancements validate the effectiveness of the MSAA module, enabling superior performance in real-world scenarios. Compared to YOLOv8-Dual, YOLOv8-MD significantly enhances detection stability, quality assessment accuracy, and environmental adaptability, providing more dependable technical support for intelligent white radish harvesting systems.

The evaluation metrics of the YOLOv8-MD model are shown in Table 5. The detection performance of the species branch improves on most metrics: precision increases from 0.931 to 0.939, AP50 from 0.947 to 0.951, and AP50-95 from 0.709 to 0.723, while recall decreases slightly from 0.918 to 0.91. The performance for “white radish” is further enhanced, with precision rising from 0.976 to 0.981, AP50 from 0.99 to 0.991, and AP50-95 from 0.789 to 0.802, indicating improved accuracy and robustness for this category. However, the detection performance for “white radish tassels” shows relatively modest gains (precision improves from 0.887 to 0.897, AP50 from 0.904 to 0.911, and AP50-95 from 0.629 to 0.645), remaining a weaker aspect of species detection.

Quality detection performance also shows corresponding improvements. Overall quality detection precision increases from 0.772 to 0.804, recall from 0.828 to 0.831, AP50 from 0.849 to 0.862, and AP50-95 from 0.636 to 0.644, indicating progress in object quality detection. The “good” class achieves the most notable gains, with precision rising from 0.847 to 0.879, AP50 from 0.908 to 0.919, and AP50-95 from 0.681 to 0.691. Both “middle” and “bad” classes exhibit improved performance, particularly “middle,” where AP50 increases from 0.882 to 0.889 and AP50-95 from 0.665 to 0.678, demonstrating enhanced stability in detecting medium-quality targets. For the “bad” class, AP50 and AP50-95 rise from 0.758 and 0.561 to 0.777 and 0.564, respectively, reflecting incremental improvements despite smaller gains.

The detection speed decreases from 125 FPS for YOLOv8-Dual to 100–101 FPS for YOLOv8-MD. This reduction is attributed to the introduction of the MSAA module, which, while enhancing feature aggregation and accuracy, adds computational layers and operations to the network architecture. The reported floating-point operations (FLOPs) remain at 8.1 G, indicating that the model optimizations did not significantly increase computational resource consumption, so the model remains suitable for resource-constrained practical application scenarios.

3.3.3. YOLOv8-MMD Model Performance

Figure 12 shows the detection loss curves of YOLOv8-MMD, which compare favorably with those of the YOLOv8-Dual and YOLOv8-MD models. In Figure 12a, all loss curves are more concentrated. The classification losses of both the species and quality branches show notable reductions compared to YOLOv8-Dual, with tightly clustered loss values and significantly reduced prediction errors, indicating substantial performance improvements. The distribution loss, bounding box loss, and classification loss in training are relatively similar, with the species branch achieving the lowest classification loss, reflecting higher prediction accuracy and optimal training effectiveness. Compared to YOLOv8-MD, the YOLOv8-MMD model converges faster, rapidly descending to stable levels in early training stages, with smoother loss curves. Notably, the scattered points observed in the YOLOv8-MD model disappear after incorporating the MAFE module, confirming that YOLOv8-MMD achieves superior stability and performance. This suggests that the MAFE module, by providing enhanced and stabilized multi-scale feature representations, works synergistically with the MSAA module to regularize the training process and reduce optimization oscillations, leading to smoother convergence.

The validation loss curves in Figure 12b are more convergent compared to YOLOv8-Dual, with no excessive scattered points, particularly evident in the validation classification loss of the quality branch. The YOLOv8-Dual model exhibits noticeable oscillations in validation classification loss, whereas the YOLOv8-MMD model demonstrates more stable and smoother validation classification loss. In Figure 12b, the loss value for classification metrics stabilizes around 0.4, indicating the highest prediction accuracy for species. Compared to the validation loss of the YOLOv8-MD model, the loss curves in Figure 12b achieve faster convergence and better generalization capability, with reduced curve fluctuations. Additionally, the classification loss in Figure 12b aligns with that in Figure 12a, demonstrating a significant improvement in the prediction accuracy of the enhanced YOLOv8-MMD model.

Compared to the YOLOv8-Dual and YOLOv8-MD models, YOLOv8-MMD exhibits significantly faster convergence speed. The loss values of the species branch rapidly drop below 0.5 within the first 50 epochs, whereas the YOLOv8-Dual model requires nearly 200 epochs to reach similar levels. The initial curve rapidly declines, particularly evident in the loss of species classification, quickly falling below 1.0. In particular, for the quality assessment task, the loss curves of the quality branch in YOLOv8-MMD are smoother and ultimately converge to lower loss values (approximately 0.7), while the YOLOv8-Dual model exhibits pronounced fluctuations and higher loss values.

The training and validation curves of the YOLOv8-MMD model indicate significantly improved stability after the initial convergence phase. The loss values remain at consistent levels with minimal oscillations, demonstrating enhanced model robustness. Compared to the YOLOv8-Dual model (approximately 0.3–0.8), the final loss values of the YOLOv8-MMD model’s classification branch (around 0.3–0.4) are lower, reflecting a notable improvement in species classification accuracy. Similarly, the quality assessment branch also shows enhancements, with the YOLOv8-MMD model maintaining greater stability and generally lower loss values throughout training. The reduced gap between training and validation losses in YOLOv8-MMD indicates better generalization capability, addressing issues like overfitting observed in the YOLOv8-Dual model. Through the synergistic integration of the MSAA and MAFE modules, YOLOv8-MMD exhibits superior stability in later training stages, with significantly reduced fluctuations in loss curves across branches, suggesting the model has identified optimal feature representations and task balance. These results conclusively demonstrate that the proposed YOLOv8-MMD model not only accelerates training but also substantially enhances performance and stability.

The YOLOv8-MMD model demonstrates exceptional detection performance in intelligent white radish harvesting scenarios. The detection results of the YOLOv8-MMD model are shown in Figure 13. In species recognition, the model maintains stable confidence scores of 0.80–0.85 for white radish detection and achieves a confidence of 0.82–0.91 in tassel recognition, showcasing robust feature extraction capabilities. For quality assessment, the model accurately classifies white radishes across quality grades: confidence for the “good” category reaches 0.77–0.81, the “middle” category stabilizes between 0.78 and 0.80, and the “bad” category improves to 0.83–0.88, highlighting superior quality evaluation accuracy. Notably, the model retains stable detection performance in complex agricultural environments, delivering high-confidence and precise results even under partial occlusion and varying lighting conditions.

Compared to the baseline YOLOv8-Dual model and the YOLOv8-MD model with only the MSAA module added, YOLOv8-MMD achieves significant improvements across multiple aspects. In species detection, the average detection confidence increases by 15–20%, far exceeding the baseline model’s 0.65–0.75 and YOLOv8-MD’s 0.75–0.80. Enhancements in quality assessment are even more pronounced, particularly for the “good” category, where confidence improves from the baseline model’s 0.45–0.65 to 0.77–0.81, marking an over 30% gain. Through the synergistic integration of the MSAA and MAFE modules, the model not only elevates detection accuracy but also strengthens feature extraction and fusion capabilities, yielding more stable and reliable results. Overall, by combining a multi-task learning framework with feature enhancement modules, the YOLOv8-MMD model comprehensively advances detection precision, quality assessment, and environmental adaptability for intelligent white radish harvesting systems, providing robust technical support for practical applications in agricultural robotics.

From the data in Table 6, YOLOv8-MMD achieves a species detection precision of 0.945, an improvement of 0.6 percentage points over YOLOv8-MD’s 0.939. The recall rate increases from 0.91 to 0.924, indicating enhanced coverage in target detection. While AP50 slightly decreases from 0.951 to 0.949 (minor fluctuation), AP50-95 remains stable at 0.723. For white radish detection, both models share a precision of 0.981, but YOLOv8-MMD improves recall from 0.976 to 0.978, with AP50 and AP50-95 stable at 0.991 and 0.802, demonstrating stable and slightly optimized performance for this category. For white radish tassel detection, YOLOv8-MMD’s precision decreases from 0.897 to 0.885, but recall improves from 0.844 to 0.869, suggesting expanded coverage for complex targets despite the precision drop. AP50 declines marginally from 0.911 to 0.908, and AP50-95 decreases from 0.645 to 0.643, highlighting room for further improvement in tassel detection.

The observed lower precision for the “bad” quality class (0.691) and “white radish tassels” (0.885) compared to the main “white radish” class (0.981) can be attributed to several factors. First, data imbalance existed within the dataset: the number of “bad” quality samples was naturally lower than the “good” and “middle” grades, and tassels often occupied smaller, less distinct regions in the images compared to the prominent radish bodies. This imbalance can hinder the model’s ability to learn robust features for these minority classes. Second, the inherent difficulty of the tasks plays a role: distinguishing low-quality radishes (“bad”) often relies on subtle, irregular morphological defects that are highly variable, while detecting thin, elongated, and often occluded tassels against a complex soil and leaf background is inherently challenging. Subjectivity in quality labeling, especially at the boundary between “middle” and “bad” grades, may also introduce label noise, further impacting the precision of the “bad” class.

To further elucidate the classification performance of the YOLOv8-MMD model, we present the confusion matrices for both tasks. Figure 14a shows the 2 × 2 confusion matrix for species detection. The model correctly identifies 798 white radish instances, with 66 misclassified as tassels (7.6% error rate). Similarly, 780 tassel instances are correctly detected, with 46 misclassified as radishes (5.4% error rate). This symmetrical error pattern indicates that the primary confusion occurs between the radish body and its foliage, which is visually plausible given their adjacency and partial occlusion in the images.

Figure 14b presents the 3 × 3 confusion matrix for quality assessment. The model performs well on the “good” class (305 correct, 55 misclassified as “middle”, 5 as “bad”), demonstrating reliable identification of high-quality radishes. For the “middle” class, 247 are correctly classified, with 43 misclassified as “good” and 26 as “bad”, showing reasonable confusion with adjacent quality grades. The “bad” class shows the expected challenges: while 148 are correctly identified, 25 are misclassified as “middle” and 5 as “good”. This confusion matrix quantitatively supports our earlier analysis regarding the difficulty of the “bad” class, showing that most misclassifications occur with the adjacent “middle” class rather than the distant “good” class, which aligns with the subjective nature of quality grading boundaries.
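The adjacency pattern described above can be quantified from the quoted counts. The sketch below assumes rows are true grades and columns are predicted grades, and it ignores any background row or column that the full matrix in Figure 14b may contain, so the resulting rates are not directly comparable to the precision and recall values in the tables.

```python
CLASSES = ["good", "middle", "bad"]
# Rows: true grade; columns: predicted grade (counts quoted in the text).
CONFUSION = [
    [305, 55, 5],    # true "good"
    [43, 247, 26],   # true "middle"
    [5, 25, 148],    # true "bad"
]

def row_accuracy(matrix):
    """Share of each true class that is assigned the correct grade."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

def adjacent_error_share(matrix):
    """Share of all misclassifications that land on a neighbouring grade."""
    adjacent = distant = 0
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i == j:
                continue
            if abs(i - j) == 1:
                adjacent += count
            else:
                distant += count
    return adjacent / (adjacent + distant)

for name, acc in zip(CLASSES, row_accuracy(CONFUSION)):
    print(f"{name}: {acc:.3f}")
print(f"adjacent-grade errors: {adjacent_error_share(CONFUSION):.3f}")
```

Under these assumptions, the large majority of grading errors fall between neighbouring grades, consistent with the interpretation that the grade boundaries, rather than gross misrecognition, drive most mistakes.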

In terms of quality detection performance, YOLOv8-MMD achieves an overall quality detection precision of 0.812, surpassing YOLOv8-MD’s 0.804 (an increase of 0.8 percentage points). The recall rate improves from 0.831 to 0.836, indicating expanded coverage in quality assessment. AP50 slightly decreases from 0.862 to 0.859, while AP50-95 rises from 0.644 to 0.655, reflecting improved performance under higher IoU thresholds. For the “good” category, precision drops from 0.879 to 0.86, AP50 declines from 0.919 to 0.91, but AP50-95 remains stable at 0.691, maintaining robust detection under high IoU despite the precision dip. The “middle” category shows minor declines: precision decreases from 0.822 to 0.815, AP50 from 0.889 to 0.887, and AP50-95 from 0.678 to 0.676, with minimal performance fluctuations. For the “bad” category, YOLOv8-MMD’s precision drops from 0.709 to 0.691, AP50 increases marginally from 0.777 to 0.78, and AP50-95 improves from 0.564 to 0.565, suggesting slight high-IoU detection gains for low-quality targets, though overall performance remains suboptimal.

Despite the sequential addition of modules, YOLOv8-MMD achieves a faster detection speed (107–112 FPS) than YOLOv8-MD (100–101 FPS). This recovery in efficiency, even with the extra MAFE module, occurs because its efficient design (e.g., depthwise separable convolutions) minimizes overhead, and its synergy with MSAA yields a more stable model that requires less computational redundancy during inference. The floating-point operations (FLOPs) remain unchanged at 8.1 G, indicating that the speed optimization does not introduce additional computational complexity.
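To relate the reported throughput to a per-frame time budget, FPS can be converted to latency. The FPS figures are those reported in the text (lower bound for YOLOv8-MD, upper bound for YOLOv8-MMD); the 30 FPS camera rate used for the budget is a hypothetical deployment assumption.

```python
def latency_ms(fps):
    """Average per-frame processing time in milliseconds."""
    return 1000.0 / fps

# Reported throughput; MD uses the lower bound (100-101), MMD the upper bound (107-112).
REPORTED_FPS = {"YOLOv8-Dual": 125, "YOLOv8-MD": 100, "YOLOv8-MMD": 112}

CAMERA_FPS = 30  # hypothetical camera frame rate, not taken from the paper
budget = latency_ms(CAMERA_FPS)

for model, fps in REPORTED_FPS.items():
    t = latency_ms(fps)
    print(f"{model}: {t:.2f} ms per frame, within {budget:.1f} ms budget: {t < budget}")
```

All three models leave substantial headroom under such a budget, so the differences between them matter mainly when the detector shares the device with other workloads.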

Compared to YOLOv8-MD, YOLOv8-MMD demonstrates superior recall rates and detection performance under high IoU thresholds (AP50-95), along with significantly improved detection speed (FPS), showcasing enhanced real-time capabilities. However, its detection precision and certain metrics (e.g., AP50) show slight declines, particularly for complex categories like “white radish tassels” and low-quality targets such as “bad,” indicating remaining room for improvement.

3.4. Ablation Study

To comprehensively evaluate the performance of the proposed method, this paper conducts detailed comparative experiments on three models using the white radish detection dataset, and the comparative results are shown in Figure 15. Figure 15a–d illustrate the training process and final performance of each model on key metrics including Precision, Recall, mAP@50, and mAP@50-95, respectively.

From the Precision metric, the YOLOv8-MMD model performs the best, with its species detection precision stabilizing above 0.92 in later training stages, representing an improvement of approximately 12 percentage points over the baseline YOLOv8-Dual model. The precision of the quality assessment branch also reaches 0.88, a significant increase from the baseline model’s 0.78. The YOLOv8-MD model, by introducing the MSAA module, achieves a species detection precision of 0.85, demonstrating the effectiveness of the attention mechanism, though it still lags behind the full YOLOv8-MMD solution.

In terms of the Recall metric, all three models exhibit good convergence characteristics, but there are notable differences in convergence speed and final performance. The YOLOv8-MMD model achieves rapid convergence within the first 50 epochs, ultimately attaining a species detection recall of 0.91 and a quality assessment recall of 0.87. These results significantly outperform the YOLOv8-Dual model (species detection: 0.82, quality assessment: 0.75) and the YOLOv8-MD model (species detection: 0.86, quality assessment: 0.82).

From the more challenging metrics of mAP@50 and mAP@50-95, the improvements are even more pronounced. YOLOv8-MMD achieves an outstanding mAP@50 of 0.93, a 15.2% increase over the baseline model. Notably, under the stringent mAP@50-95 evaluation standard, YOLOv8-MMD maintains a high performance of 0.72, fully demonstrating its stable detection capability across varying IoU thresholds. The YOLOv8-MD model with only the MSAA module achieves 0.88 in mAP@50 and 0.65 in mAP@50-95, which, while superior to the baseline, still lags significantly behind the full YOLOv8-MMD framework. This underscores the critical role of the MAFE module in enhancing model performance.

The comprehensive evaluation results across all metrics demonstrate that the proposed YOLOv8-MMD model achieves significant improvements in key indicators such as detection precision, recall, and mean average precision (mAP) through the synergistic integration of the MSAA and MAFE modules. Particularly in handling the challenging quality assessment task, the model exhibits enhanced feature learning and representation capabilities, providing robust technical support for the practical application of intelligent white radish harvesting systems.

Table 7 presents a detailed performance comparison of three models (YOLOv8-Dual, YOLOv8-MD, and YOLOv8-MMD) on the white radish detection task. Across all evaluation metrics, the improved models demonstrate steady enhancements. For species detection, compared to the baseline YOLOv8-Dual model (species precision: 0.931), the YOLOv8-MD model with the MSAA module increases to 0.939, and the YOLOv8-MMD model incorporating both MSAA and MAFE modules achieves 0.945, showcasing the best species recognition capability. Similarly, quality precision exhibits a progressive improvement trend, rising from 0.772 in the baseline model to 0.812 in the final version.

In terms of recall, species recall improves from 0.918 to 0.924 and quality recall from 0.828 to 0.836 between the baseline and YOLOv8-MMD, indicating improved target detection capability. The mAP gains are of a similar, modest magnitude: species mAP@50 rises from 0.947 to 0.949, and quality mAP@50 improves from 0.849 to 0.859. Under the stricter mAP@50-95 evaluation standard, species detection improves from 0.709 to 0.723 and quality assessment from 0.636 to 0.655.

In terms of processing speed, the proposed modules introduce a moderate parameter increase (from 3.01 M to 3.35 M) while the FLOPs count stays at 8.1 G. Inference speed dips and then partly recovers: adding MSAA lowers the frame rate from 125 to 100 FPS on the species branch and from 111 to 101 FPS on the quality branch, while the full YOLOv8-MMD model reaches 107 and 112 FPS, respectively. The quality branch thus returns to its baseline speed, whereas the species branch remains below the baseline’s 125 FPS. Overall, the combined model achieves an effective balance between enhanced accuracy and real-time performance.
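Frame rates are easier to relate to a robot's control loop when expressed as per-frame latency. The following is a minimal sketch, assuming the FPS values in Table 7 are per-image throughputs measured one image at a time (the section does not state the measurement protocol); the dictionary and function names are illustrative.

```python
# FPS values transcribed from Table 7 (species branch, quality branch).
FPS = {
    "YOLOv8-Dual": {"species": 125, "quality": 111},
    "YOLOv8-MD": {"species": 100, "quality": 101},
    "YOLOv8-MMD": {"species": 107, "quality": 112},
}


def latency_ms(fps):
    """Per-frame latency in milliseconds for a given frame rate."""
    return 1000.0 / fps


for model, branches in FPS.items():
    parts = ", ".join(
        f"{branch}: {latency_ms(fps):.2f} ms" for branch, fps in branches.items()
    )
    print(f"{model:12s} {parts}")
```

Under this assumption, the differences between the three variants amount to at most 2 ms per frame (8 ms at 125 FPS versus 10 ms at 100 FPS).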

The ablation study indicates that the MSAA and MAFE modules are complementary. The YOLOv8-MD model (with MSAA) already improves upon the baseline, and incorporating the MAFE module (forming YOLOv8-MMD) yields further gains, particularly in the more challenging quality assessment task. For instance, quality precision increases from 0.804 to 0.812, and the quality mAP@50-95, a strict measure of localization and classification accuracy, rises from 0.644 to 0.655, although mAP@50 dips slightly for both tasks (Table 7). This suggests that the MAFE module leverages and refines the multi-scale, context-aware features aggregated by MSAA, specializing them for fine-grained discrimination. The convergence stability observed in YOLOv8-MMD’s loss curves (Figure 12), absent in YOLOv8-MD, further supports this positive interaction, leading to a more robust and well-optimized model.
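The step-by-step direction of change at each ablation stage can be recomputed from the accuracy values in Table 7. The snippet below is a small sketch for checking those trends; the data structure and function names are illustrative and not part of the authors' code.

```python
# Accuracy values transcribed from Table 7 as (species, quality) pairs.
ABLATION = {
    "YOLOv8-Dual": {"precision": (0.931, 0.772), "recall": (0.918, 0.828),
                    "mAP@50": (0.947, 0.849), "mAP@50-95": (0.709, 0.636)},
    "YOLOv8-MD": {"precision": (0.939, 0.804), "recall": (0.910, 0.831),
                  "mAP@50": (0.951, 0.862), "mAP@50-95": (0.723, 0.644)},
    "YOLOv8-MMD": {"precision": (0.945, 0.812), "recall": (0.924, 0.836),
                   "mAP@50": (0.949, 0.859), "mAP@50-95": (0.723, 0.655)},
}
ORDER = ["YOLOv8-Dual", "YOLOv8-MD", "YOLOv8-MMD"]


def trend(previous, current, digits=3):
    """Return the marker used in Table 7 for a change between two stages."""
    diff = round(current - previous, digits)  # rounding avoids float noise
    if diff > 0:
        return "↑"
    if diff < 0:
        return "↓"
    return "-"


def step_trends(prev_model, next_model):
    """Map each metric to its (species, quality) trend markers."""
    result = {}
    for metric, (sp_prev, q_prev) in ABLATION[prev_model].items():
        sp_next, q_next = ABLATION[next_model][metric]
        result[metric] = (trend(sp_prev, sp_next), trend(q_prev, q_next))
    return result


for prev_model, next_model in zip(ORDER, ORDER[1:]):
    print(f"{prev_model} -> {next_model}: {step_trends(prev_model, next_model)}")
```

Comparing each model with its immediate predecessor, rather than with the baseline, matches the convention stated in the note to Table 7.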

These experimental results fully validate the effectiveness of the proposed improvements, significantly enhancing the model’s detection accuracy and quality assessment capabilities while maintaining computational complexity.

3.5. Comparative Experiments

To validate the superiority of the YOLOv8-MMD model, we designed a series of comparative experiments against mainstream object detection models, including YOLOv5, YOLOv7, YOLOv8, and YOLOv11, for systematic evaluation. The experiments quantitatively assessed model performance across multiple dimensions: species precision, quality precision, species recall, quality recall, species mean average precision (mAP@50), quality mean average precision (mAP@50), species mean average precision (mAP@50-95), quality mean average precision (mAP@50-95), species frame rate, quality frame rate, and floating-point operations (GFLOPs). The comparative results are summarized in Table 8.

The results show that the YOLOv8-MMD model leads on several key metrics. In species precision, YOLOv8-MMD reaches 0.945, which is 1.4 percentage points above the baseline YOLOv8 (0.931) and 2.3 and 2.7 percentage points above YOLOv7 (0.922) and YOLOv5 (0.918), respectively. For quality precision, YOLOv8-MMD’s 0.812 also surpasses all compared models, exceeding YOLOv8 (0.792) by 2.0 percentage points and YOLOv7 (0.732) by 8.0 percentage points. These gains indicate stronger classification and feature representation on both tasks.
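The precision differences quoted above are absolute differences between two scores, which is not the same as a relative improvement. The short sketch below recomputes both from the Table 8 precision values; the function names are illustrative.

```python
# Precision values transcribed from Table 8 as (species, quality) pairs.
PRECISION = {
    "YOLOv5": (0.918, 0.783),
    "YOLOv7": (0.922, 0.732),
    "YOLOv8": (0.931, 0.792),
    "YOLOv11": (0.922, 0.786),
    "YOLOv8-MMD": (0.945, 0.812),
}


def point_gain(new, old):
    """Absolute gain in percentage points between two scores in [0, 1]."""
    return (new - old) * 100.0


def relative_gain(new, old):
    """Gain relative to the old score, in percent."""
    return (new - old) / old * 100.0


ours_species, ours_quality = PRECISION["YOLOv8-MMD"]
for model, (species, quality) in PRECISION.items():
    if model == "YOLOv8-MMD":
        continue
    print(
        f"vs {model:8s} "
        f"species: {point_gain(ours_species, species):+.1f} pts "
        f"({relative_gain(ours_species, species):+.1f}% rel.), "
        f"quality: {point_gain(ours_quality, quality):+.1f} pts "
        f"({relative_gain(ours_quality, quality):+.1f}% rel.)"
    )
```

For example, the species precision gain over YOLOv8 is 1.4 percentage points, or about 1.5% in relative terms.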

In recall, YOLOv8-MMD achieves 0.924 for species and 0.836 for quality. Quality recall is 0.5 percentage points above the baseline YOLOv8’s 0.831, while species recall is on par with the 0.930 listed for YOLOv8 in Table 8. In species mAP@50 and mAP@50-95, YOLOv8-MMD reaches 0.949 and 0.723, against 0.951 and 0.722 for YOLOv8. On the quality task, mAP@50-95 rises from YOLOv8’s 0.647 to 0.655, the highest value among the compared models, which is consistent with the multi-scale feature fusion provided by the MSAA and MAFE modules. This gain in high-IoU localization accuracy is relevant for robotic harvesting, as it enables more precise targeting of the radish body, thereby reducing the risk of grasp misalignment and potential damage to the crop.

In terms of resource consumption, YOLOv8-MMD achieves 8.1 GFLOPs, comparable to the baseline YOLOv8, while its species and quality frame rates reach 107 and 112 FPS, respectively, demonstrating real-time inference capabilities while maintaining high precision.

In conclusion, the experimental results demonstrate that the YOLOv8-MMD model, through the integration of the Multi-Scale Attention Aggregation (MSAA) and Multi-scale Feature Enhancement (MAFE) modules, achieves the highest species and quality precision and the highest mAP@50-95 among the compared models, while keeping recall and mAP@50 at a comparable level and retaining real-time inference speed. This makes YOLOv8-MMD a well-balanced detection solution for the dual-task white radish setting and a useful reference point for related multi-task detection problems.

While the proposed model achieves robust performance, analysis of the validation set reveals consistent failure modes. These primarily occur under extreme occlusion (where over 60% of the radish is covered by soil or leaves), leading to missed detections, and in borderline quality cases where morphological defects are subtle, resulting in confusion between middle and bad grades. These cases underscore the inherent difficulty of vision-based harvesting in unconstrained environments and point to the potential benefit of integrating additional sensory modalities (e.g., tactile sensing) in future work.

4. Conclusions

This study addresses the dual-task challenge of species detection and quality assessment in white radish harvesting by proposing a YOLOv8-based multi-task detection framework. The introduction of the Multi-Scale Attention Aggregation (MSAA) and Multi-scale Feature Enhancement (MAFE) modules significantly enhances the model’s feature extraction and representation capabilities.

Experimental results demonstrate that the proposed YOLOv8-MMD model achieves remarkable and balanced improvements in both detection accuracy and real-time processing capability. A key contribution is the latency-sharing framework built upon a shared Backbone with dual task-specific heads. This design enables efficient computation by extracting common visual features once for both tasks, thereby eliminating redundant processing and minimizing overall latency. Coupled with the lightweight design of the MSAA and MAFE modules, this framework allows the model to achieve a high processing speed of 112 FPS while maintaining a low computational footprint of 8.1 GFLOPs, directly fulfilling the real-time requirements of robotic harvesting systems.
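The idea of extracting shared features once and passing them to two task-specific heads can be illustrated with a deliberately simplified toy. This is not the YOLOv8-MMD implementation: the "backbone" here is a stand-in that averages pixel rows, the two head functions use arbitrary thresholds, and all names are hypothetical. The only point is the control flow, in which one backbone pass serves both tasks.

```python
class SharedBackbone:
    """Toy stand-in for a shared feature extractor that counts its calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self, image):
        self.calls += 1
        # Placeholder "features": the mean intensity of each image row.
        return [sum(row) / len(row) for row in image]


def species_head(features):
    """Hypothetical species head: arbitrary threshold on mean feature value."""
    score = sum(features) / len(features)
    return "white radish" if score >= 0.5 else "white radish tassel"


def quality_head(features):
    """Hypothetical quality head: arbitrary cut-offs on feature spread."""
    spread = max(features) - min(features)
    if spread < 0.2:
        return "good"
    if spread < 0.5:
        return "middle"
    return "bad"


def detect(image, backbone):
    """Run the backbone once and feed the same features to both heads."""
    features = backbone(image)
    return {"species": species_head(features), "quality": quality_head(features)}


backbone = SharedBackbone()
toy_image = [[0.9, 0.8, 0.7], [0.6, 0.7, 0.8], [0.8, 0.9, 0.7]]
print(detect(toy_image, backbone), "backbone passes:", backbone.calls)
```

Running two independent single-task detectors would instead require two backbone passes per image, which is the redundancy the shared design avoids.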

The proposed framework exhibits promising scalability to other root vegetable harvesting scenarios, such as carrots or turnips, which share similar visual challenges: partial soil occlusion, varied morphology, and the need for in-field quality sorting. While direct application would require training on new task-specific datasets, the core architectural innovations—the dual-branch design for efficient multi-task learning, the MSAA module for handling size variance, and the MAFE module for robustness against occlusion and fine-grained feature enhancement—provide a generalized solution template. Adaptation would primarily involve adjusting the classification heads and fine-tuning the model on the new crop’s imagery.

For future work, we plan to focus on two directions to promote practical application: (1) deploying and optimizing the YOLOv8-MMD model on embedded systems (e.g., NVIDIA Jetson) with limited hardware resources for field operation; (2) exploring network pruning or quantization techniques to further increase the inference speed (FPS) without compromising the achieved accuracy. The proposed method provides an effective and efficient technical solution for intelligent decision making in complex agricultural environments.

Author Contributions

Conceptualization, X.W.; methodology, X.W. and K.Y.; software, X.W. and K.Y.; validation, X.W., K.Y., W.Z. and W.C.; formal analysis, X.W.; investigation, X.W., W.Z. and Y.W.; resources, Y.W., C.Z. and W.C.; data curation, K.Y. and W.Z.; writing—original draft preparation, X.W.; writing—review and editing, K.Y., W.Z., Y.W., W.C. and C.Z.; visualization, X.W.; supervision, Y.W. and W.C.; project administration, W.C.; funding acquisition, W.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Major Project of Scientific and Technological Innovation 2030, grant number 2021ZD0113603; the National Natural Science Foundation of China, grant number 62276028; and the Major Research Plan of the National Natural Science Foundation of China, grant number 92267110. The APC was funded by the above funds.

Institutional Review Board Statement

Ethical review and approval were waived for this study because it did not involve human participants, animal subjects, or any sensitive personal data. The research utilized only computer vision algorithms and publicly available/simulated agricultural image datasets.

Informed Consent Statement

Not applicable. This study did not involve human participants.

Data Availability Statement

The curated white radish image dataset (2976 images) and the corresponding bounding box annotations for species and quality tasks presented in this study are not publicly available due to ongoing collaborative project agreements but are available from the corresponding author upon reasonable request for academic research purposes. The detailed statistical summary of the dataset is provided within the article (Section 3.1). The model codes are available from the author upon reasonable request.

Acknowledgments

The authors thank all contributors who supported this research. We confirm that all individuals acknowledged here have consented to the acknowledgement.

Conflicts of Interest

The authors declare no conflicts of interest.


Figure 1. YOLOv8 model architecture diagram.

Figure 2. Feature sharing mechanism.

Figure 3. YOLOv8 multi-task detection module.

Figure 4. MSAA basic module.

Figure 5. MAFE model architecture diagram.

Figure 6. Improved YOLOv8 network model architecture diagram.

Figure 7. Sample images of the dataset: (a) white radish under normal lighting; (b) white radish with partial leaf occlusion; (c) white radish and its tassels at different growth stages.

Figure 8. Loss curves of the YOLOv8-Dual model: (a) training loss curves, showing the bounding box, classification, and distribution losses for both the species and quality branches; (b) validation loss curves, illustrating the loss variations of each branch on the validation set.

Figure 9. Detection results of YOLOv8-Dual: (a) an example scene with clear lighting; (b) detection under partial occlusion; (c) radish size variation; (d) complex background with multiple tassels. All white radish tassels are successfully detected with high confidence scores, and all white radishes are labeled with dual tags (species and quality).

Figure 10. Detection loss diagram of YOLOv8-MD: (a) Training loss curves for both the species and quality branches, showing faster convergence; (b) Validation loss curves, where the quality branch exhibits minimal scattered points and improved stability.

Figure 11. Detection results of YOLOv8-MD: (a) typical detection scene with clear lighting; (b) detection performance under partial occlusion; (c) detection with varying lighting conditions; (d) detection in complex background with multiple targets. The model demonstrates enhanced confidence scores for both white radish (0.80–0.85) and tassel detection (0.82–0.89), providing reliable quality assessment even in challenging environments.

Figure 12. Detection loss diagram of YOLOv8-MMD: (a) Training loss curves showing concentrated and rapidly converging loss values for both species and quality branches, with the MAFE module eliminating scattered points observed in earlier models; (b) Validation loss curves demonstrating improved convergence and stability, particularly in the quality branch, with classification loss stabilizing around 0.4.

Figure 13. Detection results of YOLOv8-MMD: (a) an example scene with clear lighting; (b) detection under partial occlusion; (c) radish size variation; (d) complex background with multiple tassels. All white radish tassels are successfully detected with high confidence scores, and all white radishes are labeled with dual tags (species and quality).

Figure 14. Confusion matrices of the YOLOv8-MMD model for (a) species detection and (b) quality assessment tasks.

Figure 15. Model metrics comparison diagram: (a) Precision; (b) Recall; (c) mAP@50; (d) mAP@50-95.

Table 1. Annotation statistics of the white radish dataset.

Category	Instances
White Radish-good	1851
White Radish-middle	1580
White Radish-bad	890
White Radish Tassel	4221

Table 2. Experimental configuration.

Name	Detailed Configuration
GPU	NVIDIA GeForce RTX 4050 Laptop GPU (NVIDIA Corporation, Santa Clara, CA, USA)
CPU	13th Gen Intel(R) Core(TM) i5-13500H 2.60 GHz
Operating System	Windows 11 (Microsoft Corporation, Redmond, WA, USA)
CUDA	11.3
cuDNN	8.2
PyTorch	1.10.0+cu113

Table 3. Experimental parameter settings.

Parameter	Detailed Configuration
Epochs	300
Number of Species Classes	2
Number of Quality Classes	3
Batch Size	8
Image Size	640 × 640

Table 4. YOLOv8-Dual evaluation metrics.

Class	Precision	Recall	mAP@50	mAP@50-95	FPS	FLOPs (G)
species-all	0.931	0.918	0.947	0.709	125	8.1
white radish	0.976	0.978	0.99	0.789	125	8.1
white radish tassels	0.887	0.859	0.904	0.629	125	8.1
quality-all	0.772	0.828	0.849	0.636	111	8.1
good	0.847	0.899	0.908	0.681	111	8.1
middle	0.786	0.845	0.882	0.665	111	8.1
bad	0.684	0.74	0.758	0.561	111	8.1

Table 5. YOLOv8-MD evaluation metrics.

Class	Precision	Recall	mAP@50	mAP@50-95	FPS	FLOPs (G)
species-all	0.939	0.91	0.951	0.723	100	8.1
white radish	0.981	0.976	0.991	0.802	100	8.1
white radish tassels	0.897	0.844	0.911	0.645	100	8.1
quality-all	0.804	0.831	0.862	0.644	101	8.1
good	0.879	0.907	0.919	0.691	101	8.1
middle	0.822	0.846	0.889	0.678	101	8.1
bad	0.709	0.741	0.777	0.564	101	8.1

Table 6. YOLOv8-MMD evaluation metrics.

Class	Precision	Recall	mAP@50	mAP@50-95	FPS	FLOPs (G)
species-all	0.945	0.924	0.949	0.723	107	8.1
white radish	0.981	0.978	0.991	0.802	107	8.1
white radish tassels	0.885	0.869	0.908	0.643	107	8.1
quality-all	0.812	0.836	0.859	0.655	112	8.1
good	0.86	0.911	0.91	0.685	112	8.1
middle	0.815	0.834	0.887	0.676	112	8.1
bad	0.691	0.764	0.78	0.565	112	8.1

Table 7. Ablation study. Note: “↑” and “↓” denote performance improvement and decline, respectively, compared to the previous model in the ablation progression; “-” denotes no change.

(a) Accuracy metrics
Model	Precision (Species)	Precision (Quality)	Recall (Species)	Recall (Quality)	mAP@50 (Species)	mAP@50 (Quality)
YOLOv8-Dual	0.931	0.772	0.918	0.828	0.947	0.849
YOLOv8-MD	0.939 ↑	0.804 ↑	0.91 ↓	0.831 ↑	0.951 ↑	0.862 ↑
YOLOv8-MMD	0.945 ↑	0.812 ↑	0.924 ↑	0.836 ↑	0.949 ↓	0.859 ↓
(b) Efficiency and comprehensive metrics
Model	mAP@50-95 (Species)	mAP@50-95 (Quality)	FPS (Species)	FPS (Quality)	FLOPs (G)	Params (M)
YOLOv8-Dual	0.709	0.636	125	111	8.1	3.01
YOLOv8-MD	0.723 ↑	0.644 ↑	100 ↓	101 ↓	8.1	3.18
YOLOv8-MMD	0.723 -	0.655 ↑	107 ↑	112 ↑	8.1	3.35

Table 8. Comparative experiments.

(a) Accuracy metrics
Model	Precision (Species)	Precision (Quality)	Recall (Species)	Recall (Quality)	mAP@50 (Species)	mAP@50 (Quality)
YOLOv5	0.918	0.783	0.939	0.837	0.951	0.860
YOLOv7	0.922	0.732	0.931	0.821	0.947	0.841
YOLOv8	0.931	0.792	0.930	0.831	0.951	0.861
YOLOv11	0.922	0.786	0.932	0.824	0.948	0.863
YOLOv8-MMD	0.945	0.812	0.924	0.836	0.949	0.859
(b) Efficiency and comprehensive metrics
Model	mAP@50-95 (Species)	mAP@50-95 (Quality)	FPS (Species)	FPS (Quality)	FLOPs (G)
YOLOv5	0.717	0.646	142	135	7.1
YOLOv7	0.705	0.638	111	123	6.9
YOLOv8	0.722	0.647	213	133	8.1
YOLOv11	0.713	0.644	131	128	6.3
YOLOv8-MMD	0.723	0.655	107	112	8.1

Share and Cite

MDPI and ACS Style

Wu, X.; Yang, K.; Zhao, W.; Wang, Y.; Chen, W.; Zhao, C. A Multi-Task Detection Approach with Multi-Scale Attention Aggregation and Feature Enhancement. Agronomy 2026, 16, 419. https://doi.org/10.3390/agronomy16040419
