Article

Fast and Accurate Detection of Forty Types of Fruits and Vegetables: Dataset and Method

by Xiaosheng Bu 1, Yongfeng Wu 2, Hongtai Lv 1 and Youling Yu 1,*
1 College of Electronic and Information Engineering, Tongji University, Shanghai 200082, China
2 School of Sports and Health, Shanghai University of Sport, Shanghai 200438, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(7), 760; https://doi.org/10.3390/agriculture15070760
Submission received: 11 February 2025 / Revised: 18 March 2025 / Accepted: 25 March 2025 / Published: 1 April 2025
(This article belongs to the Section Digital Agriculture)

Abstract
Accurate detection of fruits and vegetables is a key task in agricultural automation. However, existing detection methods typically focus on identifying a single type of fruit or vegetable and are not equipped to handle complex and diverse environments. To address this, we introduce the first large-scale benchmark dataset for fruit and vegetable detection—FV40. This dataset contains 14,511 images, covering 40 different categories of fruits and vegetables, with over 100,000 annotated bounding boxes. Additionally, we propose a novel framework for fruit and vegetable detection—FVRT-DETR. Built on the Transformer architecture, this framework provides an end-to-end real-time detection algorithm. FVRT-DETR enhances feature extraction by integrating the Mamba backbone network and improves detection performance for objects of varying scales through the design of a multi-scale deep feature fusion encoder (MDFF encoder) module. Extensive experiments show that FVRT-DETR performs excellently on the FV40 dataset. In particular, it demonstrates a significant performance advantage in the detection of small objects and in complex scenarios. Compared to existing state-of-the-art detection algorithms, such as YOLOv10, FVRT-DETR achieves better results across multiple key metrics. The FVRT-DETR framework and the FV40 dataset provide an efficient and scalable solution for fruit and vegetable detection, offering significant academic value and practical application potential.

1. Introduction

Fruits and vegetables, as essential foods in daily life, come in a wide variety of types and shapes. Taking fruits as an example, from common fruits such as apples and bananas to tropical fruits like mangoes and pineapples, each fruit has its unique appearance. However, these characteristics often undergo significant variations under different lighting, angles, and background conditions, posing a great challenge for automatic recognition. With the rapid development of computer vision technology, the automatic recognition of fruits and vegetables [1,2,3,4,5] has gradually become a core technology in various industries. Especially in agriculture, retail, intelligent warehousing, and logistics management, the ability to efficiently and accurately identify fruit types has become one of the key technologies for improving production efficiency, optimizing resource allocation, and reducing labor costs.
Although deep learning technology has made significant progress in image recognition and classification [6,7], there are still many challenges in tasks that involve recognizing multiple types of fruits and vegetables. The variety in fruit and vegetable shapes, textures, and color changes, as well as environmental factors such as lighting, shooting angles, and occlusions, often interfere with image quality, leading to subpar performance in existing detection and classification methods. Especially in the task of recognizing multiple types of fruits and vegetables, current datasets and algorithms still struggle to handle complex real-world scenarios. For example, in terms of datasets, there is a lack of high-quality and open-source fruit and vegetable detection datasets. Datasets such as Fruits360 [8], FVIRD [9], and DeepFruit [10] have obvious shortcomings. The Fruits360 dataset includes 131 categories, but its images only depict individual fruits, lacking real-world background distractions. Additionally, the only variation within each category is the direction of the fruit, making it ineffective for simulating complex environments required for object detection. Furthermore, this dataset is only designed for classification tasks and does not provide bounding box annotations, making it unsuitable for object detection research. According to its paper description, the FVIRD dataset has a small sample size, with no more than 500 images per category, and the image quality is relatively low, making it insufficient for providing adequate training and evaluation data for deep learning models. The DeepFruit dataset has a narrow coverage, containing only 20 types of fruits, and the image quality is relatively low, making it inadequate for high-quality object detection tasks. Moreover, this dataset is only intended for classification and lacks bounding box annotations, making it inapplicable to object detection research. In summary, existing publicly available fruit and vegetable datasets suffer from various shortcomings in terms of data diversity, annotation quality, and real-world adaptability, making them insufficient for object detection tasks.
To address these challenges, this paper presents an innovative solution, contributing primarily in the areas of datasets and algorithms:
1. Construction of a Fine-Grained Fruit and Vegetable Dataset: This paper introduces a dataset named FV40, which contains 40 of the most common fruits and vegetables in daily life. Each category contains at least 500 annotated objects, with a total of over 14,000 images and 100,000 annotated bounding boxes. The images in this dataset vary in quality and shooting conditions, with data collected under different angles, lighting conditions, and occlusions to ensure that the dataset accurately reflects the complexity and diversity of real-world scenarios, including objects of different scales and fine-grained categories. This dataset provides a more comprehensive foundation for fruit and vegetable recognition tasks and offers strong data support for future research and practical applications.
2. Proposal of a More Powerful Baseline Detection Algorithm: To address the performance bottlenecks in existing multi-fruit and vegetable recognition detection algorithms, this paper proposes an end-to-end real-time detection algorithm based on Transformer, called FVRT-DETR. The algorithm consists of three main components—the backbone, encoder, and decoder modules—aiming to enhance the fusion and decoding efficiency of multi-scale features. Firstly, we leveraged the advantages of the Mamba structure [11,12] to build the backbone module. Mamba, by introducing the structured state space model (SSM) mechanism, effectively enhances the model’s ability to capture long-range dependencies, thereby improving feature extraction and representation capabilities in complex scenarios. Secondly, we designed a multi-scale deep feature fusion encoder (MDFF encoder) module to enhance the integration of features across different scales. This module processes input images from multiple resolutions and improves the perception of fruit and vegetable variations and scales by integrating hierarchical features. Finally, based on RT-DETR, we decode the features from five different scales output by the encoder.
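For orientation, the following minimal PyTorch sketch shows only this three-stage composition (backbone, encoder, decoder); the class name and the placeholder modules are illustrative and are not the actual FVRT-DETR implementation.

```python
import torch.nn as nn

class FVRTDETRSketch(nn.Module):
    """High-level composition only: Mamba backbone -> MDFF encoder -> DETR-style decoder.
    The concrete modules are described in Section 3; the ones passed in here are
    placeholders supplied by the caller, not the released implementation."""
    def __init__(self, backbone, encoder, decoder):
        super().__init__()
        self.backbone, self.encoder, self.decoder = backbone, encoder, decoder

    def forward(self, images):
        feats = self.backbone(images)   # multi-level feature maps from the Mamba backbone
        fused = self.encoder(feats)     # multi-scale deep feature fusion (five output scales)
        return self.decoder(fused)      # boxes and class logits, no NMS post-processing
```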
The contributions of this paper are summarized as follows:
  • Introduction of the FV40 Dataset: We develop a fine-grained fruit and vegetable object detection dataset. The initial version contains over 14,000 images and 100,000 annotated bounding boxes, covering 40 distinct fruit and vegetable categories. The dataset is continuously being expanded and updated.
  • Development of the FVRT-DETR Algorithm: We propose the first algorithm for fast and accurate detection of multiple fruit and vegetable types, called FVRT-DETR. This algorithm innovatively uses Mamba as the backbone and introduces a novel multi-scale deep feature fusion encoder (MDFF encoder) module, which effectively enhances the model’s performance in handling multi-scale features while reducing parameters, improving detection accuracy for fruits and vegetables of various sizes and variations.
  • Improved Multi-Scale Feature Handling: To address the challenge of multi-scale fruit and vegetable detection, we propose an efficient feature fusion method within the MDFF encoder module that better integrates multi-scale feature maps, thus enhancing the detection capabilities across 40 different fruit and vegetable categories with varying sizes.
  • Scalable and Highly Adaptable Model: Unlike previous models tailored to a single type of fruit or vegetable, FVRT-DETR is designed to be scalable and highly adaptable, capable of accurately detecting a wide variety of fruit and vegetable types. This allows it to handle the diversity of produce in real-world agricultural scenarios, making it a highly versatile solution in the industry.

2. Related Work

Fruit and Vegetable Detection Algorithm

Fruit and vegetable detection [13,14,15,16,17] has always been a core application in the field of object detection. With the increasing demand, its applications in agriculture, the food industry, and automation have become more widespread. Fruits and vegetables vary greatly in type, shape, color, and size, and they often appear against complex backgrounds, which makes their detection a challenging task. Over the past few decades, fruit and vegetable detection algorithms have undergone significant transformations, evolving from traditional feature-based methods to modern deep learning approaches. This evolution has effectively driven the development of the field.
Early fruit and vegetable detection methods relied heavily on hand-crafted features [18,19,20], such as color histograms, texture features, and shape descriptors, combined with traditional machine learning models such as support vector machines and decision trees for object classification and localization. Although these methods laid the foundation for fruit object detection, their limitations became apparent as the complexity of the problem increased. They struggled with scalability, accuracy, and robustness in handling complex scenes. Additionally, these traditional methods had poor adaptability to diverse scenarios and tasks, making them less effective for detecting fruits and vegetables in varied environments.
With the rise of deep learning, the field of fruit and vegetable detection has undergone a fundamental transformation, with the YOLO series algorithms and end-to-end object detection algorithms being the most successful.
YOLO Series Algorithms. The YOLO series has become the dominant family of real-time object detection algorithms, continuously pushing the limits of efficiency. YOLOv1 introduced a unified detection framework, while YOLOv2 and YOLOv3 enhanced performance through improved feature extraction networks and multi-scale prediction. Subsequent versions, such as YOLOv4 [21] and YOLOv5 [22], introduced CSPNet as the backbone, path aggregation networks (PAN), and advanced data augmentation techniques. YOLOv6 [23] introduced BiC and SimCSPSPPF modules for the backbone and neck structures, incorporating anchor-assisted training and self-distillation strategies. YOLOv7 [24] brought in the E-ELAN module, optimizing gradient flow paths and exploring several “free” trainable techniques. YOLOv8 [25] enhanced feature extraction and fusion through its C2f module, while Gold-YOLO [26] improved multi-scale feature fusion with its advanced GD mechanism. Recently, YOLOv9 [27] proposed GELAN architecture optimizations and the PGI method to further improve the training process.
With the development of the YOLO series algorithms, they have also been widely applied to fruit and vegetable detection. Wang et al. [28] proposed an apple young fruit detection algorithm based on YOLOv5s. Wang et al. [15] introduced a strawberry detection algorithm based on YOLOv5. Chen et al. [29] proposed a multi-task detection network for cherry tomatoes based on YOLOv7. Wang et al. [30] introduced a pomegranate young fruit detection algorithm based on YOLOv8.
End-to-End Object Detection Algorithms. End-to-end object detection represents a paradigm shift from traditional pipelines to simplified architectures, aiming to eliminate manually designed components such as anchor boxes, region proposals, and non-maximum suppression (NMS). DETR (DEtection TRansformer) [31] is a groundbreaking work in this field. It introduced the Transformer architecture and the Hungarian Loss function for one-to-one matching predictions, significantly simplifying the detection pipeline.
Despite the elegant architecture of DETR, its initial version suffered from slow convergence, which prompted a wave of research aimed at improving its efficiency. Deformable-DETR [32] accelerated convergence by incorporating multi-scale deformable attention modules, while DINO [33] integrated contrastive denoising, hybrid query selection, and bidirectional forward mechanisms, significantly boosting performance. In addition to Transformer-based methods, CNN-based end-to-end detectors have also seen rapid development. Learnable NMS and relational networks have replaced traditional post-processing steps with neural network components, while OneNet [34] and DeFCN [35] introduced one-to-one matching strategies that allow fully convolutional networks to perform end-to-end detection. These studies demonstrate that the end-to-end detection paradigm continues to simplify and optimize detection architectures.
However, these methods still face significant challenges in real-time performance, as high latency limits their practical applications in real-world scenarios. To address this issue, the first real-time end-to-end detection model, RT-DETR [36], was proposed. This method significantly reduces inference latency by designing an efficient hybrid encoder and a minimal uncertainty query selection strategy, while maintaining excellent detection accuracy. This breakthrough provides an end-to-end solution for real-time object detection. Additionally, the YOLO series launched its first real-time end-to-end detection model—YOLOv10—which eliminates NMS, adopts a consistent dual-assignment strategy, and implements a globally optimized efficiency-accuracy design. This significantly reduces inference latency and computational overhead while improving detection performance.
With the development of DETR-based algorithms, such models have also been widely applied to fruit detection. Guo et al. [37] proposed a fast detection algorithm for tomato fruits based on RT-DETR. Huang et al. [38] introduced an RT-DETR-based detection model for pears in natural environments.
However, the existing YOLO and DETR-based fruit detection algorithms are typically optimized for specific fruit and vegetable types or particular scenarios. For example, many studies focus on detecting single fruits like apples, strawberries, or tomatoes, which leads to limitations when dealing with a variety of fruit types or complex environmental conditions. In reality, fruits come in a wide range of types, shapes, colors, sizes, and growth environments, so detection models optimized for a single fruit often fail to perform well in multi-fruit detection tasks.
In addition, one major limitation of these algorithms is the lack of rich datasets that include a variety of fruit types. To address this gap, this paper proposes the first dataset, FV40, which includes 40 different types of fruits and vegetables. Furthermore, we also introduce a new baseline algorithm, FVRT-DETR, aimed at improving performance in multi-fruit detection.

3. Our Method

In this section, we will elaborate on the technical details of FVRT-DETR. It is important to note that FVRT-DETR is an enhancement based on RT-DETR, with the primary improvements focused on the backbone and encoder components. Therefore, this paper will primarily discuss these two structures in detail.

3.1. Mamba Backbone

The backbone architecture of FVRT-DETR is shown in Figure 1A. Specifically, the Mamba backbone [12] is constructed by stacking the Simple Stem, Vision Clue Merge, and ODSS block modules alternately.
Simple Stem. The Simple Stem module [12] is formed by two convolution operations with a stride of 2 and a kernel size of 3, as shown in Figure 2.
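As an illustration, a minimal PyTorch sketch of such a stem is given below; the channel widths and the BN + SiLU activation after each convolution are assumptions rather than the exact Mamba YOLO configuration.

```python
import torch.nn as nn

class SimpleStem(nn.Module):
    """Sketch of the Simple Stem: two stride-2, kernel-3 convolutions that reduce
    the input resolution by 4x. BN + SiLU after each conv is an assumption."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch // 2),
            nn.SiLU(),
            nn.Conv2d(out_ch // 2, out_ch, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x):
        return self.stem(x)   # (B, 3, H, W) -> (B, out_ch, H/4, W/4)
```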
Vision Clue Merge. The Vision Clue Merge module [12] is designed to optimize the downsampling process of the feature maps while retaining more visual information flow to better support subsequent object detection tasks. Traditional convolution operations during downsampling can sometimes destroy important feature information, especially when selective operations like SS2D [39] are involved. To address this issue, Vision Clue Merge adopts an innovative strategy to preserve and merge more visual cues through a series of operations. Specifically, Vision Clue Merge first removes the normalization operation and then splits and reconfigures the dimensions of the feature maps to improve information integration. It also enhances the model’s ability to perceive different features by appending extra feature information to the channel dimension. Additionally, a 4 × compressed pointwise convolution is used for downsampling. The detailed structure of Vision Clue Merge is shown in Figure 3.
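The following sketch illustrates the general idea under stated assumptions: spatial detail is folded into the channel dimension (space-to-depth over 2 × 2 neighborhoods) and then compressed by a pointwise convolution; the exact split and reordering used in the original Vision Clue Merge may differ.

```python
import torch
import torch.nn as nn

class VisionClueMerge(nn.Module):
    """Sketch of Vision Clue Merge: instead of a strided convolution, each 2x2
    spatial neighbourhood is folded into the channel dimension and a pointwise
    convolution compresses the 4x-expanded channels. Details are assumptions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pw = nn.Conv2d(4 * in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        # space-to-depth: (B, C, H, W) -> (B, 4C, H/2, W/2), preserving all values
        x = torch.cat(
            [x[..., 0::2, 0::2], x[..., 1::2, 0::2], x[..., 0::2, 1::2], x[..., 1::2, 1::2]],
            dim=1,
        )
        return self.pw(x)   # pointwise convolution compresses the 4C channels
```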
ODSS Block. The ODSS block (optimal dynamic state space block) [12] is the core module of the Mamba backbone architecture, designed to efficiently process input features while enhancing the model’s ability to capture both local and global dependencies in visual data. Its structure is shown in Figure 4. Firstly, the input feature $Z^{l-3}$ undergoes a 1 × 1 convolution operation, followed by batch normalization (BN) to stabilize the training process, and is then activated using a non-linear activation function (SiLU) to produce the intermediate output $Z^{l-2}$.
Next, the ODSS block adopts a Transformer-like design, incorporating layer normalization (LN) and residual connections to ensure efficient information flow even in deep stacking scenarios. These designs help maintain stable calculations and prevent gradient vanishing or explosion issues. The intermediate output $Z^{l-1}$ and the final output $Z^{l}$ are computed as follows:
$$Z^{l-1} = \mathrm{SS2D}\big(\mathrm{LN}(\mathrm{LS}(Z^{l-2}))\big) + Z^{l-2}$$
$$Z^{l} = \mathrm{RG}\big(\mathrm{LN}(Z^{l-1})\big) + Z^{l-1}$$
In this process, the SS2D (2D selective scan; see Figure 4a) module captures multi-dimensional information by performing a scan expansion on the input feature map in four symmetric directions. These sub-images are processed through an S6 block for feature extraction and then merged back to the original size of the output image through a scanning merge operation.
To further enhance the ability to capture local features, the ODSS block introduces the local spatial block (LS; see Figure 4b), which uses depthwise separable convolutions to effectively extract local spatial information while reducing computational cost. After batch normalization, the resulting feature map $F^{l-1}$ mixes channel information via a 1 × 1 convolution and is fused with the original input through a residual connection. This enhances feature representation and improves robustness to scale variations. The specific formula is as follows:
$$F^{l} = \mathrm{Conv}_{1\times 1}\big(\Phi(\mathrm{Conv}_{1\times 1}(F^{l-1}))\big) + F^{l-2}$$
Additionally, the ODSS block introduces the residual gated block (RG), a design based on gating mechanisms to improve feature extraction performance. The RG block (see Figure 4c) splits the input into two branches, processing each with a 1 × 1 convolution. One branch uses depthwise separable convolutions as a position encoding module, and the two branches are then fused through element-wise multiplication. This design not only preserves spatial information but also makes the model more sensitive to fine-grained features. Finally, the output feature $X^{l}$ is computed as follows:
$$X^{l} = \mathrm{Conv}_{1\times 1}\Big(X_1^{l-1} \odot \Phi\big(\mathrm{DWConv}_{3\times 3}(X_2^{l-1}) + X_2^{l-1}\big)\Big) + X^{l-2}$$
Through these designs, the ODSS block efficiently captures and integrates both local and global features, enhancing the model’s performance in complex tasks, especially in handling scale variations and contextual information.
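To make the wiring of the equations above concrete, the skeleton below reproduces only the residual structure of the ODSS block; SS2D, LS, and RG are passed in as placeholder modules, and using GroupNorm as a channel-wise LayerNorm substitute for 2D feature maps is a simplification.

```python
import torch
import torch.nn as nn

class ODSSBlock(nn.Module):
    """Skeleton of the ODSS block residual structure: only the wiring for
    Z^{l-2}, Z^{l-1}, and Z^{l} is reproduced; the internals of SS2D, LS, and RG
    are supplied by the caller and are not implemented here."""
    def __init__(self, channels, ss2d, ls, rg):
        super().__init__()
        # 1x1 conv + BN + SiLU producing the intermediate feature Z^{l-2}
        self.stem = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        self.ln1 = nn.GroupNorm(1, channels)   # LayerNorm stand-in for 2D maps
        self.ln2 = nn.GroupNorm(1, channels)
        self.ss2d, self.ls, self.rg = ss2d, ls, rg

    def forward(self, z):                                  # z = Z^{l-3}
        z2 = self.stem(z)                                  # Z^{l-2}
        z1 = self.ss2d(self.ln1(self.ls(z2))) + z2         # Z^{l-1}
        return self.rg(self.ln2(z1)) + z1                  # Z^{l}

# shape-preserving placeholders just to exercise the wiring
blk = ODSSBlock(64, ss2d=nn.Identity(), ls=nn.Identity(), rg=nn.Identity())
out = blk(torch.randn(1, 64, 40, 40))   # -> (1, 64, 40, 40)
```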

3.2. MDFF Encoder

The MDFF encoder is a sophisticated architecture designed to enhance feature extraction for object detection tasks, combining multi-level feature representations and progressive refinement through specialized modules. The structure is designed to handle both fine-grained local features and global semantic information across various scales, ensuring that the model can efficiently detect objects of different sizes and complexities. The detailed structure of MDFF encoder is shown in Figure 1B.
CARAFE. The architecture begins by using CARAFE (content-aware reassembly of features) [40], which is employed multiple times throughout the model. CARAFE is responsible for content-aware upsampling, effectively recovering high-resolution details from lower-level feature maps. This helps mitigate the loss of spatial information during downsampling, ensuring that the model retains critical fine-grained details for precise feature extraction.
Specifically, CARAFE predicts reorganization kernels at each position using low-level content information and reorganizes features within predefined nearby regions. By incorporating content information, CARAFE can utilize adaptive and optimized reorganization kernels at different positions, resulting in better performance compared to conventional upsampling operators (such as interpolation or transposed convolution). CARAFE consists of two steps: First, it predicts the reorganization kernels for each target position, and then it uses the predicted kernels to reorganize the features. Given a feature map of size H×W×C and an upsampling ratio U, CARAFE generates a new feature map of size UH×UW×C. The kernel prediction module of CARAFE generates position-specific kernels based on the content of the input features, and then the content-aware reorganization module uses these kernels to reorganize the features.
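The following sketch illustrates these two steps (kernel prediction and content-aware reassembly) in PyTorch; the compression width and kernel sizes are illustrative defaults, not necessarily those used in FVRT-DETR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFEUpsample(nn.Module):
    """Minimal sketch of content-aware reassembly of features (CARAFE).
    Kernel sizes (k_up, k_enc) and the compression width c_mid are assumptions."""
    def __init__(self, channels, scale=2, k_up=5, k_enc=3, c_mid=64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(channels, c_mid, 1)              # channel compressor
        self.kernel_pred = nn.Conv2d(                              # predicts k_up*k_up kernels
            c_mid, scale * scale * k_up * k_up, k_enc, padding=k_enc // 2
        )

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.scale, self.k_up
        # step 1: predict a position-specific reassembly kernel from the content
        kernels = self.kernel_pred(self.compress(x))               # (b, s*s*k*k, h, w)
        kernels = F.pixel_shuffle(kernels, s)                      # (b, k*k, s*h, s*w)
        kernels = F.softmax(kernels, dim=1)                        # normalize each kernel
        # step 2: reassemble the k x k neighbourhood of each source position
        patches = F.unfold(x, k, padding=k // 2)                   # (b, c*k*k, h*w)
        patches = patches.view(b, c * k * k, h, w)
        patches = F.interpolate(patches, scale_factor=s, mode="nearest")
        patches = patches.view(b, c, k * k, s * h, s * w)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)         # (b, c, s*h, s*w)

x = torch.randn(1, 64, 32, 32)
y = CARAFEUpsample(64)(x)   # -> (1, 64, 64, 64)
```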
Feature concatenation. Feature concatenation plays a significant role in the MDFF encoder. Feature maps from different stages of the backbone network are concatenated with intermediate features processed through CARAFE. This fusion of features at various scales allows the model to combine high-level semantic information from deeper layers with low-level details from shallower layers. The concatenation is carried out at different stages (e.g., P1, P2, P3, P4, and P5), progressively building richer feature maps that integrate information from multiple resolutions, which is vital for handling objects of various sizes.
ODSS block. At the core of the MDFF encoder is the ODSS block [12], a crucial element that processes concatenated feature maps and refines them progressively. The ODSS block enhances feature representations by applying convolutions, activations, and residual connections. It includes operations such as the selective scan (SS2D), local spatial block (LS), and residual gated block (RG), which capture both local and global dependencies within the feature maps. These operations ensure that the model efficiently learns rich, multi-scale contextual features while maintaining spatial resolution. The ODSS block is used at multiple stages of the architecture, where the channel depth varies, reflecting the increasing abstraction and refinement of feature maps.
In addition to CARAFE and the ODSS block, the encoder uses convolution layers with 3 × 3 kernels and a stride of 2 at several stages to reduce the spatial dimensions of the feature maps while increasing their channel depth. These convolutions help refine the features further, enabling the network to process more abstract, high-level semantic information as the feature maps progress through the encoder. The increasing channel depth at deeper stages of the model provides richer representations that are more suitable for high-level object recognition tasks.
Finally, the encoder outputs are passed through the RTDETRDecoder, which consolidates the multi-scale feature maps from different encoder stages and produces the final predictions. The decoder processes the concatenated feature maps (from P1 to P5) to generate bounding box coordinates, object class labels, and other related outputs for object detection. The decoder’s design ensures that the model can make accurate predictions by leveraging both local and global contextual features.
It is important to emphasize that, compared to the mainstream PAFPN [41] used in the neck structures of YOLO-series algorithms, our MDFF encoder design provides more scale-specific feature outputs. This multi-scale feature output is particularly important when handling inputs of varying resolutions, as it effectively captures details at different levels. Traditional PAFPN structures typically focus on a few predefined scales, whereas the MDFF encoder introduces multiple scale layers and feature maps of different sizes. This allows the model to simultaneously fuse information and enhance features at multiple scales, improving its ability to detect targets of various sizes and proportions.
Specifically, the MDFF encoder gradually introduces feature maps of different sizes for fusion, ensuring that features are passed through layers from coarse to fine. This enables the model to perceive global information while also capturing local details effectively, which is especially beneficial when dealing with complex scenes and images with large scale variations. Additionally, the MDFF encoder further optimizes feature propagation and recombination through dynamic feature fusion modules such as CARAFE and the ODSS block. This ensures that features at each scale are enhanced at different levels, providing richer hierarchical feature outputs that help the model better handle a wide range of targets and scenes.
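As a simplified illustration of this coarse-to-fine fusion, the sketch below upsamples a deeper feature map, concatenates it with a shallower one, and refines the result; nearest upsampling and a plain convolution block stand in for CARAFE and the ODSS block, and the channel widths are arbitrary.

```python
import torch
import torch.nn as nn

class TopDownFusion(nn.Module):
    """Illustrative sketch of one coarse-to-fine fusion step in the MDFF encoder:
    upsample the deeper map, concatenate with the shallower backbone feature, and
    refine. CARAFE and the ODSS block are replaced by simple stand-ins here."""
    def __init__(self, deep_ch, shallow_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")   # stand-in for CARAFE
        self.refine = nn.Sequential(                            # stand-in for the ODSS block
            nn.Conv2d(deep_ch + shallow_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, deep, shallow):
        fused = torch.cat([self.up(deep), shallow], dim=1)      # multi-scale concatenation
        return self.refine(fused)

# e.g. fuse a 20x20 deep map into a 40x40 shallow map
p4, p3 = torch.randn(1, 256, 20, 20), torch.randn(1, 128, 40, 40)
out = TopDownFusion(256, 128, 128)(p4, p3)   # -> (1, 128, 40, 40)
```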

4. Experimental Details

In this section, we first introduce the FV40 dataset used for model training and testing (Section 4.1), followed by specific implementation details and parameters used during both training and testing (Section 4.2). Finally, we outline the criteria for evaluating the model’s performance in detail (Section 4.3).

4.1. Dataset

The FV40 dataset constructed in this study is a large-scale, fine-grained benchmark dataset for fruit and vegetable detection, designed to closely reflect real-world application scenarios such as agricultural production and daily life. The dataset covers 40 common categories of fruits and vegetables, each with no fewer than 500 annotated objects. In total, FV40 contains 14,511 images of varying resolutions and provides over 100,000 precisely labeled bounding boxes. Compared with existing datasets, FV40 places greater emphasis on simulating complex real-world environments and covering fine-grained object variations.
To ensure that the dataset accurately reflects real-world challenges, we incorporated images from diverse environments and conditions, including variations in lighting (natural daylight, artificial light, and dim environments), different angles of capture (top view, side view, and occluded views), and various background complexities (cluttered supermarket shelves, outdoor farm fields, and indoor market displays). In addition, we specifically designed the dataset to support fine-grained detection tasks. It includes not only complete and standard forms of fruits and vegetables, but also a wide range of real-life processed states, such as being cut, peeled, partially consumed, or damaged, as well as cooked or mixed into dishes. These images realistically capture the diverse appearances and structural variations of fruits and vegetables in complex environments such as market stalls, kitchens, and dining scenes. This significantly increases the difficulty of detection tasks.
The images in the dataset come from two main sources: on one hand, from two open-source datasets, the LVIS dataset [42] and the CitDet dataset [43], which provide representative images of fruits and vegetables. On the other hand, from various images of fruits and vegetables collected from the internet, spanning multiple sources. We rigorously screened and cleaned the images collected from the internet to ensure that the dataset meets high standards in both quality and copyright compliance, making it suitable for research applications. Furthermore, we conducted strict clarity screening (with a resolution no less than 256 × 256 and clear, noise-free images) for these images.
To ensure annotation quality for the newly collected, unlabeled images, we adopted a manual annotation strategy and established a standardized annotation guideline to ensure consistency and accuracy in the annotations. Firstly, in terms of bounding box tightness, we require that bounding boxes fit as closely as possible to the target object, minimizing the inclusion of unnecessary background. Annotators must carefully follow the object’s contours to ensure that the bounding box fully covers the target while avoiding excessive redundancy. Secondly, for occluded objects, we applied different annotation strategies. If an object is partially occluded, the bounding box is drawn around the visible portion, preserving as much of its complete shape as possible. If an object is highly occluded, making it difficult to accurately determine its category or boundaries, it is not annotated, to avoid introducing noisy data. When the object’s shape can still be reasonably inferred, annotators are allowed to annotate it based on contextual information and include an occlusion flag. For partially visible objects (i.e., objects cropped at the image boundary), if the visible portion is sufficient for classification, the object is still annotated, with the bounding box strictly covering only the visible area. Objects with only a minimal visible portion that makes identification difficult are not annotated, to reduce uncertainty-related errors. Finally, if an annotator is unable to confidently determine an object’s category due to poor image quality, severe occlusion, or unclear distinguishing features, the object is not annotated.
To ensure annotation consistency, we utilized the LabelImg tool for assisted annotation and established a standardized TXT-based annotation format to maintain a uniform structure. All annotators underwent systematic training to minimize human errors and improve annotation quality. Additionally, each annotated image went through multiple rounds of review, including cross-validation by different annotators and a final inspection by senior annotators, to ensure annotation accuracy and consistency.
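For reference, a minimal reader for such TXT annotations is sketched below; it assumes the YOLO-style "class_id cx cy w h" line format with normalized coordinates that LabelImg can export, which may differ in detail from the released FV40 files.

```python
from pathlib import Path

def load_yolo_txt(label_path, img_w, img_h):
    """Parse one LabelImg-style TXT annotation file (assumed YOLO convention:
    'class_id cx cy w h' with coordinates normalized to [0, 1])."""
    boxes = []
    for line in Path(label_path).read_text().splitlines():
        if not line.strip():
            continue
        cls_id, cx, cy, w, h = line.split()
        cx, cy = float(cx) * img_w, float(cy) * img_h
        w, h = float(w) * img_w, float(h) * img_h
        # convert center format to corner format (x1, y1, x2, y2) in pixels
        boxes.append((int(cls_id), cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes
```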
Following this rigorous annotation process, we ultimately constructed the FV40 dataset, which features high-quality object annotations, covers a wide range of object scales, and maintains a well-balanced distribution across different categories. Furthermore, the dataset demonstrates strong applicability to multi-scale object detection tasks. The distribution of bounding boxes across different object categories is shown in Figure 5A, while Figure 5B illustrates the distribution of objects across different scales, further validating its applicability and challenges in real-world scenarios. Figure 6 presents sample images from the dataset, providing a visual representation of its diversity and quality.

4.2. Experimental Setup

The training process was conducted on a system configured with 4 × A800 80 GB GPUs. The system runs Ubuntu 20.04, with CUDA version 11.8 and cuDNN version 8.2. The implementation uses Python 3.8 with PyTorch 2.0.0 as the deep learning framework. Other key libraries and dependencies follow RT-DETR [36]. We trained each category separately. The default hyperparameter settings are shown in Table 1. It is worth noting that some hyperparameters may vary slightly across experiments, and the same experiment may require hyperparameter adjustments in different environments to achieve the best performance.

4.3. Evaluation Metrics

In this section, to achieve a comprehensive evaluation of the model, we introduce metrics that assess the model’s detection efficiency, including the number of parameters (Params), floating-point operations (FLOPs), and frames per second (FPS), as well as metrics that evaluate the model’s detection accuracy, such as precision, recall, and various average precision (AP) metrics at different IoU thresholds and object scales.
Number of Parameters (#Params). The total number of parameters (#Params) in the model is a crucial measure of model complexity and memory usage. It is computed by summing the number of parameters across all layers of the model.
FLOPs (Floating-Point Operations). FLOPs is a measure of the computational complexity of the model. It quantifies the number of floating-point operations required to process an input through the model. The total FLOPs for a model can be calculated as the sum of the FLOPs across all layers.
FPS (Frames Per Second). FPS measures the speed of the model, specifically how many images can be processed per second. It is crucial for evaluating the real-time inference capability of the model.
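The two efficiency metrics that can be measured directly in code are sketched below (parameter counting and a rough FPS estimate); FLOPs are typically obtained with a profiler and are omitted here, and the batch size and input resolution used below are assumptions rather than the paper's protocol.

```python
import time
import torch

def count_params(model):
    """Total number of learnable parameters (#Params)."""
    return sum(p.numel() for p in model.parameters())

@torch.no_grad()
def measure_fps(model, img_size=640, n_warmup=10, n_iters=100, device="cuda"):
    """Rough FPS estimate: average latency of single-image inference.
    Batch size 1 and a 640x640 input are illustrative assumptions."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, img_size, img_size, device=device)
    for _ in range(n_warmup):          # warm-up iterations are excluded from timing
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return n_iters / (time.perf_counter() - t0)
```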
Precision. Precision is a metric that quantifies the accuracy of the positive predictions. It is defined as the ratio of true positive predictions ($TP$) to the sum of true positive and false positive predictions ($FP$):
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
where $TP$ denotes true positives (correctly predicted positive samples) and $FP$ denotes false positives (samples incorrectly predicted as positive).
Recall. Recall (or sensitivity) measures how well the model identifies all the relevant positive samples. It is defined as the ratio of true positive predictions ($TP$) to the sum of true positives and false negatives ($FN$):
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where $FN$ denotes false negatives (positive samples incorrectly predicted as negative).
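A direct implementation of these two definitions is given below for completeness; the example counts are made up for illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts, following the two formulas above."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# e.g. 90 correct detections, 10 false alarms, 20 missed objects:
p, r = precision_recall(tp=90, fp=10, fn=20)   # p = 0.90, r ≈ 0.818
```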
Mean Average Precision (mAP). Mean average precision (mAP) is a comprehensive metric for evaluating the performance of object detection models across multiple classes. It summarizes the precision–recall curve into a single value by computing the area under the curve (AUC) for each class and averaging the resulting AP values across all classes. The formula for mAP at a specific IoU threshold $t$ is:
$$\mathrm{mAP}_t^{val} = \frac{1}{C}\sum_{c=1}^{C} AP_t^{val}(c)$$
where $C$ is the total number of object classes, and $AP_t^{val}(c)$ represents the AP for class $c$ at the IoU threshold $t$. In practice, model performance is evaluated at multiple intersection-over-union (IoU) thresholds, commonly 0.50, 0.75, and the range 0.50:0.95, where the latter ($\mathrm{mAP}_{50:95}^{val}$) denotes the mean over IoU thresholds from 0.5 to 0.95 with a step size of 0.05.
mAP at Different Scales. mAP is often computed separately for small, medium, and large objects to assess model performance across varying object scales:
$\mathrm{mAP}_S^{val}$: mean average precision for small objects (objects with an area less than 32 × 32 pixels):
$$\mathrm{mAP}_S^{val} = \frac{1}{C}\sum_{c=1}^{C} AP_S^{val}(c)$$
$\mathrm{mAP}_M^{val}$: mean average precision for medium objects (objects with an area between 32 × 32 and 96 × 96 pixels):
$$\mathrm{mAP}_M^{val} = \frac{1}{C}\sum_{c=1}^{C} AP_M^{val}(c)$$
$\mathrm{mAP}_L^{val}$: mean average precision for large objects (objects with an area greater than 96 × 96 pixels):
$$\mathrm{mAP}_L^{val} = \frac{1}{C}\sum_{c=1}^{C} AP_L^{val}(c)$$
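The sketch below illustrates how per-class AP values are averaged into mAP, with a COCO-style 101-point interpolation assumed for the AP computation; the example class names and AP values are invented for illustration.

```python
import numpy as np

def average_precision(recalls, precisions):
    """Area under one precision-recall curve (one class, one IoU threshold),
    using a COCO-style 101-point interpolation (an assumption; other schemes exist)."""
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    recall_points = np.linspace(0.0, 1.0, 101)
    # at each recall level, take the maximum precision achievable at recall >= that level
    interp = [precisions[recalls >= r].max() if np.any(recalls >= r) else 0.0
              for r in recall_points]
    return float(np.mean(interp))

def mean_average_precision(ap_per_class):
    """mAP at a given IoU threshold: the mean of per-class AP values (as in the formulas above)."""
    return float(np.mean(list(ap_per_class.values())))

# e.g. AP_t^val(c) already computed per class (values invented for illustration):
print(mean_average_precision({"apple": 0.82, "banana": 0.77, "broccoli": 0.71}))
```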

5. Results and Analysis

In this section, we present the experimental results and provide a detailed analysis of the proposed FVRT-DETR model. The evaluation covers both the overall performance compared to state-of-the-art benchmarks (Section 5.1) and an in-depth ablation study to understand the impact of key architectural components (Section 5.2). By examining these results, we aim to demonstrate the effectiveness of FVRT-DETR in terms of accuracy, efficiency, and robustness across a diverse range of scenarios.

5.1. Benchmark Algorithm Performance Evaluation

In this section, we split the FV40 dataset into training and validation sets with a 7:3 ratio. We then compared the performance of FVRT-DETR with several existing mainstream detection algorithms on fruit and vegetable detection tasks across different categories and scales. The final results are shown in Table 2.
Taking the YOLO series as an example, although YOLOv5 and YOLOv6 demonstrate high speeds in terms of FPS, their detection accuracy for multi-class objects, particularly for small-sized targets, is significantly lower than that of FVRT-DETR. While the YOLO series is widely used in practical applications due to its higher frame rates and lower computational overhead, its accuracy tends to drop in more complex tasks, especially when dealing with objects of various sizes and shapes.
The DETR series and its variants (such as Deformable-DETR and DINO-Deformable-DETR) show strong adaptability for multi-scale object detection (e.g., DINO-Deformable-DETR achieves an $\mathrm{mAP}_{50:95}^{val}$ of 77.3). However, their high computational cost and slower inference speed pose challenges in real-time applications. The computational overhead and relatively low inference speed of the DETR series hinder their performance in tasks requiring rapid responses.
In contrast, FVRT-DETR-L demonstrates clear advantages. For example, compared to the second-best algorithm in the end-to-end series, RT-DETRv2 (R101), FVRT-DETR-L achieves a 2.9-point improvement in mAP. Compared to the second-best algorithm in the YOLO series, YOLOv12, FVRT-DETR-L improves recall by 1.3 points. This indicates that FVRT-DETR outperforms other algorithms in both detection accuracy and recall for multi-class objects.
We also present the detection results of the FVRT-DETR algorithm on our proposed FV40 dataset. As shown in Figure 7, our FVRT-DETR can accurately detect very small objects, such as citrus and grapes, and also reliably detect each banana on the fruit rack and the broccoli chopped into small pieces on the plate, among others. These detection results further demonstrate the superior performance of FVRT-DETR in various complex scenarios, including precise localization of small objects, effective differentiation of dense objects, and reliable detection of multiple target categories. These results not only showcase FVRT-DETR’s high-precision detection capabilities but also highlight its practicality and reliability in real-world applications.
To further validate the performance of FVRT-DETR under different data splits, we partitioned the FV40 dataset into training, validation, and test sets with a 7:2:1 ratio. We compared FVRT-DETR with the best-performing algorithms from Table 2 on multi-class detection tasks and multi-scale object detection. The final results, shown in Table 3, confirm the conclusions from Table 2, demonstrating that FVRT-DETR maintains significant advantages in both multi-class and multi-scale object detection tasks.
Overall, FVRT-DETR consistently outperforms several variants of the YOLO and DETR series in both Table 2 and Table 3. It demonstrates notable strengths in handling multi-scale objects and balancing inference speed. Based on these results, we believe FVRT-DETR has great potential for complex object detection tasks, especially in practical scenarios that demand both high accuracy and real-time performance.

5.2. Ablation Study

Ablation studies were conducted to analyze the contributions of different architectural components within FVRT-DETR. By systematically removing or altering specific modules, we evaluated their impact on detection performance and efficiency. This process provided deeper insights into how each component influenced the overall model, enabling us to refine the design and improve the balance between accuracy and computational cost. In the following subsections, we focus on two key aspects: the Mamba backbone (Section 5.2.1) and the MDFF encoder (Section 5.2.2).
The rationale for this experimental sequence is that FVRT-DETR builds on RT-DETR (R18) as its foundation: the first improvement step introduces the Mamba backbone network, and the second constructs the MDFF encoder.

5.2.1. Mamba Backbone

In this ablation study, we evaluate the impact of different backbone architectures on the performance of the FVRT-DETR algorithm. The primary motivation for this experiment is to investigate how variations in backbone design affect the model’s efficiency and detection accuracy. The backbone plays a pivotal role in feature extraction, which is crucial for object detection performance. Therefore, selecting the right backbone can significantly influence both the computational cost and the final detection results.
We begin by testing several well-known backbone architectures, including ResNet variants (R34, R50, R101) and lightweight architectures such as MobileNetV1/V2/V3 and ShuffleNetV1/V2, among others. These backbones span a broad range of trade-offs between computational efficiency and detection accuracy, from heavy models with high performance to lightweight models designed for resource-constrained environments. Following this, we introduce the Mamba backbone variants (Mamba-T, Mamba-B, and Mamba-L) to explore their potential in improving both the speed and accuracy of FVRT-DETR. The Mamba backbones are specifically engineered to balance performance and efficiency, with the aim of optimizing detection across small and large object categories.
The results, shown in Table 4, demonstrate clear patterns in both efficiency and accuracy. Lightweight backbones such as MobileNetV1 and ShuffleNetV2 achieve high frame rates (FPS), with MobileNetV1 reaching up to 256 FPS. However, these models tend to sacrifice some accuracy, particularly in precision and recall. In contrast, the Mamba-T, Mamba-B, and Mamba-L backbones show a remarkable balance between efficiency and accuracy. Although they require more parameters and FLOPs compared to MobileNet, they maintain competitive frame rates—375 FPS for Mamba-T, 232 FPS for Mamba-B, and 97 FPS for Mamba-L—while also delivering strong accuracy, especially in large-scale detection tasks.
In terms of accuracy, the FVRT-DETR models with Mamba backbones consistently outperform the standard ResNet-based RT-DETR models. The Mamba-T achieves an impressive $\mathrm{mAP}_S^{val}$ of 57.6, demonstrating its ability to provide real-time performance without significant loss of detection quality. Mamba-B and Mamba-L further improve upon this, with Mamba-L reaching the highest $\mathrm{mAP}_S^{val}$ of 61.2, offering a robust solution for applications requiring both high precision and efficiency. Furthermore, the Mamba backbones show better scalability in handling both small and large objects. As we move from Mamba-T to Mamba-L, the models demonstrate improved performance in terms of $\mathrm{mAP}_S^{val}$ and $\mathrm{mAP}_L^{val}$, indicating their ability to handle a diverse range of object sizes effectively.
Implications for Future Research and Applications: The Mamba backbone introduces a unique balance between efficiency and accuracy, making it suitable for both resource-constrained environments and high-performance applications. This design opens the door for future research to explore further optimization of backbone architectures. For instance, the integration of Mamba backbones with novel techniques, such as self-supervised learning or Transformer-based methods, could yield even better accuracy without significantly increasing computational cost. Moreover, the ability of Mamba backbones to handle objects of varying scales positions them as an ideal solution for real-world applications such as autonomous driving, surveillance, and medical imaging, where detection precision across different object sizes is crucial.

5.2.2. MDFF Encoder

In this ablation study, we aim to evaluate the impact of different feature fusion strategies on the detection performance of FVRT-DETR and to validate the effectiveness of our proposed MDFF (multi-scale dynamic feature fusion) module. Feature fusion plays a crucial role in object detection tasks, as an effective fusion strategy can fully utilize multi-scale features, enhance object representation, and ultimately improve detection accuracy. Existing feature fusion methods, such as PA-FPN, Bi-FPN, and HS-FPN, have achieved certain successes in feature aggregation. However, they still face limitations in handling small objects, complex backgrounds, and multi-scale targets. To address these challenges, we introduce the MDFF module to further optimize multi-scale feature fusion, improving the model’s ability to detect objects of different scales, especially in small object detection.
Table 5 presents a performance comparison of different feature fusion methods within the FVRT-DETR framework. The experimental results demonstrate that the FVRT-DETR models incorporating MDFF outperform all other methods across all accuracy metrics, with a particularly significant improvement in small object detection. Specifically, in terms of small object average precision ($\mathrm{AP}_S^{val}$), FVRT-DETR-L with MDFF achieves 62.5, improving by 2.0 and 1.8 percentage points compared to Bi-FPN (60.5) and HS-FPN (60.7), respectively. This improvement is attributed to MDFF’s incorporation of fine-grained hierarchical feature fusion, which allows for more precise capture of local features in small objects, thereby enhancing detection accuracy. Moreover, MDFF also achieves superior performance on medium objects ($\mathrm{AP}_M^{val}$ = 74.0) and large objects ($\mathrm{AP}_L^{val}$ = 80.5), demonstrating stronger robustness and generalization capabilities in multi-scale object detection.
In terms of overall detection accuracy, the FVRT-DETR-L model equipped with MDFF excels in key metrics such as $\mathrm{mAP}^{val}$, $\mathrm{mAP}_{50}^{val}$, and $\mathrm{mAP}_{75}^{val}$, reaching values of 71.6, 90.4, and 74.6, respectively, significantly outperforming PA-FPN, Bi-FPN, and HS-FPN. Furthermore, MDFF also shows superior performance in precision (84.6) and recall (80.4), indicating its enhanced ability to reduce both false negatives and false positives.
Although the MDFF module increases the number of parameters (44.6 M) and FLOPs (270.1 G) compared to other methods, it still achieves an FPS of 63, meeting real-time detection requirements. This result suggests that despite the added computational complexity, MDFF’s efficient design ensures its practicality in real-world applications, achieving a balance between accuracy and computational cost.
Implications for Future Research and Applications: The MDFF encoder’s ability to improve multi-scale feature fusion, especially for small objects, makes it highly applicable in domains such as medical imaging, satellite imagery, and autonomous navigation, where small object detection is a critical task. Future work can explore combining MDFF with emerging multi-modal techniques to improve performance in more complex environments. Additionally, the enhanced ability of MDFF to handle diverse object sizes and its real-time detection capability positions it as an effective solution for resource-constrained settings where both accuracy and speed are vital. Further optimization of the MDFF module could reduce the computational cost, making it even more applicable to edge devices.

6. Conclusions

In this paper, we presented an innovative approach to fruit and vegetable detection by introducing both a novel dataset and an advanced detection algorithm. To address the limitations of existing datasets and the challenges posed by the diverse characteristics of fruits and vegetables in real-world scenarios, we constructed the FV40 dataset, a large-scale and diverse benchmark specifically designed for multi-class fruit and vegetable detection tasks. The FV40 dataset comprises over 14,000 high-quality images with varying conditions, including different lighting, angles, and occlusions, and provides more than 100,000 meticulously annotated bounding boxes covering 40 different fruit and vegetable categories. The dataset was collected from a combination of open-source datasets, such as LVIS and CitDet, and carefully curated internet sources, ensuring its diversity, realism, and compliance with high standards of quality and copyright. This dataset serves as a valuable resource for both academic research and practical applications in intelligent agriculture, retail, and automated warehousing. The dataset is continuously being expanded in scale while also improving annotation quality.
To complement the dataset, we proposed FVRT-DETR, a novel end-to-end real-time detection algorithm based on Transformer architecture. The algorithm addresses the performance bottlenecks of existing methods by integrating an efficient Mamba-based backbone and an innovative multi-scale deep feature fusion encoder (MDFF encoder). The Mamba backbone, leveraging the structured state space model (SSM), enhances the model’s capability to capture long-range dependencies, improving feature extraction and representation in complex environments. The MDFF encoder module further optimizes multi-scale feature integration, providing superior feature fusion across different object sizes and enhancing the model’s detection robustness for small, medium, and large-scale targets. Experimental results on the FV40 dataset demonstrate that FVRT-DETR consistently outperforms state-of-the-art detection models across various evaluation metrics, achieving superior precision, recall, and mAP scores, while maintaining real-time inference speeds, thus meeting the demands of practical applications.
Ablation studies further validate the effectiveness of the proposed architectural components. The results indicate that the MDFF encoder significantly improves multi-scale feature fusion, particularly excelling in small object detection by incorporating fine-grained hierarchical feature integration. Additionally, despite the slightly increased computational cost, FVRT-DETR achieves a favorable trade-off between accuracy and efficiency, making it highly applicable to real-world deployment scenarios.
However, it is important to note that the current work primarily deals with still images, and processing real-time and moving images presents additional challenges. Future work will focus on expanding the proposed framework to handle dynamic environments and moving objects, thereby improving its applicability to real-time video processing. Additionally, future research will involve expanding the FV40 dataset with more categories and environmental conditions, as well as further optimizing FVRT-DETR to support other downstream tasks, such as ripeness estimation and defect detection, contributing to the advancement of intelligent agriculture and food supply chain management. Moreover, due to the high complexity of the FV40 dataset in terms of category diversity, scene complexity, occlusion levels, and object scale distribution, existing detection algorithms—including our proposed FVRT-DETR—still fall significantly short of achieving 100% accuracy. In future work, we will further optimize the model architecture.
In summary, this study makes a significant contribution by providing the FV40 dataset as a comprehensive benchmark and proposing FVRT-DETR as an efficient and accurate detection framework. The proposed solution effectively addresses key challenges such as multi-scale detection, data diversity, and real-time processing, offering a scalable and flexible solution applicable to various agricultural and commercial applications. For example, in agriculture, FVRT-DETR, trained extensively on the FV40 dataset, can be deployed in automated harvesting systems to help identify ripe fruits in real time, improving the accuracy of sorting and harvesting processes. In retail, FVRT-DETR can be integrated into inventory management systems to automatically monitor stock levels on shelves, detect damaged or overripe products, ensure better quality control, and reduce waste. Furthermore, its real-time detection capabilities can also support supermarket self-checkout systems, enabling precise identification and tracking of various products.

Author Contributions

Conceptualization, X.B. and Y.W.; methodology, X.B. and H.L.; software, X.B.; validation, X.B., Y.W., H.L. and Y.Y.; formal analysis, X.B. and Y.W.; investigation, X.B. and H.L.; resources, Y.Y.; data curation, Y.W. and Y.Y.; writing—original draft preparation, X.B.; writing—review and editing, Y.W. and Y.Y.; visualization, Y.W. and Y.Y.; supervision, Y.W. and Y.Y.; project administration, Y.Y.; funding acquisition, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Natural Science Foundation of China (Grant No. 62473295).

Institutional Review Board Statement

This study involved only observation and did not involve any handling of animals; therefore, ethical approval was not required.

Data Availability Statement

The FV40 dataset will be available at https://github.com/BXS-git/FV40.git, accessed on 26 March 2025. We will continuously update the dataset’s scale and annotation quality at the provided link.

Acknowledgments

The authors would like to thank the journal editors and anonymous reviewers for their help in improving the quality of our paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Parico, A.I.B.; Ahamed, T. Real Time Pear Fruit Detection and Counting Using YOLOv4 Models and Deep SORT. Sensors 2021, 21, 4803. [Google Scholar] [CrossRef] [PubMed]
  2. Gao, F.; Fang, W.; Sun, X.; Wu, Z.; Zhao, G.; Li, G.; Li, R.; Fu, L.; Zhang, Q. A Novel Apple Fruit Detection and Counting Methodology Based on Deep Learning and Trunk Tracking in Modern Orchard. Comput. Electron. Agric. 2022, 197, 107000. [Google Scholar]
  3. Tang, Y.; Qiu, J.; Zhang, Y.; Wu, D.; Cao, Y.; Zhao, K.; Zhu, L. Optimization Strategies of Fruit Detection to Overcome the Challenge of Unstructured Background in Field Orchard Environment: A Review. Precis. Agric. 2023, 24, 1183–1219. [Google Scholar]
  4. Zhang, W.; Wang, J.; Liu, Y.; Chen, K.; Li, H.; Duan, Y.; Wu, W.; Shi, Y.; Guo, W. Deep-Learning-Based In-Field Citrus Fruit Detection and Tracking. Hortic. Res. 2022, 9, uhac003. [Google Scholar]
  5. Mirhaji, H.; Soleymani, M.; Asakereh, A.; Mehdizadeh, S.A. Fruit Detection and Load Estimation of an Orange Orchard Using the YOLO Models Through Simple Approaches in Different Imaging and Illumination Conditions. Comput. Electron. Agric. 2021, 191, 106533. [Google Scholar]
  6. Tang, Y.; Zhou, H.; Wang, H.; Zhang, Y. Fruit Detection and Positioning Technology for a Camellia Oleifera C. Abel Orchard Based on Improved YOLOv4-Tiny Model and Binocular Stereo Vision. Expert Syst. Appl. 2023, 211, 118573. [Google Scholar]
  7. Mao, D.; Sun, H.; Li, X.; Yu, X.; Wu, J.; Zhang, Q. Real-Time Fruit Detection Using Deep Neural Networks on CPU (RTFD): An Edge AI Application. Comput. Electron. Agric. 2023, 204, 107517. [Google Scholar]
  8. Seth, K. Fruits and Vegetables Image Recognition Dataset. 2021. Available online: https://www.kaggle.com/datasets/kritikseth/fruit-and-vegetable-image-recognition (accessed on 1 January 2025).
  9. Muresan, H.; Oltean, M. Fruit recognition from images using deep learning. Acta Univ. Inform. 2023, 10, 26–42. [Google Scholar]
  10. Latif, G.; Mohammad, N.; Alghazo, J. DeepFruit: A dataset of fruit images for fruit classification and calories calculation. Data Brief 2023, 50, 109524. [Google Scholar]
  11. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  12. Wang, Z.; Li, C.; Xu, H.; Zhu, X. Mamba YOLO: SSMs-Based YOLO for Object Detection. arXiv 2024, arXiv:2406.05835. [Google Scholar]
  13. Lu, S.; Chen, W.; Zhang, X.; Karkee, M. Canopy-Attention-YOLOv4-Based Immature/Mature Apple Fruit Detection on Dense-Foliage Tree Architectures for Early Crop Load Estimation. Comput. Electron. Agric. 2022, 193, 106696. [Google Scholar] [CrossRef]
  14. Zhang, Y.; Zhang, W.; Yu, J.; He, L.; Chen, J.; He, Y. Complete and Accurate Holly Fruits Counting Using YOLOX Object Detection. Comput. Electron. Agric. 2022, 198, 107062. [Google Scholar] [CrossRef]
  15. Wang, Y.; Yan, G.; Meng, Q.; Yao, T.; Han, J.; Zhang, B. DSE-YOLO: Detail Semantics Enhancement YOLO for Multi-Stage Strawberry Detection. Comput. Electron. Agric. 2022, 198, 107057. [Google Scholar] [CrossRef]
  16. Bhargava, A.; Bansal, A.; Goyal, V. Machine Learning—Based Detection and Sorting of Multiple Vegetables and Fruits. Food Anal. Methods 2022, 15, 228–242. [Google Scholar] [CrossRef]
  17. Gupta, S.; Tripathi, A.K. Fruit and Vegetable Disease Detection and Classification: Recent Trends, Challenges, and Future Opportunities. Eng. Appl. Artif. Intell. 2024, 133, 108260. [Google Scholar] [CrossRef]
  18. López-García, F.; Andreu-García, G.; Blasco, J.; Aleixos, N.; Valiente, J.-M. Automatic Detection of Skin Defects in Citrus Fruits Using a Multivariate Image Analysis Approach. Comput. Electron. Agric. 2010, 71, 189–197. [Google Scholar] [CrossRef]
  19. Bulanon, D.M.; Kataoka, T. Fruit Detection System and an End Effector for Robotic Harvesting of Fuji Apples. Agric. Eng. Int. CIGR J. 2010, 12, 203–210. [Google Scholar]
  20. Sengupta, S.; Lee, W.S. Identification and Determination of the Number of Immature Green Citrus Fruit in a Canopy Under Different Ambient Light Conditions. Biosyst. Eng. 2014, 117, 51–61. [Google Scholar] [CrossRef]
  21. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  22. Jocher, G. YOLOv5 Release v7.0. GitHub Repository. 2022. Available online: https://github.com/ultralytics/yolov5/tree/v7.0 (accessed on 15 December 2024).
  23. Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Ke, Z.; Xu, X.; Chu, X. YOLOv6 v3.0: A Full-Scale Reloading. arXiv 2023, arXiv:2301.05586. [Google Scholar]
  24. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  25. Jocher, G. YOLOv8. GitHub Repository. 2023. Available online: https://github.com/ultralytics/ultralytics/tree/main (accessed on 15 December 2024). [Google Scholar]
  26. Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Wang, Y.; Han, K. Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism. Adv. Neural Inf. Process. Syst. 2024, 36, 51094–51112. [Google Scholar]
  27. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2025; pp. 1–21. [Google Scholar]
  28. Wang, D.; He, D. Channel Pruned YOLO v5s-Based Deep Learning Approach for Rapid and Accurate Apple Fruitlet Detection Before Fruit Thinning. Biosyst. Eng. 2021, 210, 271–281. [Google Scholar] [CrossRef]
  29. Chen, W.; Liu, M.; Zhao, C.; Li, X.; Wang, Y. MTD-YOLO: Multi-Task Deep Convolutional Neural Network for Cherry Tomato Fruit Bunch Maturity Detection. Comput. Electron. Agric. 2024, 216, 108533. [Google Scholar]
  30. Wang, J.; Liu, M.; Du, Y.; Zhao, M.; Jia, H.; Guo, Z.; Su, Y.; Lu, D.; Liu, Y. PG-YOLO: An Efficient Detection Algorithm for Pomegranate Before Fruit Thinning. Eng. Appl. Artif. Intell. 2024, 134, 108700. [Google Scholar]
  31. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  32. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
  33. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.; Shum, H.-Y. DINO: DETR with Improved Denoising Anchor Boxes for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations, Online, 25–29 April 2022. [Google Scholar]
  34. Sun, P.; Jiang, Y.; Xie, E.; Shao, W.; Yuan, Z.; Wang, C.; Luo, P. What Makes for End-to-End Object Detection? In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 9934–9944. [Google Scholar]
  35. Wang, J.; Song, L.; Li, Z.; Sun, H.; Sun, J.; Zheng, N. End-to-End Object Detection with Fully Convolutional Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15849–15858. [Google Scholar]
  36. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  37. Gu, Z.; Ma, X.; Guan, H.; Jiang, Q.; Deng, H.; Wen, B.; Zhu, T.; Wu, X. Tomato Fruit Detection and Phenotype Calculation Method Based on the Improved RTDETR Model. Comput. Electron. Agric. 2024, 227, 109524. [Google Scholar] [CrossRef]
  38. Huang, Z.; Zhang, X.; Wang, H.; Wei, H.; Zhang, Y.; Zhou, G. Pear Fruit Detection Model in Natural Environment Based on Lightweight Transformer Architecture. Agriculture 2024, 15, 24. [Google Scholar] [CrossRef]
  39. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual State Space Model. arXiv 2024, arXiv:2404.12345. [Google Scholar]
  40. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-Aware Reassembly of Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar]
  41. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  42. Gupta, A.; Dollar, P.; Girshick, R. LVIS: A Dataset for Large Vocabulary Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 5356–5364. Available online: https://www.kaggle.com/datasets/henningheyen/lvis-fruits-and-vegetables-dataset (accessed on 29 October 2024).
  43. James, J.A.; Manching, H.K.; Mattia, M.R.; Bowman, K.D.; Hulse-Kemp, A.M.; Beksi, W.J. CitDet: A Benchmark Dataset for Citrus Fruit Detection. IEEE Robot. Autom. Lett. 2024, 9, 10788–10795. [Google Scholar] [CrossRef]
  44. Xu, S.; Wang, X.; Lv, W.; Chang, Q.; Cui, C.; Deng, K.; Wang, G.; Dang, Q.; Wei, S.; Du, Y.; et al. PP-YOLOE: An Evolved Version of YOLO. arXiv 2022, arXiv:2203.16250. [Google Scholar]
  45. Jocher, G. YOLOv11. GitHub Repository. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 27 November 2024).
  46. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  47. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  48. Wang, Y.; Zhang, X.; Yang, T.; Sun, J. Anchor DETR: Query Design for Transformer-Based Detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 2567–2575. [Google Scholar]
  49. Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for Fast Training Convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3651–3660. [Google Scholar]
  50. Yao, Z.; Ai, J.; Li, B.; Zhang, C. Efficient DETR: Improving End-to-End Object Detector with Dense Prior. arXiv 2021, arXiv:2104.01318. [Google Scholar]
  51. Gao, P.; Zheng, M.; Wang, X.; Dai, J.; Li, H. Fast Convergence of DETR with Spatially Modulated Co-Attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3621–3630. [Google Scholar]
  52. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic Anchor Boxes Are Better Queries for DETR. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
  53. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR Training by Introducing Query Denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 13619–13627. [Google Scholar]
  54. Chen, Q.; Su, X.; Zhang, X.; Wang, J.; Chen, J.; Shen, Y.; Han, C.; Chen, Z.; Xu, W.; Li, F.; et al. LW-DETR: A transformer replacement to yolo for real-time detection. arXiv 2024, arXiv:2406.03459. [Google Scholar]
  55. Peng, Y.; Li, H.; Wu, P.; Zhang, Y.; Sun, X.; Wu, F. D-FINE: Redefine regression Task in DETRs as Fine-grained distribution refinement. arXiv 2024, arXiv:2410.13842. [Google Scholar]
  56. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  57. Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer. arXiv 2024, arXiv:2407.17140. [Google Scholar]
  58. Howard, A.G. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  59. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  60. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  61. Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B.; et al. MobileNetV4: Universal Models for the Mobile Ecosystem. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2025; pp. 78–96. [Google Scholar]
  62. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  63. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  64. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features from Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1580–1589. [Google Scholar]
  65. Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. Ghostnetv2: Enhance cheap operation with long-range attention. Adv. Neural Inf. Process. Syst. 2022, 35, 9969–9982. [Google Scholar]
  66. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  67. Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. Efficientvit: Memory efficient vision transformer with cascaded group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14420–14430. [Google Scholar]
  68. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  69. Cai, Y.; Zhou, Y.; Han, Q.; Sun, J.; Kong, X.; Li, J.; Zhang, X. Reversible column networks. arXiv 2022, arXiv:2212.11696. [Google Scholar]
  70. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16133–16142. [Google Scholar]
  71. Shi, D. Transnext: Robust foveal visual perception for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 17773–17783. [Google Scholar]
  72. Chen, H.; Wang, Y.; Guo, J.; Tao, D. Vanillanet: The power of minimalism in deep learning. Adv. Neural Inf. Process. Syst. 2024, 36, 7050–7064. [Google Scholar]
  73. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. Repvit: Revisiting mobile cnn from vit perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 15909–15920. [Google Scholar]
  74. Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134. [Google Scholar]
  75. Chen, J.; Kao, S.-H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, don’t walk: Chasing higher flops for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 12021–12031. [Google Scholar]
  76. Ding, X.; Zhang, Y.; Ge, Y.; Zhao, S.; Song, L.; Yue, X.; Shan, Y. Unireplknet: A universal perception large-kernel convnet for audio, video, point cloud, time-series, and image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 5513–5524. [Google Scholar]
  77. Li, Y.; Hu, J.; Wen, Y.; Evangelidis, G.; Salahi, K.; Wang, Y.; Tulyakov, S.; Ren, J. Rethinking vision transformers for mobilenet size and speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 16889–16900. [Google Scholar]
  78. Zhang, J.; Li, X.; Li, J.; Liu, L.; Xue, Z.; Zhang, B.; Jiang, Z.; Huang, T.; Wang, Y.; Wang, C. Rethinking mobile block for efficient attention-based models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; IEEE Computer Society: Washington, DC, USA, 2023; pp. 1389–1400. [Google Scholar]
  79. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
  80. Chen, Y.; Zhang, C.; Chen, B.; Huang, Y.; Sun, Y.; Wang, C.; Fu, X.; Dai, Y.; Qin, F.; Peng, Y.; et al. Accurate leukocyte detection based on deformable-DETR and multi-level feature fusion for aiding diagnosis of blood diseases. Comput. Biol. Med. 2024, 170, 107917. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overview of FVRT-DETR.
Figure 2. Detailed structure of Simple Stem [12].
Figure 3. Detailed structure of Vision Clue Merge [12].
Figure 4. Illustration of the ODSS block architecture [12]. (a) Detailed structure of the SS2D; (b) illustration of the local spatial block (LS block); (c) illustration of the residual gated block (RG block).
Figure 5. Details of our FV40 dataset.
Figure 6. Example images from our FV40 dataset.
Figure 7. Example detection results of FVRT-DETR on our custom FV40 dataset.
Table 1. Default Key Hyperparameters for FVRT-DETR training.

Hyperparameter | Value
Optimizer | AdamW
Base Learning Rate (All) | 0.0001
Weight Decay | 0.0001
Batch Size | 8
Epochs | 200
Warmup Epochs | 2000
Learning Rate Scheduler | Cosine decay
Warmup Momentum | 0.8
IoU Threshold | 0.7
Box Loss Weight | 7.5
Classification Loss Weight | 0.5
Distribution Focal Loss | 1.5
Max Detected Objects | 300
Class Loss Weight | 1.0
Nominal Batch Size | 64
Image Size | 640
Data Augmentation | Mosaic, Mixup
Mask Ratio | 4
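If these defaults are reused in a training script, it can help to collect them into a single configuration object that is passed to the training entry point. The sketch below is only a minimal illustration of that idea, not the authors' released code; the dictionary and its key names (e.g., FVRT_DETR_TRAIN_DEFAULTS) are hypothetical, and only the values are taken from Table 1.

```python
# Minimal sketch (not the authors' released code): the Table 1 training
# defaults collected into a plain Python dictionary. The key names are
# illustrative choices; only the values are taken from Table 1.
FVRT_DETR_TRAIN_DEFAULTS = {
    "optimizer": "AdamW",
    "base_learning_rate": 0.0001,        # applied to all parameter groups
    "weight_decay": 0.0001,
    "batch_size": 8,
    "nominal_batch_size": 64,
    "epochs": 200,
    "warmup_epochs": 2000,               # warmup value as listed in Table 1
    "lr_scheduler": "cosine_decay",
    "warmup_momentum": 0.8,
    "iou_threshold": 0.7,
    "box_loss_weight": 7.5,
    "classification_loss_weight": 0.5,
    "distribution_focal_loss_weight": 1.5,
    "class_loss_weight": 1.0,
    "max_detected_objects": 300,
    "image_size": 640,
    "data_augmentation": ["mosaic", "mixup"],
    "mask_ratio": 4,
}

if __name__ == "__main__":
    # Dump the configuration so it can be inspected before wiring it into a trainer.
    for name, value in FVRT_DETR_TRAIN_DEFAULTS.items():
        print(f"{name}: {value}")
```

Such a dictionary can then be serialized to YAML or JSON, or unpacked into whichever training framework is used when reproducing the setup.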
Table 2. Comparison with Other SOTA Methods on the FV40 Validation Dataset (7:3). Params, FLOPs, and FPS summarize efficiency; Precision, Recall, and the mAP columns report average accuracy on the validation set.

Model | Backbone | Params | FLOPs | FPS | Precision | Recall | mAP | mAP50 | mAP75 | mAP50:95 | mAP_S | mAP_M | mAP_L

Real-time Object Detectors
YOLOv5-L [22] | - | 46 M | 109 G | 54 | 76.7 | 73.6 | 63.9 | 81.1 | 68.2 | 62.4 | 53.7 | 65.4 | 71.7
YOLOv5-X [22] | - | 86 M | 205 G | 43 | 78.9 | 75.1 | 65.8 | 85.2 | 70.0 | 63.1 | 56.7 | 66.9 | 74.4
PPYOLOE-L [44] | - | 52 M | 110 G | 94 | 79.5 | 75.3 | 66.4 | 85.4 | 70.1 | 63.5 | 57.4 | 67.6 | 75.0
PPYOLOE-X [44] | - | 98 M | 206 G | 60 | 80.1 | 75.4 | 67.9 | 86.6 | 71.2 | 64.2 | 58.3 | 68.7 | 76.2
YOLOv6-L [23] | - | 59 M | 150 G | 99 | 80.4 | 75.9 | 68.3 | 87.0 | 71.5 | 64.8 | 58.9 | 69.7 | 77.1
YOLOv7-L [24] | - | 36 M | 104 G | 55 | 79.7 | 75.1 | 67.0 | 85.7 | 70.6 | 64.1 | 58.0 | 68.3 | 75.3
YOLOv7-X [24] | - | 71 M | 189 G | 45 | 81.2 | 77.3 | 67.6 | 87.0 | 71.9 | 65.2 | 59.0 | 69.6 | 76.2
YOLOv8-L [25] | - | 43 M | 165 G | 71 | 81.2 | 77.6 | 68.2 | 87.3 | 72.0 | 65.6 | 59.5 | 70.1 | 76.6
YOLOv8-X [25] | - | 68 M | 257 G | 50 | 81.7 | 78.7 | 69.5 | 88.0 | 73.1 | 66.8 | 60.3 | 70.8 | 77.3
YOLOv9-C [27] | - | 25 M | 102 G | 143 | 82.3 | 79.0 | 69.8 | 88.4 | 73.7 | 67.2 | 60.9 | 71.3 | 77.7
YOLOv9-E [27] | - | 57 M | 189 G | 60 | 82.8 | 79.6 | 70.4 | 89.1 | 73.7 | 67.0 | 61.4 | 72.0 | 78.0
Mamba YOLO-T [12] | Mamba-T | 5.8 M | 13.2 G | 161 | 81.4 | 77.4 | 68.3 | 87.5 | 71.7 | 65.9 | 59.6 | 70.3 | 77.0
Mamba YOLO-B [12] | Mamba-B | 19.1 M | 45.4 G | 161 | 82.5 | 77.9 | 69.0 | 88.0 | 72.3 | 66.2 | 60.0 | 71.8 | 78.3
Mamba YOLO-L [12] | Mamba-L | 57.6 M | 156.2 G | 161 | 83.6 | 78.9 | 70.3 | 88.8 | 73.5 | 67.3 | 61.2 | 72.0 | 79.1
YOLOv11-L [45] | - | 25 M | 87 G | 161 | 83.4 | 78.5 | 70.0 | 87.8 | 72.4 | 66.9 | 60.9 | 71.6 | 78.8
YOLOv11-X [45] | - | 57 M | 195 G | 89 | 84.3 | 79.0 | 71.0 | 89.3 | 73.9 | 67.7 | 61.7 | 72.8 | 79.7
YOLOv12-L [46] | - | 26 M | 89 G | 150 | 83.4 | 78.5 | 70.2 | 87.4 | 72.5 | 66.9 | 60.3 | 71.0 | 79.2
YOLOv12-X [46] | - | 59 M | 199 G | 80 | 83.7 | 79.1 | 71.1 | 89.0 | 73.7 | 67.3 | 61.9 | 72.5 | 79.6

End-to-end Object Detectors
DETR-DC5 [31] | R50 [47] | 41 M | 187 G | - | 70.9 | 67.8 | 57.8 | 76.3 | 62.9 | 58.4 | 50.3 | 60.9 | 66.9
DETR-DC5 [31] | R101 [47] | 60 M | 253 G | - | 72.2 | 69.0 | 58.4 | 76.5 | 63.5 | 59.6 | 52.0 | 61.4 | 67.7
Anchor-DETR-DC5 [48] | R50 [47] | 39 M | 172 G | - | 71.7 | 68.6 | 58.0 | 75.8 | 62.9 | 58.6 | 51.7 | 60.6 | 67.0
Anchor-DETR-DC5 [48] | R101 [47] | - | - | - | 72.5 | 69.8 | 59.1 | 76.4 | 63.7 | 59.4 | 52.9 | 61.7 | 67.8
Conditional-DETR-DC5 [49] | R50 [47] | 44 M | 195 G | - | 72.8 | 69.6 | 59.3 | 76.3 | 63.8 | 59.6 | 53.4 | 62.5 | 68.9
Conditional-DETR-DC5 [49] | R101 [47] | 63 M | 262 G | - | 73.7 | 70.4 | 60.0 | 77.1 | 64.3 | 60.4 | 53.6 | 63.6 | 69.4
Efficient-DETR [50] | R50 [47] | 35 M | 210 G | - | 72.7 | 69.5 | 59.7 | 76.8 | 64.0 | 60.5 | 53.6 | 62.9 | 69.2
Efficient-DETR [50] | R101 [47] | 54 M | 289 G | - | 73.4 | 70.1 | 59.9 | 77.2 | 64.5 | 61.3 | 54.0 | 63.2 | 69.5
SMCA-DETR [51] | R50 [47] | 40 M | 152 G | - | 73.2 | 70.0 | 59.5 | 77.0 | 64.0 | 60.7 | 54.5 | 63.0 | 70.0
SMCA-DETR [51] | R101 [47] | 58 M | 218 G | - | 73.5 | 71.0 | 60.1 | 77.8 | 64.6 | 61.3 | 55.0 | 63.6 | 70.4
Deformable-DETR [32] | R50 [47] | 40 M | 173 G | - | 74.0 | 71.3 | 60.6 | 78.3 | 64.8 | 61.4 | 55.4 | 64.0 | 70.5
DAB-Deformable-DETR [52] | R50 [47] | 48 M | 195 G | - | 74.9 | 72.5 | 61.4 | 78.9 | 65.3 | 63.0 | 56.0 | 64.6 | 71.6
DAB-Deformable-DETR++ [52] | R50 [47] | 47 M | - | - | 75.7 | 73.0 | 63.3 | 79.8 | 67.4 | 63.5 | 57.7 | 66.2 | 73.3
DN-Deformable-DETR [53] | R50 [47] | 48 M | 195 G | - | 75.9 | 73.4 | 63.5 | 80.3 | 67.7 | 63.8 | 58.3 | 66.8 | 73.5
DN-Deformable-DETR++ [53] | R50 [47] | 47 M | - | - | 77.3 | 74.6 | 64.7 | 81.8 | 68.9 | 64.4 | 60.0 | 67.9 | 75.2
DINO-Deformable-DETR [33] | R50 [47] | 47 M | 279 G | 5 | 78.8 | 76.3 | 66.2 | 83.5 | 70.2 | 66.0 | 61.7 | 69.7 | 77.3
LW-DETR-L [54] | - | 47 M | 72 G | 110 | 81.0 | 78.2 | 69.9 | 87.2 | 72.6 | 67.7 | 61.3 | 72.0 | 78.3
LW-DETR-X [54] | - | 118 M | 174 G | 50 | 82.2 | 79.3 | 71.3 | 89.0 | 73.4 | 68.3 | 62.0 | 72.7 | 79.9
D-FINE-L [55] | - | 31 M | 91 G | 127 | 81.6 | 78.4 | 70.4 | 88.1 | 72.7 | 67.7 | 60.4 | 72.4 | 79.1
D-FINE-X [55] | - | 62 M | 202 G | 80 | 81.7 | 80.2 | 71.2 | 89.1 | 74.0 | 68.4 | 61.9 | 71.7 | 80.1

Real-time End-to-end Object Detectors
YOLOv10-B [56] | - | 20.5 M | 98.7 G | 164 | 79.4 | 76.3 | 67.5 | 85.8 | 70.1 | 66.7 | 57.5 | 68.2 | 74.7
RT-DETR [36] | R34 [47] | 31.4 M | 90.6 G | 173 | 80.2 | 76.6 | 68.0 | 86.0 | 70.0 | 66.3 | 57.3 | 68.7 | 75.2
RT-DETRv2 [57] | R34 [47] | 36 M | 100 G | 145 | 80.9 | 77.3 | 68.4 | 86.9 | 70.5 | 67.0 | 57.6 | 68.9 | 75.1
FVRT-DETR-T | Mamba-T | 17.0 M | 91.5 G | 170 | 82.5 | 77.3 | 68.9 | 87.2 | 70.9 | 68.2 | 58.4 | 70.2 | 77.9
YOLOv10-L [56] | - | 25.8 M | 127.2 G | 137 | 81.6 | 77.5 | 68.2 | 87.9 | 72.0 | 68.9 | 59.3 | 70.5 | 77.2
RT-DETR [36] | R50 [47] | 42.8 M | 134.4 G | 108 | 81.9 | 78.2 | 68.2 | 87.8 | 72.2 | 69.0 | 59.0 | 71.3 | 77.6
RT-DETRv2 [57] | R50 [47] | 42 M | 136 G | 108 | 82.4 | 78.5 | 68.3 | 87.9 | 72.4 | 69.6 | 59.0 | 71.6 | 77.9
FVRT-DETR-B | Mamba-B | 27.1 M | 145.8 G | 95 | 83.6 | 78.3 | 70.0 | 88.3 | 72.1 | 70.0 | 59.2 | 71.1 | 78.8
YOLOv10-X [56] | - | 31.7 M | 171 G | 93 | 82.4 | 78.3 | 68.5 | 88.5 | 72.6 | 69.4 | 60.7 | 71.9 | 78.6
RT-DETR [36] | R101 [47] | 76.5 M | 257.3 G | 74 | 83.0 | 79.1 | 69.0 | 89.6 | 73.9 | 70.3 | 61.5 | 72.7 | 79.2
RT-DETRv2 [57] | R101 [47] | 76 M | 259 G | 74 | 83.1 | 79.0 | 68.7 | 89.8 | 74.0 | 70.7 | 61.5 | 72.5 | 79.3
FVRT-DETR-L | Mamba-L | 44.6 M | 270.1 G | 63 | 84.6 | 80.4 | 71.6 | 90.4 | 74.6 | 71.6 | 62.5 | 74.0 | 80.5
Table 3. Comparison with Other SOTA Methods on the FV40 Test Dataset (7:2:1). Params, FLOPs, and FPS summarize efficiency; Precision, Recall, and the mAP columns report average accuracy on the test set.

Model | Backbone | Params | FLOPs | FPS | Precision | Recall | mAP | mAP50 | mAP75 | mAP50:95 | mAP_S | mAP_M | mAP_L

Real-time Object Detectors
YOLOv11-L [45] | - | 25 M | 87 G | 161 | 80.5 | 74.9 | 67.2 | 87.3 | 70.9 | 65.8 | 60.7 | 71.4 | 77.6
YOLOv11-X [45] | - | 57 M | 195 G | 89 | 81.6 | 76.1 | 68.1 | 87.8 | 71.2 | 66.6 | 61.4 | 72.3 | 78.6
YOLOv12-L [46] | - | 26 M | 89 G | 150 | 80.3 | 74.6 | 67.7 | 87.6 | 71.3 | 65.9 | 60.6 | 71.2 | 78.0
YOLOv12-X [46] | - | 59 M | 199 G | 80 | 82.3 | 76.3 | 68.4 | 88.7 | 73.0 | 66.0 | 61.6 | 72.0 | 79.3

End-to-end Object Detectors
LW-DETR-L [54] | - | 47 M | 72 G | 110 | 78.6 | 76.0 | 67.7 | 84.9 | 70.3 | 65.2 | 58.8 | 69.5 | 75.8
LW-DETR-X [54] | - | 118 M | 174 G | 50 | 79.9 | 76.8 | 68.8 | 86.6 | 70.9 | 65.9 | 59.5 | 70.4 | 77.5
D-FINE-L [55] | - | 31 M | 91 G | 127 | 79.2 | 76.0 | 67.9 | 85.6 | 70.3 | 65.2 | 57.9 | 69.9 | 76.6
D-FINE-X [55] | - | 62 M | 202 G | 80 | 79.3 | 77.7 | 68.8 | 86.8 | 71.5 | 66.0 | 59.4 | 69.3 | 77.7

Real-time End-to-end Object Detectors
RT-DETR [36] | R34 [47] | 31.4 M | 90.6 G | 173 | 78.2 | 74.1 | 65.6 | 83.7 | 67.9 | 64.2 | 55.3 | 66.3 | 72.9
RT-DETRv2 [57] | R34 [47] | 36 M | 100 G | 145 | 78.5 | 75.3 | 65.9 | 84.5 | 68.4 | 64.9 | 55.5 | 66.7 | 72.8
FVRT-DETR-T | Mamba-T | 17.0 M | 91.5 G | 170 | 80.3 | 75.2 | 66.6 | 85.1 | 68.8 | 66.2 | 56.2 | 67.8 | 75.8
RT-DETR [36] | R50 [47] | 42.8 M | 134.4 G | 108 | 79.6 | 75.9 | 66.2 | 85.5 | 70.1 | 67.3 | 56.5 | 68.8 | 75.2
RT-DETRv2 [57] | R50 [47] | 42 M | 136 G | 108 | 80.2 | 76.5 | 66.7 | 85.7 | 70.3 | 67.4 | 57.6 | 69.1 | 75.8
FVRT-DETR-B | Mamba-B | 27.1 M | 145.8 G | 95 | 81.3 | 77.1 | 68.7 | 87.1 | 71.3 | 68.5 | 57.8 | 69.6 | 78.4
RT-DETR [36] | R101 [47] | 76.5 M | 257.3 G | 74 | 80.7 | 76.6 | 67.1 | 87.5 | 71.9 | 68.1 | 59.3 | 70.6 | 76.8
RT-DETRv2 [57] | R101 [47] | 76 M | 259 G | 74 | 80.9 | 76.9 | 66.4 | 87.7 | 71.6 | 68.7 | 59.4 | 70.1 | 77.2
FVRT-DETR-L | Mamba-L | 44.6 M | 270.1 G | 63 | 83.6 | 79.0 | 70.2 | 89.4 | 73.2 | 70.6 | 62.3 | 72.9 | 79.3
Table 4. Ablation experiment on the backbone. It should be noted that, except for the models with ResNet as the backbone, all other models in the table are based on the RT-DETR (R18) architecture with only the backbone modified. The model proposed in this paper also follows the RT-DETR (R18) structure, with the backbone replaced by the Mamba-T, Mamba-B, and Mamba-L variants introduced in this work. Params, FLOPs, and FPS summarize efficiency; Precision, Recall, and the mAP columns report average accuracy on the validation set.

Model | Backbone | Params | FLOPs | FPS | Precision | Recall | mAP | mAP50 | mAP75 | mAP_S | mAP_M | mAP_L

RT-DETR [36] (Baseline) | R18 [47] | 20.0 M | 60.0 G | 211 | 79.2 | 76.1 | 67.8 | 85.9 | 69.8 | 56.7 | 67.8 | 74.9
RT-DETR [36] | R34 [47] | 31.4 M | 90.6 G | 173 | 80.2 | 76.6 | 68.0 | 86.0 | 70.0 | 57.3 | 68.7 | 75.2
RT-DETR [36] | R50 [47] | 42.8 M | 134.4 G | 108 | 81.9 | 78.2 | 68.2 | 87.8 | 72.2 | 59.0 | 71.3 | 77.6
RT-DETR [36] | R101 [47] | 76.5 M | 257.3 G | 74 | 83.0 | 79.1 | 69.0 | 89.6 | 73.9 | 61.5 | 72.7 | 79.2
RT-DETR [36] | MobileNetV1 [58] | 12.3 M | 34.6 G | 256 | 76.3 | 72.9 | 65.3 | 82.6 | 68.3 | 56.9 | 67.2 | 73.9
RT-DETR [36] | MobileNetV2 [59] | 10.6 M | 28.9 G | 274 | 76.5 | 72.6 | 65.9 | 83.0 | 68.0 | 57.5 | 67.3 | 74.3
RT-DETR [36] | MobileNetV3 [60] | 11.9 M | 27.8 G | 302 | 76.6 | 72.5 | 65.5 | 83.3 | 67.5 | 57.9 | 67.8 | 74.5
RT-DETR [36] | MobileNetV4 [61] | 11.5 M | 40.6 G | 220 | 76.3 | 73.1 | 66.0 | 83.4 | 67.3 | 58.2 | 67.7 | 74.5
RT-DETR [36] | ShuffleNetV1 [62] | 16.8 M | 39.1 G | 210 | 75.9 | 72.0 | 66.3 | 83.9 | 68.2 | 58.0 | 66.5 | 73.7
RT-DETR [36] | ShuffleNetV2 [63] | 9.8 M | 26.6 G | 239 | 74.7 | 71.2 | 65.1 | 82.5 | 67.6 | 56.9 | 65.4 | 73.3
RT-DETR [36] | GhostnetV1 [64] | 11.8 M | 26.8 G | 225 | 74.9 | 71.0 | 65.8 | 82.7 | 67.9 | 57.3 | 65.8 | 73.0
RT-DETR [36] | GhostnetV2 [65] | 12.7 M | 27.3 G | 217 | 75.4 | 71.5 | 66.3 | 82.4 | 68.2 | 57.7 | 66.0 | 73.5
RT-DETR [36] | EfficientNetV1 [66] | 14.3 M | 24.5 G | 195 | 75.8 | 72.0 | 67.2 | 83.0 | 69.5 | 57.0 | 65.7 | 73.8
RT-DETR [36] | EfficientViT [67] | 11.1 M | 28.5 G | 174 | 74.7 | 72.9 | 67.0 | 83.4 | 70.1 | 56.6 | 65.0 | 73.3
RT-DETR [36] | SwinTransformer [68] | 37.1 M | 99.2 G | 140 | 80.5 | 77.2 | 68.5 | 86.0 | 70.1 | 57.6 | 69.0 | 75.6
RT-DETR [36] | RevColV1 [69] | 69.3 M | 173.1 G | 89 | 82.5 | 78.0 | 68.7 | 88.0 | 72.6 | 59.5 | 71.3 | 77.7
RT-DETR [36] | ConvNeXtV2 [70] | 12.7 M | 33.1 G | 243 | 75.8 | 73.1 | 66.2 | 83.0 | 67.5 | 58.8 | 68.2 | 74.4
RT-DETR [36] | TransNeXt [71] | 21.5 M | 65.7 G | 184 | 80.3 | 78.2 | 66.9 | 87.4 | 71.7 | 58.6 | 70.0 | 77.3
RT-DETR [36] | VanillaNet [72] | 28.1 M | 116.7 G | 105 | 81.2 | 79.0 | 66.5 | 87.5 | 71.7 | 58.2 | 70.4 | 77.9
RT-DETR [36] | RepViT [73] | 13.8 M | 38.3 G | 230 | 76.5 | 72.6 | 65.7 | 82.4 | 68.6 | 57.0 | 67.6 | 74.2
RT-DETR [36] | CSWinTransformer [74] | 32.3 M | 91.3 G | 113 | 79.8 | 77.0 | 68.1 | 85.5 | 70.2 | 57.7 | 68.8 | 75.3
RT-DETR [36] | FasterNet [75] | 22.1 M | 56.5 G | 213 | 80.0 | 77.3 | 67.2 | 86.3 | 70.5 | 57.4 | 68.7 | 76.5
RT-DETR [36] | UniRepLknet [76] | 13.3 M | 35.2 G | 227 | 76.0 | 72.3 | 65.5 | 82.6 | 68.2 | 57.0 | 67.9 | 74.0
RT-DETR [36] | EfficientFormerV2 [77] | 12.2 M | 30.6 G | 236 | 75.6 | 72.0 | 65.7 | 82.5 | 68.1 | 56.5 | 67.5 | 74.1
RT-DETR [36] | EMO [78] | 13.5 M | 28.6 G | 240 | 75.7 | 72.2 | 65.4 | 82.1 | 68.2 | 56.5 | 67.7 | 74.3
RT-DETR [36] | Mamba-T | 9.1 M | 17.2 G | 375 | 80.0 | 76.3 | 68.1 | 86.3 | 70.4 | 57.6 | 68.0 | 75.1
RT-DETR [36] | Mamba-B | 23.6 M | 48.4 G | 232 | 81.7 | 78.5 | 68.0 | 87.7 | 72.0 | 59.4 | 71.6 | 77.3
RT-DETR [36] | Mamba-L | 57.8 M | 147.4 G | 97 | 83.2 | 79.5 | 68.7 | 89.5 | 73.0 | 61.2 | 72.7 | 79.3
Table 5. Ablation experiment on the encoder. Params, FLOPs, and FPS summarize efficiency; Precision, Recall, and the mAP columns report average accuracy on the validation set.

Model | Encoder | Params | FLOPs | FPS | Precision | Recall | mAP | mAP50 | mAP75 | mAP_S | mAP_M | mAP_L

FVRT-DETR-T (Baseline) | PA-FPN [41] | 9.1 M | 17.2 G | 375 | 80.0 | 76.3 | 68.1 | 86.3 | 70.4 | 57.6 | 68.0 | 75.1
FVRT-DETR-B (Baseline) | PA-FPN [41] | 23.6 M | 48.4 G | 232 | 81.7 | 78.5 | 68.0 | 87.7 | 72.0 | 59.4 | 71.6 | 77.3
FVRT-DETR-L (Baseline) | PA-FPN [41] | 57.8 M | 147.4 G | 97 | 83.2 | 79.5 | 68.7 | 89.5 | 73.0 | 61.2 | 72.7 | 79.3
FVRT-DETR-T | Bi-FPN [79] | 9.9 M | 24.5 G | 367 | 80.1 | 76.8 | 68.4 | 86.1 | 70.3 | 57.0 | 68.2 | 75.5
FVRT-DETR-B | Bi-FPN [79] | 24.8 M | 54.5 G | 205 | 82.0 | 78.6 | 69.2 | 87.5 | 72.3 | 59.7 | 71.4 | 77.4
FVRT-DETR-L | Bi-FPN [79] | 58.4 M | 153.8 G | 90 | 83.5 | 79.7 | 68.1 | 89.8 | 72.9 | 60.5 | 72.6 | 79.3
FVRT-DETR-T | HS-FPN [80] | 8.0 M | 18.5 G | 362 | 79.9 | 76.8 | 68.0 | 86.5 | 70.1 | 56.9 | 68.4 | 75.1
FVRT-DETR-B | HS-FPN [80] | 22.8 M | 49.6 G | 227 | 81.5 | 77.9 | 68.4 | 86.8 | 71.5 | 59.1 | 71.2 | 76.8
FVRT-DETR-L | HS-FPN [80] | 56.5 M | 149.2 G | 86 | 83.0 | 79.0 | 68.7 | 88.7 | 73.0 | 60.7 | 72.4 | 78.9
FVRT-DETR-T | MDFF | 17.0 M | 91.5 G | 170 | 82.5 | 77.3 | 68.9 | 87.2 | 70.9 | 58.4 | 70.2 | 77.9
FVRT-DETR-B | MDFF | 27.1 M | 145.8 G | 95 | 83.6 | 78.3 | 70.0 | 88.3 | 72.1 | 59.2 | 71.1 | 78.8
FVRT-DETR-L | MDFF | 44.6 M | 270.1 G | 63 | 84.6 | 80.4 | 71.6 | 90.4 | 74.6 | 62.5 | 74.0 | 80.5
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
