Article

FDA-YOLO: A Feature Fusion and Attention-Based Network for Multiscale Tomato Maturity Detection in Real-World Agricultural Scenarios

1 School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
2 Jiangsu Provincial Key Laboratory of Internet of Things Intelligent Perception and Computing, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
3 College of Information Science and Technology, Nanjing Forestry University, Nanjing 210037, China
4 College of Artificial Intelligence, Nanjing Forestry University, Nanjing 210037, China
* Author to whom correspondence should be addressed.
Sensors 2026, 26(11), 3404; https://doi.org/10.3390/s26113404
Submission received: 12 April 2026 / Revised: 24 May 2026 / Accepted: 25 May 2026 / Published: 27 May 2026
(This article belongs to the Section Smart Agriculture)

Abstract

Fruit detection and maturity recognition are crucial for intelligent tomato harvesting and management. However, in complex field environments, challenges such as the similarity in color between fruits and leaves, cluttered backgrounds, and severe occlusions significantly hinder accurate tomato detection. To address these issues, this paper proposes a lightweight tomato maturity detection model, termed FDA-YOLO. Building upon the YOLOv11 framework, the proposed model enhances global perception in complex scenarios by introducing a multiscale feature enhancement module. In addition, a foreground–background dual-path attention mechanism is designed to better distinguish fruits from the background, thereby improving detection robustness. Furthermore, a lightweight asymmetric detection head is constructed to reduce computational cost while maintaining high accuracy. These improvements enable the model to achieve more efficient and accurate tomato maturity detection under complex conditions. Extensive experiments are conducted on the LaboroTomato dataset. The results demonstrate that FDA-YOLO achieves the best performance with relatively low computational overhead, reaching 83.4% and 67.5% in mAP50 and mAP50–95, respectively, while also attaining a near-optimal F1 score. Overall, the proposed model achieves an excellent balance between accuracy and efficiency, providing an effective solution for intelligent agricultural monitoring and automated harvesting systems.

1. Introduction

Owing to their high nutritional value and distinctive flavor, tomatoes are among the most widely cultivated crops worldwide [1,2] and represent important sources of income and export revenue in many countries [3,4]. However, tomato harvesting still largely relies on manual labor and faces challenges, such as an aging agricultural workforce, labor shortages, and rising labor costs [5]. In addition, manual harvesting is labor-intensive, time-limited, and inefficient, making it difficult to meet the demands of large-scale modern agricultural production [6].
With the rapid development of smart agriculture [7,8], mechanical harvesting has become an important research focus [9]. Accurate fruit detection and maturity assessment are essential prerequisites for automated harvesting and directly affect harvesting efficiency, transportation and storage strategies, and final fruit quality and market value. Therefore, developing robust computer vision and deep learning methods for accurate real-time maturity detection is crucial for promoting intelligent tomato harvesting systems. Early approaches relied mainly on color thresholding [10], texture analysis [11], and morphological operations [12] but performed poorly under varying illumination and complex backgrounds. With the advancement of deep learning, convolutional neural networks (CNNs) have shown stronger adaptability in field environments because they learn hierarchical features in an end-to-end manner. As a result, deep learning-based detectors, such as Faster R-CNN [13], SSD [14], and YOLO [15], have become mainstream methods for tomato maturity recognition.
Despite recent progress, tomato maturity detection still faces several challenges. First, significant variations in fruit scale and distance, combined with complex illumination conditions, make it difficult to capture both the fine-grained textures of nearby fruits and the global shape information of distant fruits. Second, the similarity between fruit and leaf colors, together with occlusions and dense distributions, causes severe feature ambiguity, often leading to false detections. Finally, some detection heads suffer from structural redundancy and high computational cost, limiting real-time performance. Although some methods decouple classification and localization tasks, they still suffer from imbalanced feature utilization and excessive computational overhead, making deployment on resource-constrained devices difficult. Therefore, balancing detection accuracy, inference speed, and computational complexity remains a key challenge.
Existing YOLO-based improvements mainly introduce channel/spatial attention mechanisms and multiscale feature fusion strategies to enhance feature representation. However, most of these approaches rely on implicit feature reweighting, lack explicit foreground–background modeling, and rarely consider the joint optimization of feature modeling and detection structures. In complex agricultural environments, occlusion and background interference also remain insufficiently addressed, and some methods improve accuracy at the cost of higher computational overhead. Compared with recent attention-based YOLO variants, the fundamental novelty of our method is that the proposed DCFA module achieves explicit foreground–background dual-path decoupling and contrastive modeling, representing a paradigm shift from “implicit enhancement” to “explicit decoupling”.
To address these challenges, this study proposes FDA-YOLO, a lightweight and high-precision tomato maturity detection model based on YOLOv11 [16]. Unlike previous methods that focus solely on lightweight design or accuracy improvement, our approach aims to achieve high detection accuracy with low computational cost. The framework consists of three key components: focal spatial modulation (FSM), dual-contrast feature aggregation (DCFA), and an asymmetric depthwise separable detection head (ADSDH). The main contributions of this work are summarized as follows:
(1) A lightweight and high-precision tomato maturity detection model, FDA-YOLO. The proposed model is built upon YOLOv11 and is tailored to address challenges in tomato detection, such as scale variations and feature entanglement. It incorporates a more efficient feature modeling and detection mechanism, achieving a favorable balance between detection accuracy and computational efficiency. Consequently, the model is not only suitable for complex field environments but also deployable on resource-constrained embedded devices.
(2) Introduction of focal spatial modulation (FSM). To address the challenges related to scale variation, illumination differences, and the complex backgrounds of tomato fruits, we design an FSM module to enhance the model’s multiscale feature representation capability. FSM employs a hierarchical contextual focusing mechanism to modulate feature responses across different regions, effectively integrating local details with global semantic information. This approach enhances the overall feature representation of target fruits, enabling more accurate global perception in complex field environments.
(3) Design of the dual-contrast feature aggregation (DCFA) module. The DCFA module is proposed to address local feature confusion caused by color similarity between fruits and leaves, occlusions, and dense fruit distributions in agricultural scenarios. It leverages a foreground–background dual-path attention mechanism, combined with frequency-domain decomposition, to enhance salient features and complement information. By focusing on fine-grained local features and foreground–background separation, DCFA significantly improves the model’s discriminative ability and robustness for identifying tomatoes at different maturity stages.
(4) Development of the asymmetric depthwise separable detection head (ADSDH). To overcome the issues of computational redundancy and uneven feature utilization observed in existing detection heads during multiscale feature fusion of high-resolution images, we design an ADSDH module based on the YOLOv11 detection head. This module decouples classification and localization tasks via an asymmetric branch design and replaces standard convolutions with depthwise separable convolutions. Such a design preserves detection accuracy while significantly reducing parameters and computational complexity, thereby enhancing real-time inference performance and efficiency on embedded platforms.

2. Related Work

2.1. Fruit Maturity Detection Based on Machine Learning Methods

Traditional detection approaches rely primarily on manually designed image features, such as color, size, shape, and texture, which are subsequently used by machine learning algorithms for fruit recognition and classification. The commonly employed algorithms for maturity detection include random forest [17], support vector machine (SVM) [18], and K-means clustering [19].
Zhou et al. [20] proposed a method for determining the maturity of field-grown red grapes based on an improved circular Hough transform, thereby providing a reference for harvest timing and automated picking. Wang [21] employed an HSI color model that better aligns with human visual perception, combined with findContours and the equivalent diameter method, to classify tomato maturity and size. Huang et al. [22] utilized SVM to classify tomato maturity and achieved satisfactory recognition performance. Liu et al. [23] extracted histogram of oriented gradient (HOG) features to train an SVM classifier, thus enabling the preliminary detection of tomatoes at different maturity stages.
Although these traditional methods can achieve relatively good detection performance under controlled conditions, they are highly sensitive to factors such as illumination variation, fruit occlusion, and background complexity, making them less adaptable to the diverse conditions of natural field environments. Moreover, handcrafted features are insufficient to fully exploit the high-dimensional information in tomato images, including subtle local textures, nonlinear characteristics, and multiscale contextual relationships, which limits detection accuracy. Therefore, the practical applicability and generalization ability of these methods are constrained.

2.2. Deep Learning-Based Fruit Maturity Detection

With the development of deep learning, convolutional neural network (CNN)-based object detection methods have gradually replaced traditional algorithms that rely on handcrafted features and have become the mainstream approach for tomato maturity recognition. These models can automatically extract multilevel features through end-to-end learning, demonstrating stronger robustness and generalization ability. According to the detection pipeline, such methods are generally categorized into two-stage detection algorithms [24] and one-stage detection algorithms [25].
In two-stage detection algorithms, methods such as R-CNN [26], Fast R-CNN [27], and Faster R-CNN [13] first generate candidate regions and then perform classification and bounding box regression. Owing to their high detection accuracy and robustness, these algorithms have been widely applied in agricultural vision tasks. Rong et al. [28] optimized a Faster R-CNN VGG16-based model for tomato maturity grading by classifying tomatoes into unripe, half-ripe, and ripe categories, significantly improving the discrimination accuracy. Long et al. [29] modified Mask R-CNN to distinguish green, half-ripe, and fully ripe tomatoes and achieved favorable results. Yue et al. [30] employed an improved Cascade R-CNN network to classify green fruits and tomatoes at different maturity stages, enhancing the discrimination ability. However, this method relies on Cascade R-CNN with ResNet-101 as the backbone, resulting in high computational complexity and resource consumption during training and inference. In addition, it still has limitations in fine-grained discrimination between turning-stage and fully ripe tomatoes. Overall, although two-stage detectors provide high accuracy and robustness, their computational overhead is usually large. For example, CSPResNeXt-50 [31] and ResNet-101 [32] contain 20.50 M and 44.55 M parameters, respectively, which limits their application in real-time harvesting scenarios.
Compared with two-stage frameworks, one-stage object detection algorithms adopt a more streamlined architecture by integrating feature extraction, classification, and localization into a single network. These methods directly output object categories and bounding box locations in one forward pass, eliminating the need for candidate region generation. Representative algorithms include the YOLO series and SSD, which achieve a better balance between detection accuracy and real-time performance. Owing to their fast processing capability for high-resolution agricultural images, one-stage algorithms have been extensively studied in terms of fruit recognition and maturity detection. Su et al. [33] proposed the SE-YOLOv3-MobileNetV1 network, which significantly improved the accuracy of tomato maturity classification. Lü et al. [34] modified YOLOv4 to differentiate tomatoes at various maturity stages. Chen et al. [35] conducted tomato maturity detection based on YOLOv5s, although missed detections still occurred in dense or heavily occluded scenarios. Han et al. [36] developed a lightweight YOLOv5-Lite model for papaya maturity detection, maintaining robustness under varying lighting and occlusion conditions. Zhang et al. [37] utilized YOLOv5-GAP to detect green grape clusters in dense and shaded environments. Peng et al. [38] proposed MFEFF-SSD for high-precision detection of small lychee targets in UAV images. Zhou et al. [39] combined YOLOv7 with traditional image processing methods to improve localization accuracy, although memory consumption increased. Wei et al. [40] developed GFS-YOLO11, which significantly improved multi-variety tomato recognition accuracy. Overall, compared with two-stage methods, one-stage detectors achieve a better balance between accuracy and speed. However, they remain sensitive to background interference and still have limited ability in terms of multiscale feature representation, which remains a major challenge.
In recent years, Transformer [41]-based self-attention mechanisms have demonstrated advantages in global feature interaction and object relationship modeling, making Transformer-based detection an emerging research direction in one-stage object detection. However, these methods still face challenges, such as slow training convergence, high memory consumption, and limited small-object detection capability, which restrict their practical deployment in agricultural scenarios.
Overall, deep learning-based fruit maturity detection methods demonstrate significant advantages in terms of automatic feature extraction and detection accuracy. Nevertheless, their complex structures, high computational overhead, and insufficient multiscale contextual modeling still limit their adaptability in complex field environments and lightweight deployment. To address these limitations, this study introduces multiscale feature modeling, feature modulation mechanisms, and a lightweight detection head within a one-stage framework to improve detection accuracy while maintaining computational efficiency.

2.3. Attention Mechanisms in YOLO-Based Detection

Attention mechanisms have been widely adopted to improve YOLO-based object detection models. For example, CBAM [42] recalibrates feature representations along both channel and spatial dimensions, and CA [43] incorporates coordinate information to enhance positional awareness. However, these methods are implicit feature enhancement approaches and do not explicitly distinguish between foreground and background information.
Some studies [21,44,45] have attempted to introduce foreground- and background-related attention mechanisms. For example, SEAM in YOLO-FaceV2 compensates for occluded regions by enhancing feature responses in unobstructed areas; the foreground dual attention in FCFPN captures foreground features from both channel and spatial dimensions. However, these methods essentially focus only on the foreground, lacking explicit modeling of background features, let alone explicit separation and contrastive interaction between foreground and background. Consequently, their effectiveness remains limited when addressing complex foreground–background confusion in tomato maturity detection.
To address this issue, our proposed DCFA module is designed with a foreground–background dual-path structure, enabling explicit feature separation and contrastive modeling and thereby more effectively mitigating this problem.

3. Methods

3.1. Overall Framework

This section provides a detailed description of the proposed tomato maturity detection model, FDA-YOLO, whose overall architecture is shown in Figure 1. The model is built upon the YOLOv11 framework released by Ultralytics on 30 September 2024 (YOLO Vision 2024) and specifically adopts the YOLOv11n (nano) version. It contains 238 layers, approximately 2.58 M parameters, and 6.3 GFLOPs. All hyperparameters (e.g., learning rate of 0.01, momentum of 0.937, and weight decay of 0.0005) are kept consistent with the official Ultralytics benchmark settings to ensure fairness and reproducibility.
The choice of YOLOv11 as the baseline instead of the more established YOLOv8 is motivated by several considerations. Compared with YOLOv8, YOLOv11 introduces the C2PSA module, which incorporates spatial attention mechanisms to enhance the focus on critical regions, making it particularly effective for small and occluded object detection. In addition, the C3k2 module improves feature representation while reducing computational complexity. On the COCO dataset, YOLOv11m achieves a higher mAP than YOLOv8m does with approximately 22% fewer parameters. Furthermore, YOLOv11 has been applied in various domains, including underwater object detection [46,47], traffic detection [48], and metal defect detection [49], thus demonstrating its effectiveness and applicability.
FDA-YOLO consists of three main components: the backbone, neck, and head. Input tomato images are first fed into the backbone to extract multilevel semantic features, capturing both local texture details and global structural information. The neck then employs a multiscale feature pyramid structure (FPN + PAN) to fuse high-level and low-level features, enhancing the perception of fruits at different scales. Finally, the detection head outputs classification probabilities and bounding box locations for each fruit, enabling end-to-end maturity detection.
In the original YOLOv11 backbone, the spatial pyramid pooling fast (SPPF) module is used to enlarge the receptive field and enhance global feature modeling, achieving satisfactory performance in standard scenarios. However, in complex field environments, tomato fruits exhibit significant variations in scale, distance, and lighting conditions. Single-scale feature modeling struggles to capture both the fine details of nearby fruits and the global structure of distant fruits, limiting multiscale detection performance. To address this issue, the proposed method replaces SPPF with the FSM module to enhance multiscale feature representation.
Furthermore, owing to the high similarity in color and texture between tomato fruits and leaves, foreground and background features are easily confused in complex environments, and traditional attention mechanisms remain insufficient for distinguishing key regions. To address this issue, we place the DCFA module in the backbone, which leverages feature modulation and cross-scale information interaction to enhance foreground representation while suppressing redundant background interference, thereby improving robustness in occluded and dense scenarios.
With respect to the detection head, the original YOLOv11 structure still suffers from parameter redundancy and computational overhead under high-resolution inputs. Therefore, the proposed ADSDH module employs depthwise separable convolutions to reduce the number of parameters and computational cost. In addition, it decouples classification and regression tasks through an asymmetric structure, ensuring detection accuracy while maintaining lightweight efficiency.
In summary, FDA-YOLO inherits the efficiency of YOLOv11 while introducing targeted improvements in multiscale feature modeling, foreground–background discrimination, and lightweight detection head design, achieving a good balance between detection accuracy and computational cost. The following sections provide detailed descriptions of the FSM, DCFA, and ADSDH modules.

3.2. Focal Spatial Modulation

This section introduces the focal spatial modulation (FSM) module. This module is designed to enhance the representation capability of tomato fruits at different distances and scales while maintaining feature stability under complex illumination conditions. In tomato maturity detection, fruits exhibit significant scale variations: representations of nearby fruits contain rich texture details and subtle color gradients, whereas the detection of distant fruits relies mainly on shape and global color information. To address this issue, FSM replaces the SPPF module in YOLOv11. Although SPPF enlarges the receptive field through serial pooling, its feature fusion strategy is fixed, and it has a limited ability to model local details and channel dependencies. As a result, it struggles under complex lighting and background interference. The core idea of FSM is to achieve adaptive multiscale contextual fusion through a hierarchical convolution-gated aggregation mechanism, thus balancing global modeling ability and computational efficiency.
The workflow of FSM is shown in Figure 2. The module first applies linear mapping to generate a query feature. Then, multi-branch structures extract multiscale contextual information from different receptive fields. A gating mechanism is used to adaptively weight features from each scale. Finally, the fused contextual feature is combined with the original feature via elementwise multiplication, enhancing important regions while suppressing redundant information.
Specifically, let the input feature be $X \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ denote the height, width, and channel number, respectively.
First, a linear projection is applied to generate the query feature $q$, the initial contextual feature $Z_0$, and the gating weights,
$$[q, Z_0, \mathrm{gate}] = \mathrm{Split}(\mathrm{Linear}(X)),$$
where Split divides the feature along the channel dimension. $q \in \mathbb{R}^{H \times W \times C}$ is the query feature, $Z_0 \in \mathbb{R}^{H \times W \times C}$ is the initial context feature, and $\mathrm{gate} \in \mathbb{R}^{H \times W \times 4}$ controls the contribution of each scale.
Next, three stacked depthwise convolution (DWConv) layers are used to extract hierarchical contextual features $Z_1$, $Z_2$, and $Z_3$. Each layer applies a depthwise convolution with a kernel size $k_l$, followed by SiLU activation, and then SAAttention (SA) to refine important regions,
$$Z_l = \mathrm{SA}\big(\mathrm{SiLU}(\mathrm{DWConv}_{k_l}(Z_{l-1}))\big), \quad l = 1, \ldots, 3,$$
where $Z_{l-1}$ denotes the output of the previous layer and $k_l$ represents the kernel size of the $l$-th convolution, covering receptive fields ranging from the local texture to the global structure. Small kernels focus on capturing fine-grained texture and local color variations of nearby fruits, whereas large kernels aggregate global semantics of distant fruits and background regions, thereby achieving a balance between “seeing clearly” and “seeing far”.
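As a rough worked illustration of how stacking depthwise convolutions widens the context seen by each level, the sketch below computes the cumulative receptive field after each layer, assuming stride 1 and no dilation. The kernel sizes (3, 5, 7) are hypothetical values chosen only for illustration; the text does not state the actual values of $k_l$.

```python
def cumulative_receptive_fields(kernel_sizes):
    """Receptive field after each stacked stride-1, dilation-1 convolution."""
    fields = []
    rf = 1  # before any convolution, a pixel only sees itself
    for k in kernel_sizes:
        rf += k - 1  # a k x k layer extends the field by k - 1 pixels
        fields.append(rf)
    return fields


# Hypothetical kernel sizes; the actual k_l values are not given in the text.
example_fields = cumulative_receptive_fields([3, 5, 7])
print(example_fields)
```

Under these assumed kernel sizes, the first level covers a small local window while the last level covers a noticeably wider one, which is the local-to-global progression described above.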
After hierarchical contextual extraction, the features from different scales are adaptively aggregated via a gating mechanism. Then, a weighted sum is calculated through elementwise multiplication to produce a feature map $Z_{\mathrm{out}}$ of the same size as the input $X$,
$$Z_{\mathrm{out}} = \sum_{l=1}^{L} \mathrm{gate}_l \odot Z_l + \mathrm{gate}_{L+1} \odot \mathrm{AvgPool}(Z_L),$$
where $\odot$ denotes elementwise multiplication and $L = 3$ is the number of contextual levels. The second term corresponds to the global context path, aggregating the overall brightness and color distributions, which helps to mitigate feature shifts caused by varying illumination conditions. Although the gates corresponding to each scale are generated from the same linear mapping, they remain independent along the channel dimension and can adaptively modulate the response strength of different receptive field contexts based on the content of the input tomato image. This ability allows the effective selection and fusion of multiscale fruit information under complex field conditions. The gating aggregation mechanism enables the model to adaptively choose more reliable contextual features across scales, thereby enhancing the consistency of representations for fruits at different distances.
Finally, the aggregated contextual feature $Z_{\mathrm{out}}$ is passed through a linear mapping layer implemented with a $1 \times 1$ convolution and transformed into a modulation weight tensor. This tensor is then multiplied elementwise with the query feature $q$ to perform adaptive feature modulation,
$$Y = q \odot \mathrm{Linear}(Z_{\mathrm{out}}),$$
and the resulting output $Y$ serves as the final enhanced feature of the FSM module and can be directly fed into subsequent network processing.
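The gated aggregation and modulation steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: feature maps are flattened to lists of positions by channels, the context features are taken as given rather than computed by depthwise convolutions, and the helper names (`global_average`, `linear`, `fsm_modulate`) are hypothetical.

```python
def global_average(feature):
    """Average a flattened (positions x channels) feature over all positions."""
    n = len(feature)
    channels = len(feature[0])
    return [sum(pos[c] for pos in feature) / n for c in range(channels)]


def linear(vector, weight):
    """Channel mixing at one position, i.e. a 1 x 1 convolution (weight is C_out x C_in)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def fsm_modulate(q, contexts, gates, weight):
    """Gated multiscale aggregation followed by elementwise modulation of q.

    q:        positions x C query feature
    contexts: list of L context features, each positions x C
    gates:    positions x (L + 1) gate values (last one gates the global path)
    weight:   C x C matrix standing in for the final 1 x 1 convolution
    """
    pooled = global_average(contexts[-1])  # global context from the last level
    output = []
    for p, q_p in enumerate(q):
        channels = len(q_p)
        z_out = [0.0] * channels
        for l, z in enumerate(contexts):
            for c in range(channels):
                z_out[c] += gates[p][l] * z[p][c]  # gated sum over levels
        for c in range(channels):
            z_out[c] += gates[p][-1] * pooled[c]  # gated global context
        modulator = linear(z_out, weight)
        output.append([q_c * m_c for q_c, m_c in zip(q_p, modulator)])
    return output
```

With two positions, two channels, and an identity weight, a gate vector of `[1, 0, 0, 0]` makes a position rely only on the finest level, whereas `[0, 0, 0, 1]` makes it rely only on the globally pooled context.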

3.3. DCFA Module

This section introduces the proposed dual-contrast feature aggregation (DCFA) module, which is designed to enhance fine-grained fruit feature representation and improve foreground–background discrimination in complex field environments. In real-world scenarios, the strong similarity between fruit and leaf colors, occlusion, and dense fruit distribution often lead to feature confusion. To address this issue, DCFA performs frequency-domain decomposition and constructs a foreground–background dual-path attention mechanism to enhance the salient regions and compensate for missing information. Unlike FSM, which focuses on global feature modeling, DCFA emphasizes local fine-grained representations, thereby improving discrimination across different tomato maturity stages and enhancing robustness against background interference.
The architecture of DCFA is shown in Figure 3. The processing pipeline consists of four stages: feature preprocessing with Haar wavelet decomposition, local unfolding, foreground–background dual-path attention modeling, and feature aggregation.
The first stage is feature preprocessing and Haar wavelet decomposition. Given the input feature map $X \in \mathbb{R}^{H \times W \times C}$, two convolutional blocks are first applied to enhance the semantic representation,
$$X' = \mathrm{CBS}(\mathrm{CBS}(X)),$$
and the resulting feature $X'$ is then decomposed using a Haar wavelet transform. Based on the discrete wavelet transform (DWT) [50], a single-level decomposition with four $2 \times 2$ Haar filters is applied, producing one low-frequency component, $a$, and three high-frequency components, $h$, $v$, and $d$. Here, $a$ captures the global structure, smooth variation, and color distribution, while $h$, $v$, and $d$ represent the horizontal, vertical, and diagonal texture details, respectively. In natural scenes, fruit regions contain richer high-frequency responses, while background regions are dominated by low-frequency information. Based on these properties, high-frequency components are grouped as foreground features ($fg$), and low-frequency components are treated as background features ($bg$), forming a dual-branch representation:
$$(bg, fg) = \mathrm{HaarWaveletConv}(X')$$
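To make the single-level Haar decomposition concrete, the sketch below applies the four $2 \times 2$ Haar filters to one channel stored as a list of rows. It is an illustration only: the orthonormal scaling by 1/2, the sign convention, and the assignment of the names $h$ and $v$ to the row and column differences are assumptions, and the function names are hypothetical rather than the HaarWaveletConv layer itself.

```python
def haar_dwt2(grid):
    """Single-level 2-D Haar transform of an even-sized grid (list of rows).

    Returns (a, h, v, d), each at half the input resolution.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must both be even")
    a, h, v, d = [], [], [], []
    for i in range(0, rows, 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for j in range(0, cols, 2):
            p, q = grid[i][j], grid[i][j + 1]          # top-left, top-right
            r, s = grid[i + 1][j], grid[i + 1][j + 1]  # bottom-left, bottom-right
            a_row.append((p + q + r + s) / 2)  # low-pass in both directions
            h_row.append((p + q - r - s) / 2)  # top minus bottom (horizontal edges)
            v_row.append((p - q + r - s) / 2)  # left minus right (vertical edges)
            d_row.append((p - q - r + s) / 2)  # diagonal difference
        a.append(a_row)
        h.append(h_row)
        v.append(v_row)
        d.append(d_row)
    return a, h, v, d


def split_background_foreground(grid):
    """Group the low-frequency band as background and the three detail bands as foreground."""
    a, h, v, d = haar_dwt2(grid)
    return a, (h, v, d)
```

A flat patch produces only a low-frequency response, while a patch containing an edge also produces a nonzero detail band, which matches the intuition that textured fruit regions carry more high-frequency energy than smooth background.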
The next stage is local unfolding. To model local relationships, the value feature $V$ is first computed by applying a linear transformation with weight $W_V$ to the preprocessed feature $X'$,
$$V = W_V(X'),$$
where $W_V \in \mathbb{R}^{C \times C}$ and $V \in \mathbb{R}^{H \times W \times C}$. Subsequently, for each spatial position, its local neighborhood is expanded. Specifically, an Unfold operation is applied to extract a $K \times K$ neighborhood centered at each position $p$ and reshape it into a matrix $V_{\Delta(p)}$, where $V_{\Delta(p)} \in \mathbb{R}^{K^2 \times C}$. This matrix represents all pixel values within the local region centered at position $p$, which will be used for subsequent attention weighting.
Following local unfolding, foreground–background dual-path attention modeling is performed. The foreground and background feature maps are first processed by local average pooling and then passed through two independent linear layers, $W_{fg}$ and $W_{bg}$, to generate the foreground attention $A_{fg}$ and background attention $A_{bg}$, respectively,
$$A_{fg} = W_{fg}(\mathrm{Pool}(fg)), \qquad A_{bg} = W_{bg}(\mathrm{Pool}(bg)),$$
where $\mathrm{Pool}(\cdot)$ denotes the local average pooling operation, $W_{fg}, W_{bg} \in \mathbb{R}^{C \times K^4}$, and $A_{fg}, A_{bg} \in \mathbb{R}^{H \times W \times K^4}$. The attention maps $A_{fg}$ and $A_{bg}$ are then reshaped along the last dimension, converting the local attention at each spatial position into vector form. A softmax function is subsequently applied to obtain normalized attention weights, denoted as $\mathrm{Softmax}(A_{fg})$ and $\mathrm{Softmax}(A_{bg})$.
Finally, feature aggregation is performed. For each spatial position p, the local features are subjected to a two-stage foreground–background weighting process. First, foreground attention is applied to emphasize the key local relationships in target regions, where the local features at each position are multiplied by foreground attention to obtain foreground-enhanced features. Then, these foreground-weighted features are expanded into neighborhood matrices via the Unfold operation and further multiplied by the background attention to incorporate structural information and enhance the understanding of complete semantics,
$$\hat{V}_{\Delta(p)} = \mathrm{Softmax}(A_{bg})(p) \otimes \mathrm{Softmax}(A_{fg})(p) \otimes V_{\Delta(p)},$$
where $\otimes$ denotes matrix multiplication. After aggregation over all positions, a Fold operation reconstructs the feature map back to $\mathbb{R}^{H \times W \times C}$, followed by two CBS layers to produce the final output $\tilde{X}$.
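The two-stage weighting at a single position can be sketched as follows. This is a simplified illustration under several assumptions: the $K^4$ attention values at a position are taken to be reshaped into a $K^2 \times K^2$ matrix, softmax is applied row-wise, both attention matrices act on the same unfolded patch, and the Unfold/Fold steps are omitted. The function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(left, right):
    """Plain matrix product of two lists of rows."""
    inner, width = len(right), len(right[0])
    return [
        [sum(row[k] * right[k][j] for k in range(inner)) for j in range(width)]
        for row in left
    ]


def dual_path_aggregate(v_patch, fg_logits, bg_logits):
    """Apply foreground attention, then background attention, to one local patch.

    v_patch:   K^2 x C values of the neighborhood around one position
    fg_logits: K^2 x K^2 foreground attention logits (assumed reshape of K^4 values)
    bg_logits: K^2 x K^2 background attention logits (same assumption)
    """
    fg_weights = [softmax(row) for row in fg_logits]
    bg_weights = [softmax(row) for row in bg_logits]
    fg_enhanced = matmul(fg_weights, v_patch)  # foreground-weighted features
    return matmul(bg_weights, fg_enhanced)     # then background weighting
```

Because each softmax row sums to one, every output row is a convex combination of the patch values, so the aggregation redistributes information within the neighborhood without changing its overall scale.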
Overall, DCFA enhances sensitivity to subtle differences in maturity while maintaining strong robustness under occlusion and dense distribution scenarios, thereby providing more stable and discriminative feature representations for the detection head.

3.4. ADSDH Module

This section introduces the asymmetric depthwise separable detection head (ADSDH) module. Existing detection heads typically adopt a unified branch design, which often overlooks the structural differences between classification and localization tasks when processing high-resolution fruit images and feature fusion, leading to insufficient feature utilization and increased computational burden. To address this issue, the ADSDH is designed based on the YOLOv11 detection head. By introducing differentiated structures for classification and regression branches and replacing standard convolutions with depthwise separable convolutions (DSConv), the proposed method constructs a lightweight and efficiently decoupled detection head. The ADSDH maintains detection accuracy while significantly reducing parameter count and computational complexity, thereby improving real-time inference capability on resource-constrained embedded platforms.
The overall architecture of the ADSDH is shown in Figure 4. Its core idea is to further design task-specific branches with different depths for regression and classification on top of the original decoupled detection head, thus reducing redundant computations and improving task adaptability. Specifically, the regression branch focuses on accurate modeling of fruit boundary localization and spatial geometric structure, while the classification branch emphasizes the discrimination of color and texture features, thereby achieving a balance between detection performance and inference efficiency in resource-constrained scenarios.
In the ADSDH module, multiscale features from the backbone are first fed into the detection head. Let the input feature at the $i$-th scale be $X^{(i)} \in \mathbb{R}^{B \times C_i \times H_i \times W_i}$, where $i = 1, 2, 3$. These features are simultaneously passed to both regression and classification branches for localization and classification modeling, respectively.
The regression branch is relatively deeper and consists of stacked DSConv layers to enhance the modeling of scale variation and boundary information. The feature transformation is formulated as follows:
$$R^{(i)} = \mathrm{Conv}\Bigl(\mathrm{CBS}_{1 \times 1}\bigl(\mathrm{DSConv}(\mathrm{DSConv}(\mathrm{DSConv}(X^{(i)})))\bigr)\Bigr),$$
where $R^{(i)} \in \mathbb{R}^{B \times C_r \times H_i \times W_i}$ denotes the regression feature at the $i$-th scale used for bounding box prediction. For each spatial location, the model predicts distance distributions to the four sides of the target bounding box (left, top, right, and bottom). Continuous boundary distances are then obtained via expectation over these discrete distributions, which are decoded into final bounding box coordinates.
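The expectation step described above can be illustrated with a minimal sketch. It assumes that each side's distribution is given as raw logits over integer bins 0, 1, ..., n-1, that a softmax normalizes them, and that the resulting distance is measured in feature-map cells and scaled by the stride. The function names, the bin layout, and the stride value are illustrative assumptions, not details stated in the paper.

```python
import math


def expected_distance(logits):
    """Softmax over discrete bins, then the expectation sum_k k * p_k."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return sum(k * e / total for k, e in enumerate(exps))


def decode_box(anchor_xy, side_logits, stride):
    """Decode four per-side distributions (left, top, right, bottom)
    into corner coordinates (x1, y1, x2, y2) around an anchor point."""
    cx, cy = anchor_xy
    left, top, right, bottom = (
        expected_distance(logits) * stride for logits in side_logits
    )
    return (cx - left, cy - top, cx + right, cy + bottom)
```

With uniform logits over four bins, the expected distance is 1.5 cells, so an anchor at (100, 100) with stride 8 decodes to the box (88, 88, 112, 112).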
To reduce the computational cost while preserving the representation ability, DSConv replaces standard convolutions. It decomposes convolution into depthwise convolution (DWConv) and pointwise convolution ($1 \times 1$ Conv), where DWConv extracts spatial information and pointwise convolution performs channel fusion. This operation reduces complexity from $O(C_{in} C_{out} k^2)$ to $O(C_{in} k^2 + C_{in} C_{out})$, achieving lightweight computation without significant performance loss.
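The complexity reduction can be checked numerically. The sketch below counts convolution weights only (bias and normalization parameters are ignored), and the channel width and kernel size in the usage note are illustrative assumptions, not values taken from the paper.

```python
def standard_conv_weights(c_in, c_out, k):
    """Weights of a standard k x k convolution: C_in * C_out * k^2."""
    return c_in * c_out * k * k


def dsconv_weights(c_in, c_out, k):
    """Depthwise k x k weights plus pointwise 1 x 1 weights:
    C_in * k^2 + C_in * C_out."""
    return c_in * k * k + c_in * c_out


def cost_ratio(c_in, c_out, k):
    """DSConv cost relative to a standard convolution.
    Algebraically this equals 1 / C_out + 1 / k^2."""
    return dsconv_weights(c_in, c_out, k) / standard_conv_weights(c_in, c_out, k)
```

For example, with 64 input channels, 64 output channels, and a 3 × 3 kernel, a standard convolution has 36,864 weights and the depthwise separable version has 4672, a ratio of about 0.127.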
The classification branch focuses more on appearance-level color and texture cues, which are crucial for distinguishing maturity stages. Since classification requires weaker spatial modeling, this branch adopts a lightweight structure composed of stacked $1 \times 1$ convolutions. The feature transformation is expressed as follows:
$$C^{(i)} = \mathrm{Conv}\bigl(\mathrm{CBS}_{1 \times 1}(\mathrm{CBS}_{1 \times 1}(X^{(i)}))\bigr),$$
where $C^{(i)} \in \mathbb{R}^{B \times C_c \times H_i \times W_i}$ denotes the classification feature at the $i$-th scale.
Finally, the regression and classification outputs are concatenated along the channel dimension,
$$Y^{(i)} = \mathrm{Concat}(R^{(i)}, C^{(i)}),$$
where $Y^{(i)} \in \mathbb{R}^{B \times (C_r + C_c) \times H_i \times W_i}$ is the fused detection feature at the $i$-th scale. The final bounding boxes and maturity categories are obtained through decoding.
The ADSDH achieves joint optimization of feature decoupling and a lightweight design. The regression branch focuses on spatial localization and boundary modeling, whereas the classification branch focuses on color and texture discrimination, effectively reducing intertask feature conflict. Moreover, by integrating DSConv, the ADSDH enables efficient real-time tomato maturity detection with significantly reduced computational cost while maintaining accuracy.

4. Experiments

We conduct experiments on the LaboroTomato dataset [51] to validate the effectiveness and superiority of the proposed FDA-YOLO algorithm.

4.1. Dataset

In this study, the LaboroTomato dataset is employed, which is a high-quality image dataset designed for tomato object detection and instance segmentation tasks, covering tomatoes at different maturity stages. The images are captured in orchard environments, with resolutions of 3024 × 4032 and 3120 × 4160 pixels. According to GH/T 1193-2021 [52], in combination with fruit growth conditions, tomato maturity is categorized into three stages by red coverage area: fully ripe (more than 90%), semiripe (30–90%), and green (0–30%). In addition, tomatoes are divided into large and small categories based on fruit size, resulting in a total of six classes. Examples of tomato detection results are shown in Figure 5. The dataset contains 804 images in total.
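The coverage thresholds and the size split together define the six class labels. A minimal sketch follows. The label strings follow the prefix pattern of the class names quoted later in this section (for example, b_fully_ripened and l_half_ripened); reading the `b_` prefix as large and `l_` as small, and assigning values of exactly 30% and 90% to the semiripe stage, are assumptions of this sketch. In the dataset itself the labels are provided as annotations, not computed from a coverage value.

```python
def maturity_stage(red_coverage):
    """Map a red coverage fraction in [0, 1] to a maturity stage name."""
    if not 0.0 <= red_coverage <= 1.0:
        raise ValueError("red_coverage must lie in [0, 1]")
    if red_coverage > 0.9:
        return "fully_ripened"
    if red_coverage >= 0.3:  # boundary handling is an assumption
        return "half_ripened"
    return "green"


def class_label(red_coverage, is_large):
    """Combine the size prefix with the maturity stage (six labels in total)."""
    prefix = "b" if is_large else "l"  # prefix meaning is an assumption
    return f"{prefix}_{maturity_stage(red_coverage)}"
```

For instance, a large fruit with 95% red coverage maps to `b_fully_ripened`, and a small fruit with 50% coverage maps to `l_half_ripened`.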
Owing to the limited number of samples, data augmentation techniques are applied to increase data diversity and improve the generalization ability and robustness of the convolutional neural network in complex field environments. Specifically, Gaussian blur, horizontal flipping, vertical flipping, and brightness adjustment are employed, as illustrated in Figure 6. As a result, the dataset is expanded from 804 to 1608 images, which are then divided into training, validation, and test sets at a ratio of 7:2:1.
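A 7:2:1 partition of this kind can be sketched as follows. The shuffling seed and the rounding rule are assumptions of the sketch, since the paper does not report the exact per-split image counts.

```python
import random


def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle the items and cut them into train, validation, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_train = round(len(items) * ratios[0])
    n_val = round(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test
```

Under these assumptions, 1608 images yield 1126 training, 322 validation, and 160 test images.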
In addition, to standardize the input format and reduce the computational cost, all the images are resized to 640 × 640 pixels, the default input size of YOLOv11. Larger image resolutions significantly increase training and inference time and may limit the network’s ability to effectively capture mid- and high-level features, thereby affecting detection performance [53]. In contrast, with this input size, the model benefits from a more consistent data distribution and more efficient feature processing, leading to improved computational efficiency and inference performance.
Finally, the preprocessed dataset covers variation in fruit type, maturity level, and occlusion conditions, thus providing a reliable foundation for model training and evaluation.

4.2. Experimental Settings

The proposed model was implemented using PyTorch 2.5.1 and trained on an NVIDIA RTX 4090 GPU with 24 GB memory. Stochastic gradient descent (SGD) was adopted for optimization, and the input image size was set to 640 × 640. The hardware environment and training hyperparameter settings are summarized in Table 1.

4.3. Evaluation Metrics

To comprehensively evaluate the performance of the proposed model in terms of tomato maturity detection, the precision, recall, F1 score, mean average precision (mAP), parameters, GFLOPs, and FPS are adopted as evaluation metrics. These metrics assess detection performance from different perspectives. The precision reflects the prediction accuracy, the recall reflects the coverage of positive samples, and the F1 score balances the precision and recall. The mAP evaluates the overall detection performance across multiple categories. The parameters indicate the model size and storage requirements, GFLOPs reflects the computational cost per inference, and FPS reflects the real-time inference speed. The definitions of these metrics are presented below.
Precision (P): Precision represents the proportion of correctly predicted positive samples among all the predicted positive samples. Here, $TP$ denotes true positives, and $FP$ denotes false positives. It is calculated as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall (R): Recall represents the proportion of correctly identified positive samples among all actual positive samples, where $FN$ denotes false negatives. It is calculated as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
F1 Score: The F1 score is the harmonic mean of the precision and recall, providing a balanced evaluation of prediction accuracy and completeness:
$$\mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN}$$
Mean Average Precision (mAP): mAP represents the mean value of the average precision ($AP$) across all classes in the dataset. Let $AP(c)$ denote the average precision of class $c$, and let $N_{\text{classes}}$ denote the total number of classes:
$$\mathrm{mAP} = \frac{\sum_{c=1}^{N_{\text{classes}}} AP(c)}{N_{\text{classes}}}$$
In object detection tasks, mAP is typically evaluated under different intersection over union (IoU) thresholds. Specifically, mAP50 represents the average precision at IoU = 0.5, while mAP50–95 averages the results over IoU thresholds from 0.5 to 0.95 with a step size of 0.05, thus providing a stricter evaluation of localization performance.
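The accuracy metrics defined above can be sketched in a few lines. The helper names are illustrative, and the per-class AP values are assumed to be computed elsewhere from precision-recall curves. The box format (x1, y1, x2, y2) is also an assumption of this sketch.

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of TP, FP, and FN."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_from_pr(p, r):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def mean_ap(ap_per_class):
    """Mean of per-class AP values (dict: class name -> AP)."""
    return sum(ap_per_class.values()) / len(ap_per_class)


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds averaged by mAP50-95: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]
```

With hypothetical counts TP = 80, FP = 20, and FN = 25, both F1 forms give 160/205 ≈ 0.780. Applied to the precision (81.3%) and recall (75.2%) reported for FDA-YOLO in the Conclusions, the harmonic mean is about 78.1%, which matches the reported F1 score.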
Parameters: Parameters refer to the total number of trainable parameters in the model, reflecting its structural complexity and storage requirements.
GFLOPs: GFLOPs denotes the number of floating-point operations, in billions, required for a single forward pass. It directly reflects computational cost and inference efficiency.
Frames Per Second (FPS): FPS represents the number of image frames processed per second. It is an important metric for evaluating inference speed and real-time performance:
$$\mathrm{FPS} = \frac{\text{Total Frames Processed}}{\text{Inference Time (seconds)}}$$
Unlike parameters and GFLOPs, FPS is affected by hardware configuration, input resolution, and optimization strategies, making it a more practical indicator of deployment performance.
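A minimal timing sketch of the FPS definition follows. The `infer` callable and the warm-up count are hypothetical stand-ins for a real model call; the paper does not describe its measurement procedure. On a GPU, kernels run asynchronously, so a real benchmark would also need to synchronize the device before reading the clock.

```python
import time


def measure_fps(infer, frames, warmup=3):
    """Return frames processed per second for the callable `infer`."""
    for frame in frames[:warmup]:  # warm-up runs are excluded from timing
        infer(frame)
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed
```

Because the result depends on the hardware, batch size, and input resolution, an FPS figure is only comparable between models measured under the same conditions.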

4.4. Experimental Results and Analysis

To comprehensively validate the effectiveness and superiority of the proposed model and its key components, multiple groups of experiments are designed in this section, providing a progressive analysis from a macro perspective to a micro perspective. First, the overall performance of the model is evaluated through comparisons with several mainstream methods. Second, module replacement experiments are conducted to verify the effectiveness of the FSM module. Third, a dedicated comparison of different detection head structures is performed to demonstrate the advantages of the proposed design. Fourth, ablation studies are carried out to quantitatively analyze the individual contributions and synergistic effects of each component. In addition, we conducted a five-fold cross-validation experiment, a per-category performance analysis, and a generalization evaluation on different YOLO architectures to further verify the generalization ability and discriminative capability of the model.

4.4.1. Comparison with Mainstream Models

To systematically evaluate the overall performance of the proposed model, experiments were conducted under a unified setting using YOLOv11n as the baseline. FDA-YOLO was compared with several mainstream models, including MobileNet [54], Swin Transformer [55], StarNet [56], InceptionNeXt [57], and different versions of the YOLO series. The results demonstrate the advantages of the proposed model in terms of both detection accuracy and computational efficiency.
The results are shown in Table 2. FDA-YOLO ranks first in terms of the two core metrics, mAP50 (83.4%) and mAP50–95 (67.5%), demonstrating strong and stable detection performance under different IoU thresholds. In terms of the F1 score, our method achieves 78.1%, which is only slightly lower than that of Swin Transformer-Tiny (78.8%). However, Swin Transformer-Tiny achieves this performance with 29.7 M parameters and 77.6 GFLOPs. In contrast, FDA-YOLO requires only 17.5% of the parameters and 9.8% of the computational cost while achieving superior mAP performance, with a marginal 0.7% F1 gap. This finding suggests that the slight advantage of Swin Transformer-Tiny results mainly from its larger model capacity rather than architectural efficiency. Compared with the other lightweight models, FDA-YOLO achieves the best performance in terms of recall, F1 score, mAP50, and mAP50–95 while ranking second in precision. These results demonstrate that the proposed model can extract features more effectively at a reasonable computational cost, thus significantly improving detection accuracy.
To provide a more intuitive comparison, Figure 7 illustrates the distribution of different models in terms of detection accuracy and computational cost. FDA-YOLO achieves an excellent balance between accuracy and efficiency, outperforming mainstream detectors in terms of mAP50 and mAP50–95 while maintaining only 5.2 M parameters and 7.6 GFLOPs. Its computational cost is substantially lower than that of Swin Transformer-Tiny (29.7 M parameters and 77.6 GFLOPs). This favorable trade-off makes FDA-YOLO particularly suitable for resource-constrained scenarios, such as those involving UAVs and edge devices.
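The cost ratios quoted above follow directly from the figures in the text, as this short check shows. The helper name is illustrative.

```python
def share_percent(value, reference):
    """Express `value` as a percentage of `reference`."""
    return 100.0 * value / reference


# Figures quoted in the text.
fda_params_m, fda_gflops = 5.2, 7.6        # FDA-YOLO
swin_params_m, swin_gflops = 29.7, 77.6    # Swin Transformer-Tiny

param_share = share_percent(fda_params_m, swin_params_m)  # about 17.5
flops_share = share_percent(fda_gflops, swin_gflops)      # about 9.8
f1_gap_points = 78.8 - 78.1                                # percentage points
```

The F1 difference is an absolute gap of 0.7 percentage points, whereas the parameter and GFLOPs figures are ratios.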

4.4.2. Comparison of the FSM Module

To evaluate the effectiveness of the proposed focal spatial modulation (FSM) module, a comparative study was conducted within the YOLOv11 backbone. FSM was compared with three widely used feature enhancement modules: SPPF, SPP [58], and SPPF-LSKA [59]. For fairness, all the other network settings were kept consistent with the original YOLOv11 architecture and only the feature enhancement module was replaced.
As shown in Table 3, FSM maintains a lightweight design with only 2.7 M parameters and 6.4 GFLOPs while achieving the best performance on the key metrics. The precision, mAP50, and mAP50–95 reach 82.3%, 81.5%, and 66.2%, respectively, outperforming lightweight baselines such as SPPF. FSM improves mAP50–95 because its multiscale contextual modeling mechanism generates more accurate spatial responses for objects at different scales, thereby directly enhancing bounding box regression accuracy. The F1 score remains highly competitive. Although the recall (68.6%) is slightly lower, FSM still demonstrates superior overall detection performance while maintaining computational efficiency.
These results show that FSM effectively integrates multiscale contextual information and achieves strong detection performance with low model complexity, further demonstrating its practicality for real-world deployment.

4.4.3. Comparison of the ADSDH Module

To validate the performance advantages of the ADSDH module in task decoupling and lightweight modeling, a comparative study was conducted under the same YOLOv11 backbone. The ADSDH was compared with SEAMHead [44], LQEHead [60], and detection heads with auxiliary branches [61]. For fairness, all the other network components were kept unchanged, and each comparison head retained its original design.
As shown in Table 4, the ADSDH achieves a precision of 79.9%, an F1 score of 76.3%, mAP50 of 81.6%, and mAP50–95 of 65.4%, outperforming all the compared detection heads. It also has the lowest model complexity, with only 2.3 M parameters and 5.2 GFLOPs. These results demonstrate that the ADSDH improves detection accuracy while maintaining high computational efficiency. Compared with LQEHead, the ADSDH further optimizes the classification and regression branches, achieving a better balance between feature extraction and localization accuracy. Its asymmetric depthwise separable design is particularly effective for multiscale features, improving the mAP while keeping the computational cost low.

4.4.4. Ablation Study of the Proposed Modules

To quantitatively evaluate the contribution of each proposed module to the overall detection performance, a systematic ablation study was conducted based on the YOLOv11n baseline. As shown in Table 5, a controlled variable strategy was adopted through four progressive experiments. First, the FSM and DCFA modules were individually added to the baseline to evaluate their independent contributions. Then, both modules were introduced simultaneously to examine their combined effect. Finally, the ADSDH module was incorporated to form the complete model.
The results in Table 6 show that using only the FSM module increases mAP50 from 80.5% to 81.5% and mAP50–95 from 63.6% to 66.2%, demonstrating the effectiveness of multiscale contextual feature fusion. Introducing the DCFA module improves mAP50 to 81.9% and mAP50–95 to 65.7%, highlighting its ability to enhance local feature discrimination and foreground–background separation. When FSM and DCFA are combined, mAP50 further increases to 83.2%, while mAP50–95 reaches 67.6%, indicating a clear synergistic effect. After adding the ADSDH, the complete model achieves an F1 score of 78.1%, mAP50 of 83.4%, and mAP50–95 of 67.5%. Moreover, the computational cost is reduced from 8.8 GFLOPs to 7.6 GFLOPs, and the parameter count decreases from 5.5 M to 5.2 M. These results demonstrate that the ADSDH maintains detection accuracy while effectively reducing the overall computational burden, thereby achieving a better balance between accuracy and efficiency.
The performance improvement achieved by combining these modules can be attributed to their complementary functions. The FSM module enhances global contextual representation by modeling multiscale receptive fields, which improves feature extraction for objects of different scales. The DCFA module focuses on fine-grained feature enhancement and foreground–background separation through local contrastive modeling, thereby improving target discriminability. Finally, the ADSDH serves as a lightweight detection head that optimizes the decoupling of classification and regression branches while preserving the feature representation capability. As a result, it maintains detection accuracy while reducing the computational cost.
Therefore, FSM and DCFA improve feature quality from two different perspectives: global semantic modeling and local discriminative enhancement. Hence, they are functionally complementary rather than redundant. In contrast, the ADSDH performs structural optimization at the detection head stage and does not overlap with the feature enhancement modules in the backbone and neck. Instead, it further improves feature utilization efficiency.
The combination of these three modules enables the synergistic optimization of feature representation, object discrimination, and computational efficiency, ultimately improving overall performance. Through four progressive ablation experiments, this study validates the contribution of each module to detection performance and model efficiency. The results show that the proposed feature enhancement modules and improved detection head consistently improve the baseline performance while effectively controlling the parameter size and computational complexity, achieving a strong balance between accuracy and efficiency.

4.4.5. Five-Fold Cross-Validation and Robustness Evaluation

To further evaluate the robustness and generalization ability of the proposed method, a 5-fold cross-validation experiment was conducted on the LaboroTomato dataset. Specifically, the original 804 images were divided into five mutually exclusive subsets. In each fold, four subsets were used for training, and the remaining subset was used for testing. This process was repeated five times, and the results were statistically analyzed.
To prevent data leakage, data augmentation was applied only to the training set. During each fold, Gaussian blur, horizontal flipping, vertical flipping, and brightness adjustment were adopted to increase the data diversity, while the test set remained unchanged.
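The protocol above can be sketched as follows. The shuffling seed, the way folds are formed, and the choice of one randomly selected augmentation per training image are assumptions of this sketch; the augmentation names simply label the four operations listed in the text.

```python
import random

AUGMENTATIONS = ("gaussian_blur", "horizontal_flip", "vertical_flip", "brightness")


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs over k mutually exclusive folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint subsets
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [j for f in range(k) if f != held_out for j in folds[f]]
        yield train_idx, test_idx


def training_plan(train_idx, seed=0):
    """List (image index, augmentation) pairs for the training set only.

    Each training image appears once unmodified (None) and once with a
    randomly chosen augmentation; test images never enter this plan.
    """
    rng = random.Random(seed)
    plan = [(i, None) for i in train_idx]
    plan += [(i, rng.choice(AUGMENTATIONS)) for i in train_idx]
    return plan
```

Splitting the 804 original images first and augmenting afterwards ensures that no augmented copy of a test image can appear in the training data of the same fold.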
The results of the 5-fold cross-validation are presented in Table 5. FDA-YOLO achieves stable performance across different data splits in terms of the precision, recall, F1 score, and mAP, with relatively small variations among folds. For example, mAP50 ranges from 81.3% to 84.9%, while mAP50–95 ranges from 66.5% to 70.4%, indicating that the model is not sensitive to specific data partitions.
Furthermore, compared with the baseline model, the proposed method consistently maintains performance improvements across all folds, demonstrating that it does not overfit a specific training subset.
Overall, the cross-validation results indicate that the proposed method is stable and reasonably robust on the current dataset.

4.4.6. Per-Category Performance Analysis Across Maturity Stages

To further evaluate the discriminative capability of the proposed method across different maturity stages, a categorywise performance analysis was conducted for all six classes, as shown in Table 7. Compared with the baseline YOLOv11n model, FDA-YOLO achieves performance improvements in most categories. In particular, categories such as b_fully_ripened and l_half_ripened show relatively significant improvements in both mAP50 and mAP50–95, indicating that the proposed model has a stronger recognition capability for categories with high visual similarity and greater classification difficulty. In addition, for categories that originally achieved relatively high performance, such as b_green and l_fully_ripened, the improved model maintains stable performance. This finding demonstrates that the proposed method enhances discriminative capability without sacrificing detection accuracy for easy-to-classify targets.
Overall, the results of the categorywise analysis demonstrate that the proposed method can effectively capture subtle differences among different maturity stages while alleviating multiscale detection challenges and foreground–background confusion to a certain extent. This finding further validates the effectiveness of FSM and DCFA.

4.4.7. Qualitative Visualization Analysis

To further validate the effectiveness of the proposed method for multiscale detection and foreground–background discrimination, qualitative visualization comparisons between YOLOv11n and FDA-YOLO were conducted. Representative scenarios involving occlusion, complex lighting conditions, and scale variations were selected for evaluation.
As shown in Figure 8, the baseline YOLOv11n exhibits clear category confusion. Specifically, when a half-ripened tomato is occluded by vines, the baseline model misclassifies it as a fully ripened tomato with a relatively low confidence score (0.60). In contrast, FDA-YOLO produces more accurate and consistent predictions by correctly identifying the half-ripened tomato with higher confidence scores (0.95 and 0.74) while maintaining stable recognition of green tomatoes. These findings indicate that the proposed method is more effective in handling occlusion scenarios and further validate the role of the DCFA module in alleviating foreground–background confusion.
As shown in Figure 9, this scenario is challenging because of complex lighting conditions and mutual occlusions among targets. The baseline YOLOv11n suffers from missed detections and category confusion, failing to detect one tomato target and producing an incorrect classification result (l_fully_ripened, 0.64). In contrast, FDA-YOLO successfully detects the missed target and corrects the misclassification. This difference demonstrates the effectiveness of the FSM module in handling scale variations among different tomato types (e.g., large and small fruits).
As shown in Figure 10, a tomato is heavily occluded by vines. The baseline model fails to localize the target and cannot correctly identify its size category. In contrast, FDA-YOLO successfully localizes the target and accurately predicts both its size and maturity category. This difference further demonstrates the stronger robustness of the proposed method under complex occlusion conditions.
Based on the above qualitative results, the proposed framework offers several practical advantages for real-world agricultural deployment. First, FSM and DCFA collectively enhance detection robustness under challenging field conditions, such as leaf occlusion, illumination variation, and dense fruit clusters, directly improving the success rate of automated harvesting. Second, the lightweight ADSDH detection head enables real-time inference on resource-constrained edge devices, making the model suitable for agricultural robots and UAV platforms. Third, the method achieves significant improvements on visually similar and easily confused half-ripened tomato categories, providing fine-grained maturity assessment for optimal harvest timing and post-harvest grading.

4.4.8. Generalization Evaluation

To evaluate the generality of the proposed method, we applied the proposed modules (FSM, DCFA, and the ADSDH) to another mainstream YOLO variant, YOLOv8n, under identical experimental settings. As shown in Table 8, our method brings consistent and stable improvements on both YOLOv8n and YOLOv11n. Specifically, for YOLOv8n, the proposed method achieves a precision of 81.7%, recall of 74.7%, F1 score of 78.0%, mAP50 of 82.9%, and mAP50–95 of 66.7%, which are significantly higher than the baseline. These results demonstrate that the proposed method is not specifically tailored to a single architecture but possesses good generalization capability across different YOLO versions.

5. Discussion

This study systematically investigates the design and optimization of a lightweight detection model. Overall, the results show that the proposed FDA-YOLO achieves a good trade-off between detection accuracy and computational efficiency. Specifically, FDA-YOLO outperforms the baseline model on key metrics, including precision, recall, mAP50, and mAP50–95, while maintaining stable performance across diverse unstructured natural scenes. These findings validate the effectiveness and superiority of the proposed method. All gains reported in this section are expressed as relative changes rather than absolute percentage-point differences.
FDA-YOLO consists of three core modules: FSM, DCFA, and the ADSDH. In the ablation study, the FSM module improves the ability of the baseline network to capture long-range dependencies through multiscale contextual extraction and gated aggregation. It yields a 1.2% improvement in mAP50 and a 4.1% improvement in mAP50–95, with only a slight increase in the computational cost. Notably, the DCFA module explicitly separates foreground and background features via Haar wavelet decomposition and adopts dual-path attention to model them independently, thereby enhancing target discriminability under occlusion and complex backgrounds. When combined with FSM, it increases mAP50 by 3.4% and mAP50–95 by 6.3% while keeping the computational cost manageable. The ADSDH further enhances feature discriminability while maintaining a lightweight structure. With the ADSDH added on top of FSM and DCFA, the F1 score improves by 2.5%, while the parameter count and FLOPs decrease by 5.5% and 13.6%, respectively, with mAP50 and mAP50–95 remaining largely stable.
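The relative changes quoted above can be reproduced from the ablation values reported in Section 4.4.4. The dictionary layout and helper name below are illustrative.

```python
def relative_change(new, old):
    """Relative change from `old` to `new`, in percent."""
    return 100.0 * (new - old) / old


# Ablation values quoted in Section 4.4.4.
baseline = {"mAP50": 80.5, "mAP50-95": 63.6}
fsm_only = {"mAP50": 81.5, "mAP50-95": 66.2}
fsm_dcfa = {"mAP50": 83.2, "mAP50-95": 67.6, "params_m": 5.5, "gflops": 8.8}
full_model = {"params_m": 5.2, "gflops": 7.6}

fsm_gain_map50 = relative_change(fsm_only["mAP50"], baseline["mAP50"])
fsm_gain_map5095 = relative_change(fsm_only["mAP50-95"], baseline["mAP50-95"])
both_gain_map50 = relative_change(fsm_dcfa["mAP50"], baseline["mAP50"])
both_gain_map5095 = relative_change(fsm_dcfa["mAP50-95"], baseline["mAP50-95"])
params_change = relative_change(full_model["params_m"], fsm_dcfa["params_m"])
gflops_change = relative_change(full_model["gflops"], fsm_dcfa["gflops"])
```

Rounded to one decimal place, these evaluate to 1.2, 4.1, 3.4, 6.3, -5.5, and -13.6, which agrees with the percentages stated in the text.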
FSM and DCFA improve accuracy but increase computational overhead, while the ADSDH reduces overhead through lightweight design. Overall, the three modules provide clear synergistic gains, enabling the proposed method to achieve a favorable trade-off between accuracy and computational cost, demonstrating strong efficiency and effectiveness. Considering the inference speed and resource constraints, the method is well suited for UAV inspection, field robotics, and edge computing devices, providing a practical solution for real-time deployment.
Despite these advantages, the proposed model has several limitations. First, for distant or very small fruits, limited feature information may reduce the detection accuracy. Second, the dataset covers a relatively limited range of scenes, which may affect generalization to more complex environments. Third, the model relies solely on RGB visual information without incorporating depth, spectral, or temporal cues, which may limit its applicability in broader agricultural scenarios. Finally, since all the experiments were conducted on a single dataset, a comprehensive assessment of the generalization capability of the proposed method still requires further cross-dataset validation in future work.

6. Conclusions

Accurate, real-time, and efficient object detection is crucial for autonomous tomato-picking robots during harvesting tasks. Although existing deep learning-based detection models achieve high accuracy, they are often limited by a substantial computational cost, large size, and high hardware demands, making deployment on resource-constrained agricultural robots or embedded platforms challenging. Moreover, environmental factors, such as varying illumination, occlusion, and differences in fruit maturity, further increase the complexity of the detection task. To address these challenges, this study proposes an improved FDA-YOLO model based on YOLOv11, incorporating FSM, DCFA, and ADSDH modules to facilitate feature extraction and detection head design. These enhancements strengthen global modeling capabilities, improve multiscale information fusion, and construct a lightweight detection head. The experimental results on an augmented dataset demonstrate that FDA-YOLO achieves precision, recall, F1 score, mAP50, and mAP50–95 of 81.3%, 75.2%, 78.1%, 83.4%, and 67.5%, respectively, while requiring only 7.6 GFLOPs and 5.2 M parameters. These results reflect a favorable balance between accuracy and efficiency, making the model particularly suitable for real-time deployment on embedded agricultural robots.
Future research directions include (1) optimizing the model architecture and hardware compatibility to further improve the inference speed and energy efficiency on resource-limited devices; (2) exploring multitask or multimodal information fusion, such as integrating depth, spectral, or infrared data to enhance robustness and applicability; and (3) extending the model to other crops or diverse environments to validate its cross-crop and cross-environment generalization abilities. These directions aim to broaden the practical application and deployment potential of the proposed model in intelligent agricultural systems.

Author Contributions

Conceptualization, J.S.; methodology, J.S.; software, J.S.; formal analysis, J.S.; data curation, J.S.; writing—original draft preparation, J.S.; writing—review and editing, J.G.; visualization, J.S.; supervision, J.G.; project administration, J.G.; funding acquisition, J.G.; validation, W.L.; resources, X.W.; investigation, H.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (Grant No. 62272242).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study is the publicly available Laboro Tomato dataset, which can be accessed via GitHub (https://github.com/laboroai/LaboroTomato, accessed on 1 March 2026) and Kaggle.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  1. Vats, S.; Bansal, R.; Rana, N.; Kumawat, S.; Bhatt, V.; Jadhav, P.; Kale, V.; Sathe, A.; Sonah, H.; Jugdaohsingh, R.; et al. Unexplored nutritive potential of tomato to combat global malnutrition. Crit. Rev. Food Sci. Nutr. 2022, 62, 1003–1034.
  2. Wang, Q.; Yan, N.; Qin, Y.; Zhang, X.; Li, X. BED-YOLO: An Enhanced YOLOv10n-Based Tomato Leaf Disease Detection Algorithm. Sensors 2025, 25, 2882.
  3. Cammarano, D.; Jamshidi, S.; Hoogenboom, G.; Ruane, A.C.; Niyogi, D.; Ronga, D. Processing tomato production is expected to decrease by 2050 due to the projected increase in temperature. Nat. Food 2022, 3, 437–444.
  4. FAOSTAT. Agriculture Organization of the United Nations FAO Statistical Database; FAO: Faro, Portugal, 2023.
  5. Zhang, J.; Kang, N.; Qu, Q.; Zhou, L.; Zhang, H. Automatic fruit picking technology: A comprehensive review of research advances. Artif. Intell. Rev. 2024, 57, 54.
  6. Mitaritonna, C.; Ragot, L. After COVID-19, will seasonal migrant agricultural workers in Europe be replaced by robots? CEPII Policy Brief 2020, 33, 1–10.
  7. Kizielewicz, B.; Wątróbski, J.; Sałabun, W. Multi-Criteria Decision Support System for the Evaluation of UAV Intelligent Agricultural Sensors. Artif. Intell. Rev. 2025, 58, 194.
  8. Zhao, J.; Fan, S.; Zhang, B.; Wang, A.; Zhang, L.; Zhu, Q. Research Status and Development Trends of Deep Reinforcement Learning in the Intelligent Transformation of Agricultural Machinery. Agriculture 2025, 15, 1223.
  9. Erdoǧan, D.; Güner, M.; Dursun, E.; Gezer, I. Mechanical harvesting of apricots. Biosyst. Eng. 2003, 85, 19–28.
  10. Miao, Z.; Yu, X.; Li, N.; Zhang, Z.; He, C.; Li, Z.; Deng, C.; Sun, T. Efficient tomato harvesting robot based on image processing and deep learning. Precis. Agric. 2023, 24, 254–287.
  11. Dubey, S.R.; Jalal, A.S. Fusing color and texture cues to categorize the fruit diseases from images. arXiv 2014, arXiv:1412.7277.
  12. Sun, J.T.; Sun, Y.F.; Zhao, R.; Ji, Y.H.; Zhang, M.; Li, H. Tomato Recognition Method Based on Iterative Random Circle and Geometric Morphology. Nongye Jixie Xuebao Trans. Chin. Soc. Agric. Mach. 2019, 50, 22–26+61.
  13. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  14. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
  15. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  16. Jocher, G.; Qiu, J. Ultralytics YOLO11. Available online: https://github.com/ultralytics/ultralytics (accessed on 19 June 2025).
  17. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  18. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
  19. McQueen, J.B. Some Methods of Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1967; pp. 281–297.
  20. Zhou, W.; Zha, Z.; Wu, J. Maturity discrimination of “Red Globe” grape cluster in grapery by improved circle Hough transform. Trans. Chin. Soc. Agric. Eng. 2020, 36, 205–213.
  21. Wang, H.; Cui, Y. Tomato color detection research based on computer vision. Agric. Technol. Equip. 2017, 49–51. (In Chinese)
  22. Huang, Y.; Si, W.; Chen, K.; Sun, Y. Assessment of tomato maturity in different layers by spatially resolved spectroscopy. Sensors 2020, 20, 7229. [Google Scholar] [CrossRef] [PubMed]
  23. Liu, G.; Mao, S.; Kim, J.H. A Mature-Tomato Detection Algorithm Using Machine Learning and Color Analysis. Sensors 2019, 19, 2023. [Google Scholar] [CrossRef]
  24. Kaur, R.; Singh, S. A Comprehensive Review of Object Detection with Deep Learning. Digit. Signal Process. 2023, 132, 103812. [Google Scholar] [CrossRef]
  25. Yasmine, G.; Maha, G.; Hicham, M. Overview of Single-Stage Object Detection Models: From YOLOv1 to YOLOv7. In Proceedings of the 2023 International Wireless Communications and Mobile Computing (IWCMC), Marrakesh, Morocco, 19–23 June 2023; pp. 1579–1584. [Google Scholar] [CrossRef]
  26. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  27. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  28. Rong, Y. Maturity Grading and Size Detection of Greenhouse Tomatoes Based on an Improved Faster R-CNN. Ph.D. Thesis, Shanxi Agricultural University, Jinzhong, China, 2021. [Google Scholar]
  29. Long, J.; Zhao, C.; Lin, S.; Guo, W.; Wen, C.; Zhang, Y. Segmentation method of the tomato fruits with different maturities under greenhouse environment based on improved Mask R-CNN. Trans. Chin. Soc. Agric. Eng. 2021, 37, 100–108. [Google Scholar] [CrossRef]
  30. Yue, Y.; Sun, B.; Wang, H.; Zhao, H. Object Detection of Tomato Fruit Based on Cascade RCNN. Sci. Technol. Eng. 2021, 21, 2387–2391. [Google Scholar]
  31. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.H.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W. CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2016; pp. 770–778. [Google Scholar]
  33. Su, F.; Zhao, Y.; Wang, G.; Liu, P.; Yan, Y.; Zu, L. Tomato Maturity Classification Based on SE-YOLOv3-MobileNetV1 Network under Nature Greenhouse Environment. Agronomy 2022, 12, 1638. [Google Scholar] [CrossRef]
  34. Lu, J.; Fu, Y.; Ni, M.; Cao, W.; Du, Z. Research on tomato maturity detection method based on improved YOLOv4 model. Food Mach. 2023, 39, 134–139. [Google Scholar] [CrossRef]
  35. Zhang, M. Study on Tomato Maturity Detection and System Implementation Based on Improved YOLOv5s. Master’s Thesis, Anhui University, Hefei, China, 2024. [Google Scholar]
  36. Han, Y. Study on Papaya Maturity Detection on Tree Based on YOLOv5-Lite. Master’s Thesis, South China Agricultural University, Guangzhou, China, 2023. [Google Scholar]
  37. Zhang, T.; Wu, F.; Wang, M.; Chen, Z.; Li, L.; Zou, X. Grape-Bunch Identification and Location of Picking Points on Occluded Fruit Axis Based on YOLOv5-GAP. Horticulturae 2023, 9, 498. [Google Scholar] [CrossRef]
  38. Peng, H.; Li, J.; Xu, H.; Chen, H.; Xing, Z.; He, H.; Xiong, J. Litchi detection based on multiple feature enhancement and feature fusion SSD. Trans. Chin. Soc. Agric. Eng. 2022, 38, 169–177. [Google Scholar] [CrossRef]
  39. Zhou, Y.; Tang, Y.; Zou, X.; Wu, M.; Tang, W.; Meng, F.; Zhang, Y.; Kang, H. Adaptive active positioning of Camellia oleifera fruit picking points: Classical image processing and YOLOv7 fusion algorithm. Appl. Sci. 2022, 12, 12959. [Google Scholar] [CrossRef]
  40. Wei, J.; Ni, L.; Luo, L.; Chen, M.; You, M.; Sun, Y.; Hu, T. GFS-YOLO11: A Maturity Detection Model for Multi-Variety Tomato. Agronomy 2024, 14, 2644. [Google Scholar] [CrossRef]
  41. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  42. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  43. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  44. Yu, Z.; Huang, H.; Chen, W.; Su, Y.; Liu, Y.; Wang, X. YOLO-Facev2: A Scale and Occlusion Aware Face Detector. Pattern Recognit. 2024, 155, 110714. [Google Scholar] [CrossRef]
  45. Han, H.; Zhang, Q.; Li, F.; Du, Y. Foreground Capture Feature Pyramid Network-Oriented Object Detection in Complex Backgrounds. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 6925–6939. [Google Scholar] [CrossRef]
  46. Cheng, S.; Han, Y.; Wang, Z.; Liu, S.; Yang, B.; Li, J. An Underwater Object Recognition System Based on Improved YOLOv11. Electronics 2025, 14, 201. [Google Scholar] [CrossRef]
  47. Luo, Y.; Wu, A.; Fu, Q. MAS-YOLOv11: An Improved Underwater Object Detection Algorithm Based on YOLOv11. Sensors 2025, 25, 3433. [Google Scholar] [CrossRef] [PubMed]
  48. Talaat, F.M.; El-Balka, R.M.; Sweidan, S.; Gamel, S.A.; Al-Zoghby, A.M. Smart Traffic Management System Using YOLOv11 for Real-Time Vehicle Detection and Dynamic Flow Optimization in Smart Cities. Neural Comput. Appl. 2025, 37, 19957–19974. [Google Scholar] [CrossRef]
  49. Yin, X.; Zhao, X. YOLOv11-MAS: An efficient PCB defect detection algorithm. J. Comput. Eng. Appl. 2025, 61, 102–110. [Google Scholar] [CrossRef]
  50. Heil, C.E.; Walnut, D.F. Continuous and Discrete Wavelet Transforms. SIAM Rev. 1989, 31, 628–666. [Google Scholar] [CrossRef]
  51. Li, R.; Ji, Z.; Hu, S.; Huang, X.; Yang, J.; Li, W. Tomato Maturity Recognition Model Based on Improved YOLOv5 in Greenhouse. Agronomy 2023, 13, 603. [Google Scholar] [CrossRef]
  52. GH/T 1193-2021; Tomato. Standardization Administration of China: Beijing, China, 2021.
  53. Saponara, S.; Elhanashi, A. Impact of Image Resizing on Deep Learning Detectors for Training Time and Model Performance. In International Conference on Applications in Electronics Pervading Industry, Environment and Society; Springer: Berlin/Heidelberg, Germany, 2021; pp. 10–17. [Google Scholar]
  54. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  55. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  56. Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the Stars. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 5694–5703. [Google Scholar] [CrossRef]
  57. Yu, W.; Zhou, P.; Yan, S.; Wang, X. InceptionNext: When Inception Meets ConvNeXt. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5672–5683. [Google Scholar]
  58. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  59. Lau, K.W.; Po, L.-M.; Rehman, Y.A.U. Large Separable Kernel Attention: Rethinking the Large Kernel Attention Design in CNN. Expert Syst. Appl. 2024, 236, 121352. [Google Scholar] [CrossRef]
  60. Li, X.; Wang, W.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11632–11641. [Google Scholar]
  61. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
Figure 1. Overall framework of the proposed method.
Figure 2. Architecture of FSM.
Figure 3. Architecture of DCFA.
Figure 4. Architecture of the ADSDH.
Figure 5. Examples of tomato detection results.
Figure 6. Examples of data augmentation techniques applied to the dataset.
Figure 7. Comparison of different models in terms of accuracy and computational complexity.
Figure 8. Detection comparison between YOLOv11n (left) and FDA-YOLO (right) (1).
Figure 9. Detection comparison between YOLOv11n (left) and FDA-YOLO (right) (2).
Figure 10. Detection comparison between YOLOv11n (left) and FDA-YOLO (right) (3).
Table 1. Hardware environment and training parameter settings.
| Items | Configuration |
| --- | --- |
| Operating System | Ubuntu 22.04.5 LTS (Jammy Jellyfish) |
| CPU | AMD EPYC 7542 32-Core Processor |
| Memory Size | 32 GB |
| GPU (Memory Size) | NVIDIA GeForce RTX 4090 (24 GB) |
| CUDA Version | 12.1 |
| Python Version | 3.10 |
| Batch Size | 32 |
| Momentum | 0.937 |
| Epochs | 300 |
| Patience | 20 |
| Weight Decay | 0.0005 |
All experiments were conducted under the above hardware and training configurations.
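As a rough illustration, the training-related rows of Table 1 can be collected into a small configuration object. This is only a sketch: the class `TrainConfig` and the helper `as_train_kwargs` are hypothetical names, the field names are an assumption chosen to resemble common YOLO training arguments, and reading "Patience" as an early-stopping patience is also an assumption, since the table lists only the values.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the training values listed in Table 1."""
    epochs: int = 300             # "Epoch" row
    batch: int = 32               # "Batch Size" row
    patience: int = 20            # "Patience" row; assumed to mean early-stopping patience
    momentum: float = 0.937       # "Momentum" row
    weight_decay: float = 0.0005  # "Weight Decay" row


def as_train_kwargs(cfg: TrainConfig) -> dict:
    """Return the settings as a plain dict, e.g. to pass as keyword arguments."""
    return asdict(cfg)


if __name__ == "__main__":
    print(as_train_kwargs(TrainConfig()))
```

Keeping the values in one frozen object makes it harder for individual runs to drift from the reported settings.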
Table 2. Experimental results of different models on the dataset.
| Models | Precision (%) | Recall (%) | F1 (%) | mAP50 (%) | mAP50–95 (%) | Params (M) | GFLOPs | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MobileNetV4 | 71.5 | 74.5 | 73.0 | 78.0 | 62.0 | 5.4 | 21.0 | 87.2 |
| Swin Transformer-Tiny | 81.9 | 76.0 | 78.8 | 83.3 | 66.7 | 29.7 | 77.6 | 71.9 |
| StarNet-Small | 79.9 | 68.5 | 73.8 | 78.0 | 60.7 | 2.4 | 6.5 | 105.1 |
| InceptionNeXt-Tiny | 74.0 | 65.1 | 69.3 | 74.3 | 56.4 | 26.2 | 72.7 | 79.4 |
| YOLOv3t | 80.0 | 70.7 | 75.1 | 78.4 | 64.0 | 9.5 | 14.5 | 91.1 |
| YOLOv5n | 76.9 | 73.7 | 75.3 | 80.9 | 64.9 | 2.6 | 7.2 | 111.7 |
| YOLOv8n | 80.0 | 70.2 | 74.8 | 80.4 | 63.6 | 3.0 | 8.2 | 95.6 |
| YOLOv10n | 80.6 | 70.4 | 75.2 | 79.9 | 64.7 | 2.7 | 8.4 | 96.8 |
| YOLOv12n | 81.5 | 68.1 | 74.2 | 79.6 | 62.5 | 2.5 | 6.0 | 135.6 |
| YOLOv13n | 75.2 | 73.7 | 74.4 | 79.0 | 62.5 | 2.5 | 6.3 | 127.3 |
| YOLOv11n | 77.2 | 72.9 | 75.0 | 80.5 | 63.6 | 2.6 | 6.4 | 122.6 |
| Ours | 81.3 | 75.2 | 78.1 | 83.4 | 67.5 | 5.2 | 7.6 | 108.7 |
The best and second-best results are highlighted in red and blue, respectively, while the worst results are shaded in gray.
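The F1 column in Table 2 appears consistent with the usual definition of F1 as the harmonic mean of precision and recall. The short sketch below recomputes it for three rows copied from the table. The function names and the 0.1-point tolerance (to absorb rounding of the published values) are illustrative assumptions, not part of the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from Table 2.
ROWS = {
    "YOLOv11n": (77.2, 72.9, 75.0),
    "Swin Transformer-Tiny": (81.9, 76.0, 78.8),
    "Ours": (81.3, 75.2, 78.1),
}


def check_rows(rows: dict, tol: float = 0.1) -> dict:
    """Map each model to True if the recomputed F1 is within tol of the reported value."""
    return {
        name: abs(f1_score(p, r) - reported) <= tol
        for name, (p, r, reported) in rows.items()
    }


if __name__ == "__main__":
    for name, ok in check_rows(ROWS).items():
        print(f"{name}: {'consistent' if ok else 'mismatch'}")
```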
Table 3. Comparison of different focal modulation modules.
| Models | Precision (%) | Recall (%) | F1 (%) | mAP50 (%) | mAP50–95 (%) | Params (M) | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SPPF | 77.2 | 72.9 | 75.0 | 80.5 | 63.6 | 2.6 | 6.4 |
| SPP | 71.3 | 71.5 | 71.4 | 76.9 | 58.8 | 2.6 | 6.4 |
| SPPF-LSKA | 77.2 | 72.8 | 74.9 | 80.4 | 64.2 | 2.9 | 6.7 |
| FSM | 82.3 | 68.6 | 74.8 | 81.5 | 66.2 | 2.7 | 6.4 |
The best results are highlighted in red.
Table 4. Comparison of different detection heads.
| Models | Precision (%) | Recall (%) | F1 (%) | mAP50 (%) | mAP50–95 (%) | Params (M) | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SEAMHead | 74.2 | 73.7 | 73.9 | 79.3 | 61.3 | 2.5 | 5.9 |
| LQEHead | 78.1 | 72.3 | 75.1 | 80.7 | 64.6 | 2.6 | 6.5 |
| aux | 76.7 | 73.4 | 75.0 | 79.9 | 64.3 | 3.0 | 8.2 |
| ADSDH | 79.9 | 73.0 | 76.3 | 81.6 | 65.4 | 2.3 | 5.2 |
The best results are highlighted in red.
Table 5. Five-fold cross-validation results of FDA-YOLO.
| Fold | Precision (%) | Recall (%) | F1 (%) | mAP50 (%) | mAP50–95 (%) |
| --- | --- | --- | --- | --- | --- |
| fold1 | 78.8 | 74.7 | 76.7 | 81.3 | 66.5 |
| fold2 | 80.0 | 78.3 | 79.1 | 84.9 | 70.4 |
| fold3 | 80.1 | 72.6 | 76.2 | 82.3 | 67.0 |
| fold4 | 76.3 | 76.0 | 76.1 | 82.5 | 66.6 |
| fold5 | 81.6 | 76.6 | 79.0 | 84.6 | 68.7 |
| Avg | 79.4 | 75.6 | 77.4 | 83.1 | 67.8 |
Average performance across all folds is reported in bold.
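The "Avg" row of Table 5 can be reproduced by taking the arithmetic mean of each metric over the five folds. The sketch below does this with Python's standard `statistics` module and also computes the sample standard deviation. The spread is not reported in the source, so it is included only as an illustrative addition, and the dictionary keys are hypothetical names.

```python
from statistics import mean, stdev

# Per-fold values (%) copied from Table 5, in order fold1..fold5.
FOLDS = {
    "precision": [78.8, 80.0, 80.1, 76.3, 81.6],
    "recall": [74.7, 78.3, 72.6, 76.0, 76.6],
    "f1": [76.7, 79.1, 76.2, 76.1, 79.0],
    "map50": [81.3, 84.9, 82.3, 82.5, 84.6],
    "map50_95": [66.5, 70.4, 67.0, 66.6, 68.7],
}


def summarize(folds: dict) -> dict:
    """Return {metric: (mean, sample standard deviation)} across folds."""
    return {name: (mean(values), stdev(values)) for name, values in folds.items()}


if __name__ == "__main__":
    for name, (avg, spread) in summarize(FOLDS).items():
        print(f"{name}: mean={avg:.1f}, std={spread:.2f}")
```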
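Table 6. Ablation study of the proposed components.
| Methods | FSM | DCFA | ADSDH | P (%) | R (%) | F1 (%) | mAP50 (%) | mAP50–95 (%) | Params (M) | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv11n | | | | 77.2 | 72.9 | 75.0 | 80.5 | 63.6 | 2.6 | 6.4 |
| YOLOv11n | | | | 82.3 | 68.6 | 74.8 | 81.5 | 66.2 | 2.7 | 6.4 |
| YOLOv11n | | | | 79.6 | 73.2 | 76.3 | 81.9 | 65.7 | 5.4 | 8.6 |
| YOLOv11n | | | | 78.1 | 74.4 | 76.2 | 83.2 | 67.6 | 5.5 | 8.8 |
| YOLOv11n | | | | 81.3 | 75.2 | 78.1 | 83.4 | 67.5 | 5.2 | 7.6 |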
P: precision; R: recall. The best and second-best results are highlighted in red and blue, respectively, while the worst results are shaded in gray. ‘–’ indicates the module is not used; ‘✓’ indicates the module is used.
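To make the accuracy/cost trade-off in Table 6 concrete, the following sketch compares the first row (the YOLOv11n baseline) with the last row, whose values match the "Ours" row of Table 2. It reports absolute and relative changes for a few columns. The function name `deltas` and the dictionary keys are hypothetical, and the selection of columns is only an example.

```python
def deltas(baseline: dict, model: dict) -> dict:
    """Return {metric: (absolute change, relative change in %)} versus the baseline."""
    out = {}
    for key, base_value in baseline.items():
        diff = model[key] - base_value
        out[key] = (diff, 100.0 * diff / base_value)
    return out


# First and last rows of Table 6 (accuracy in %, parameters in millions).
BASELINE = {"map50": 80.5, "map50_95": 63.6, "params_m": 2.6, "gflops": 6.4}
FULL = {"map50": 83.4, "map50_95": 67.5, "params_m": 5.2, "gflops": 7.6}


if __name__ == "__main__":
    for name, (absolute, relative) in deltas(BASELINE, FULL).items():
        print(f"{name}: {absolute:+.1f} ({relative:+.1f}%)")
```

For accuracy columns the absolute change is in percentage points, while for parameters and GFLOPs the relative change is usually the more informative figure.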
Table 7. Comparison of mAP performance across six categories (YOLOv11n → FDA-YOLO).
| Categories | mAP50 (%) | mAP50–95 (%) |
| --- | --- | --- |
| b_fully_ripened | 77.0 → 80.0 | 61.5 → 68.3 |
| b_half_ripened | 78.3 → 84.8 | 59.0 → 69.8 |
| b_green | 90.0 → 90.0 | 70.5 → 70.8 |
| l_fully_ripened | 80.6 → 82.0 | 67.3 → 67.4 |
| l_half_ripened | 76.8 → 80.0 | 62.1 → 65.2 |
| l_green | 80.5 → 83.8 | 61.1 → 63.5 |
Significant improvements are highlighted in red.
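Mean average precision is conventionally the unweighted mean of the per-class values, so the six rows of Table 7 should average to the overall figures reported for YOLOv11n and FDA-YOLO in Table 2, up to rounding. The sketch below checks this and finds the category with the largest gain. The helper names and the data layout are illustrative assumptions.

```python
# category: ((mAP50 baseline, mAP50 FDA-YOLO), (mAP50-95 baseline, mAP50-95 FDA-YOLO))
# Values (%) copied from Table 7.
PER_CLASS = {
    "b_fully_ripened": ((77.0, 80.0), (61.5, 68.3)),
    "b_half_ripened": ((78.3, 84.8), (59.0, 69.8)),
    "b_green": ((90.0, 90.0), (70.5, 70.8)),
    "l_fully_ripened": ((80.6, 82.0), (67.3, 67.4)),
    "l_half_ripened": ((76.8, 80.0), (62.1, 65.2)),
    "l_green": ((80.5, 83.8), (61.1, 63.5)),
}


def macro_mean(per_class: dict, metric: int, model: int) -> float:
    """Unweighted mean over classes; metric 0=mAP50, 1=mAP50-95; model 0=baseline, 1=FDA-YOLO."""
    values = [row[metric][model] for row in per_class.values()]
    return sum(values) / len(values)


def largest_gain(per_class: dict, metric: int) -> str:
    """Name of the category with the largest improvement for the given metric."""
    return max(per_class, key=lambda k: per_class[k][metric][1] - per_class[k][metric][0])


if __name__ == "__main__":
    print(f"mAP50:    {macro_mean(PER_CLASS, 0, 0):.1f} -> {macro_mean(PER_CLASS, 0, 1):.1f}")
    print(f"mAP50-95: {macro_mean(PER_CLASS, 1, 0):.1f} -> {macro_mean(PER_CLASS, 1, 1):.1f}")
    print("largest mAP50-95 gain:", largest_gain(PER_CLASS, 1))
```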
Table 8. Generalization evaluation on different YOLO architectures.
| Model | Precision (%) | Recall (%) | F1 (%) | mAP50 (%) | mAP50–95 (%) |
| --- | --- | --- | --- | --- | --- |
| YOLOv8n | 80.0 | 70.2 | 74.8 | 80.4 | 63.6 |
| YOLOv8n + Ours | 81.7 | 74.7 | 78.0 | 82.9 | 66.7 |
| YOLOv11n | 77.2 | 72.9 | 75.0 | 80.5 | 63.6 |
| YOLOv11n + Ours | 81.3 | 75.2 | 78.1 | 83.4 | 67.5 |
Results improved by our method are highlighted in red.
Share and Cite

MDPI and ACS Style

Shi, J.; Luo, W.; Wang, X.; Guo, J.; Ren, H. FDA-YOLO: A Feature Fusion and Attention-Based Network for Multiscale Tomato Maturity Detection in Real-World Agricultural Scenarios. Sensors 2026, 26, 3404. https://doi.org/10.3390/s26113404