Article

Tomato Ripening Detection in Complex Environments Based on Improved BiAttFPN Fusion and YOLOv11-SLBA Modeling

1 College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China
2 School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(12), 1310; https://doi.org/10.3390/agriculture15121310
Submission received: 21 May 2025 / Revised: 13 June 2025 / Accepted: 16 June 2025 / Published: 18 June 2025
(This article belongs to the Section Digital Agriculture)

Abstract
Deep learning-based tomato ripening detection faces several pressing issues in complex environments: the ripening transition stages are not distinguished accurately enough, small-target tomatoes are easily missed, and detection is highly susceptible to variations in illumination. Building on the YOLOv11 model, this study presents a YOLOv11-SLBA tomato ripeness detection model. First, SPPF-LSKA replaces SPPF in the backbone, greatly improving the model's feature discrimination in challenging scenes such as dense occlusion and uneven illumination. Second, a new BiAttFPN hierarchical progressive fusion is added in the neck to improve feature retention for small targets under occlusion. Lastly, the auxiliary detection head DetectAux significantly enhances the feature separability of similar categories. Comparative experiments were carried out to confirm model performance. Under identical settings, the YOLOv11-SLBA model was compared with other target detection networks, including Faster R-CNN, SSD, RT-DETR, YOLOv7, YOLOv8, and YOLOv11. With 2.7 million parameters and 10.9 MB of model memory, YOLOv11-SLBA achieves 92% precision (P), 83.5% recall (R), 91.3% mAP50, 64.6% mAP50-95, and an 87.5% F1-score: a 3.4% improvement in precision, a 1.5% improvement in average precision, and a 1.6% improvement in F1-score over the baseline YOLOv11, outperforming all the other compared models on every indicator. Furthermore, the YOLOv11-SLBA model was tested on the public tomato-ripeness1 dataset, yielding a precision of 78.6%, recall of 91.5%, mAP50 of 93.7%, and F1-score of 84.6%. This demonstrates that the model performs well across datasets, greatly enhances detection generalization in complex settings, and can serve as a reference for the algorithm design of picking-robot vision systems.

1. Introduction

China, which is a major exporter and consumer of tomatoes, still harvests and grades tomatoes primarily by hand. However, the irregular ripening period of tomato fruits and the limited accuracy and efficiency of manual operations cause high harvesting expenses [1]. Furthermore, tomatoes are climacteric fruits that continue to ripen after harvesting, have short ripening windows, and produce large yields [2]. Fruits at different ripeness stages must be carefully chosen based on storage duration and transit distance in order to ensure quality. To optimize the picking schedule, it is therefore crucial to quickly and accurately identify tomato fruits and determine their distribution of ripeness. As smart agriculture advances, machine picking based on target identification is anticipated to become a significant development trend in tomato harvesting [3]. To accomplish this, it is necessary to first accurately identify and locate tomatoes, then assess their ripeness and distinguish the several stages of tomato harvesting [4]. Accurately detecting ripeness determines the optimal time to pick tomatoes, which not only ensures quality but also increases market value and profitability. Conversely, tomatoes lose quality and market value when they are not harvested at the right maturation stage, which lowers profitability [5]. It is evident that research on tomato maturity identification technology has practical relevance and non-negligible application value for achieving accurate and efficient tomato harvesting in real production as well as enhancing tomato planting quality and yield.
Deep learning-based techniques for detecting tomato ripening have yielded impressive outcomes in recent years. Halstead et al. [6] proposed a robotic vision system that uses a parallel layer in the Faster R-CNN architecture to improve pepper ripeness estimation. According to the experimental results, when ripeness was treated as an extra category, the proposed parallel layer improved the F1-score from 72.5% (multiclass) to 77.3%. However, the framework's reduced accuracy on small targets may affect ripeness estimation, especially for small bell peppers. Behera et al. [7] presented two lossless papaya ripeness classification techniques based on machine learning and transfer learning. Experiments showed that both techniques attained 100% classification accuracy. However, the machine learning techniques require segmented processing, which is laborious, and their data-processing capacity is limited. Zu et al. [8] created a Mask R-CNN network for the recognition and segmentation of green tomatoes in greenhouse settings, achieving an F1-score of 92.0%.
For ripe strawberry recognition, Yu et al. [9] employed Mask R-CNN, which showed better results, particularly for overlapping and hidden fruits. For fruit identification, Kang et al. [10] proposed a Mobile-DasNet network together with a segmentation network, achieving 90% detection accuracy and 82% instance segmentation accuracy. Zhang et al. [11] proposed a greenhouse tomato recognition and pose classification technique based on an enhanced YOLOv5, aiming to lower picking collisions and increase picking success rates. Dandan Wang et al. [12] proposed a channel-pruned YOLOv5s algorithm for the quick and precise identification of young apple fruits. With an accuracy of 95.8%, the trial findings demonstrated that the method worked effectively under a variety of circumstances, serving as a guide for the development of portable fruit-thinning equipment and orchard management. Sheping Zhai [13] suggested an enhanced technique based on DenseNet and feature fusion to boost SSD's performance in small-target recognition. Experimental results showed a 3.1% improvement in DF-SSD detection accuracy with fewer parameters; however, this may not extend to small-target identification in extremely complicated backgrounds.
Yifan Bai et al. [14] proposed an improved YOLOv7 model for the real-time recognition of strawberry seedling flowers and fruits in greenhouses, integrating the GS-ELAN neck optimization module with a Swin Transformer prediction head. Although the test results demonstrate the new model's robustness and real-time detection capability, problems may still occur when processing low-contrast images. Seo et al. [15] employed Faster R-CNN under the HSV color model to identify hydroponically grown tomatoes by separating the tomato fruit region from the background with a K-means clustering algorithm. Sun et al. [16] proposed a tomato ripeness detection algorithm based on an improved YOLOv8n, which significantly improved the model's ability to capture key tomato features (e.g., color change and spatial localization) by introducing the RCA-CBAM module to integrate color, channel, and spatial attention mechanisms. That study used the BiFPN module in place of the traditional PANet to realize multi-scale feature fusion and designed the Inner-FocalerIoU loss function to address sample imbalance in ripeness classification. Experiments showed that the improved YOLOv8 model achieved 95.8% precision and 91.7% accuracy on the test set, but robustness under extreme occlusion conditions was not verified.
To summarize the research status both domestically and internationally, current deep learning-based tomato ripeness detection techniques have made some progress in classification criteria, model application, and targeted enhancements. While some studies enhanced models for particular circumstances to increase adaptability and real-time detection, others optimized model structures and methods to obtain high recognition accuracy on particular datasets and in particular contexts [17]. However, several issues remain in the current research:
(1) It is difficult to distinguish the ripeness transition stages in complicated surroundings, and small-tomato detection is prone to false negatives.
(2) Some techniques are computationally costly, need substantial disk space, and take a long time to run, making it impossible to meet the rapid detection requirements of real-world applications.
(3) Accurately identifying tomatoes in the ripeness transition stage still needs work, and the majority of research relies on color and shape features, which are insufficiently reliable for detecting tomatoes in complicated situations.
To address the aforementioned issues, enhance tomato ripeness detection in intricate settings, and offer workable technological solutions for the effective implementation of tomato ripeness detection in actual production scenarios, this paper proposes a multi-module fusion YOLOv11-SLBA tomato ripeness detection model based on YOLOv11 and deep learning technology.

2. Materials and Methods

2.1. Sources and Acquisition of Images

The tomato plantation area in Wangfufu Town, Karakin Banner, Chifeng City, Inner Mongolia (118°42′00″ E, 41°55′48″ N), dubbed the "Hometown of the Tomato in China", served as the research base for this paper. We acquired image data for various tomato growth periods there. Between 8 January and 19 March 2025, the study team carried out the image acquisition campaign at the park in a systematic way. The research team communicated extensively with local fruit growers to gain a precise understanding of tomato growing characteristics. According to the fruit farmers, the tomato cultivars used in this study had a growth cycle of about 26–37 days from the flowering stage to the red ripening stage. However, this cycle is not constant; it is affected by a number of environmental factors, including soil conditions, temperature, and humidity, in addition to the genetic traits of the cultivars. This information deepens our understanding of the connections between environmental conditions and tomato growth and offers contextual support for further research.
The Huawei Mate 50 smart terminal (Huawei Technologies Co., Ltd., Shenzhen, China) was selected by the research team as the image acquisition device. The phone's rear dual-camera setup includes a 12-megapixel wide-angle primary camera whose close-focus capability effectively ensures image clarity in close-up shots. The 12-megapixel ultra-wide-angle lens broadens the field of view, allowing a more thorough collection of tomato growth data. Simultaneously, HDR 4 technology improves image processing to guarantee high-quality photos, thereby providing a dependable dataset for model training [18]. The device can automatically adjust settings such as focal length, aperture, and white balance during shooting to accommodate various scenarios [19]. To fully document the tomato's growing state from various perspectives, the shooting distance was regulated between 10 and 50 cm, and images were taken from a range of angles, including overhead, level, and upward. The acquired photos have a resolution of up to 3024 × 4032 pixels, which allows them to clearly display the tomato's intricate features [20].
Based on color and size features, the current national standard GH/T1193-2021 [21] divides tomato maturity into six stages: unripe, green ripe, color change, early red ripe, mid-red ripe, and late red ripe [22]. The colored portion of the fruit makes up 40–60% in the mid-red ripe stage and 60–100% in the late red ripe stage. Only fruits from these two maturity phases are picked in greenhouse farming settings. Modern agricultural product logistics systems immediately benefit from the detection of green tomatoes. First, they have excellent storage and transportation performance: the epidermis of green tomatoes is thicker, and their mechanical strength is 35–40% higher than that of mature fruits. Second, they can undergo controlled ripening: by regulating ethylene, the market release time can be precisely controlled to meet the timeliness requirements of cross-border trade. The detection targets in this study are primarily categorized into four stages: flowering and young fruit stage, green growth stage, yellowing and semi-ripe stage, and red mature stage. Table 1 below displays the features of each tomato maturity category as well as the relationship between the categories in this study and national criteria.
A total of more than 6400 valid images were collected in this study. Based on tomato ripeness, they were mainly divided into four categories: flowering and young fruit stage, green growth stage, yellowing and semi-ripe stage, and red ripe stage. Figure 1 shows photos selected from several categories in the dataset created in this study.
Table 2 displays the statistics of the quantity of photos of the various tomato maturity groups. Particularly noteworthy is the fact that some photos included tomatoes of several ripeness types simultaneously, which complicated the data processing and analysis that followed [23] but also offered a wealth of information for researching the dynamic shifts in the tomato growth process.

2.2. Data Preprocessing and Environmental Enhancement

The original images underwent rigorous quality screening to exclude low-quality and blurred samples caused by suboptimal shooting conditions. To enhance model focus on agriculturally relevant features, selective background cropping was performed using Adobe Photoshop 2024. Specifically, we removed only non-environmental artifacts such as greenhouse walls and support structures while deliberately retaining natural environmental factors including foliage, lighting variations, and partial occlusions. All processed images maintained a fixed aspect ratio to ensure geometric consistency. This approach balanced the need for clean training data with the preservation of realistic field conditions critical for model generalization.
In order to overcome overfitting, improve data diversity, balance the number of samples per category, and create a high-quality dataset that improves tomato localization and picking ability, complex-environment data enhancement is applied to the tomato ripening dataset to simulate the effects of various environments on the images. Using Python's (version 3.10.16) OpenCV package (version 4.11.0) [24], we simulated four common greenhouse environmental disturbances in the following ways (a minimal code sketch follows the list):
(1) Spray simulation: To replicate the effects of spray cooling operations, droplets of various sizes and colors are added sporadically. As Figure 2 illustrates, transparency blending (α = 0.6) is used to draw 200–500 white line segments (length 10–20 pixels) at random to mimic raindrops, creating the illusion of water mist from the sprinkler system. The technique can plausibly simulate the adhesion and refraction of water droplets on the lens surface.
(2) Fog interference: To replicate how fog affects image clarity in the greenhouse, a random fog layer was produced. A non-uniform fog was created using a Gaussian blur (kernel size 101 × 101), and the vision loss caused by variations in greenhouse humidity was simulated by superimposing a gray mask (transparency α = 0.4). The attenuating effect of fog on visual contrast is shown in Figure 3.
(3) Strong-light interference: To replicate direct sunlight or a powerful light source, a round, bright spot was placed at a specific location in the image. Weighted fusion (weight 0.2) simulates the halo effect of direct sunlight on the lens by drawing a circular light spot (radius 30–80 pixels) at random. The light-spot area displays typical overexposure characteristics, as illustrated in Figure 4.
(4) Shadow masking: Shadow masking was used to simulate leaf-shadow effects in the image. To replicate the localized reduction in brightness caused by leaf projections, black rectangles with random positions and sizes (transparency 0.3) are drawn. The impact of shadow interference on the fruit recognition region is shown in Figure 5.
Table 3 below displays the quantity of photos of various tomato ripening categories following environmental enhancement. It should be mentioned that a variety of tomato ripening types are present in some of the images. This adds complexity to the data processing and analysis that follows, but it also offers a rich and useful data resource for thoroughly examining the dynamics of the tomato growth process.

2.3. Data Labeling and Segmentation

In this study, tomato ripening image data were annotated using the LabelImg 1.8.6 tool, dividing the data into four categories: flowering and young fruit stage, green growth stage, yellowing and semi-ripe stage, and red ripe stage. Figure 6 displays the labeling interface, and the labeled data are saved in txt format.
This study divides the dataset into training, validation, and test sets in the ratio 70%, 20%, and 10%, as indicated in Table 4 below. The training set is rich in samples to help the model learn data features; the validation set is used for parameter optimization and performance evaluation, which can reveal overfitting or underfitting in time; and the test set, which is separate from the first two, is employed for the final evaluation of the model's performance and generalization capacity so as to ensure an objective evaluation [25].
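A minimal sketch of such a 70/20/10 split, assuming images and label files share stems and that a fixed seed is wanted for reproducibility; the paths and function name are illustrative:

```python
import random
from pathlib import Path

def split_dataset(image_dir, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle image stems once, then cut them into train/val/test lists."""
    stems = sorted(p.stem for p in Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(stems)
    n_train = int(len(stems) * ratios[0])
    n_val = int(len(stems) * ratios[1])
    return {
        "train": stems[:n_train],
        "val": stems[n_train:n_train + n_val],
        "test": stems[n_train + n_val:],   # remainder, approximately 10%
    }

splits = split_dataset("datasets/tomato/images")   # hypothetical path
for name, items in splits.items():
    print(name, len(items))
```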

2.4. YOLOv11 Network Architecture

As shown in Figure 7, the YOLOv11 network architecture has four main parts: the input layer, the backbone layer (the network's feature extraction module), the neck layer (the feature fusion module), and the head layer at the output end [26]. The input side receives the tomato photos that require ripeness checking, and the backbone extracts the most crucial feature information from the images. The neck layer then fuses features of various scales from the backbone to extract deeper and more thorough feature information. Finally, the head layer classifies and localizes the fused features [27] and outputs the tomato ripeness detection results, which serve as the foundation for the subsequent picking decision.

2.5. YOLOv11-SLBA Network Structure

Figure 8 displays the enhanced YOLOv11-SLBA model network structure, which has been enhanced in three ways:
  • Backbone layer improvement: The convolutional kernel sensing field is enlarged and the down-sampling procedure is optimized by substituting the SPPF-LSKA module for the original SPPF module. By capturing a wider range of contextual information while still extracting important image features, this module greatly improves the feature extraction capability for small targets and provides a richer semantic information base for feature fusion and detection tasks later on.
  • Neck layer structure optimization: the two-way propagation mechanism is used to accomplish multi-level feature interaction in this research, which replaces the conventional FPN structure with the BiAttFPN feature fusion module. High-level characteristics offer a semantic context, such as the spatial relationship between fruits, branches, and leaves, while low-level features preserve spatial details, such as the texture of the tomato’s surface. By successfully integrating feature maps at various sizes, this dynamically weighted fusion technique significantly improves feature representation and increases the model’s adaptability in complicated scenarios.
  • Detection head enhancement: Six sets of feature inputs from the neck layer are received by the detection head architecture with the help of DetectAux. The stability of small target detection is greatly enhanced, detection performance in complex environments is greatly enhanced, and the model’s identification accuracy for tomatoes of varying maturity is effectively improved through the joint optimization of multi-level gradient signals.
Figure 8. YOLOv11-SLBA network structure diagram.

2.5.1. Space Pyramid-Large Nucleus Attention SPPF-LSKA

Effective feature extraction and comprehension are essential to model performance in tomato ripeness detection. The SPPF module in the YOLOv11 network is a crucial part of its feature extraction capability: through spatial pyramid pooling, it efficiently captures multi-scale characteristics of the image and provides rich feature support for ripeness recognition. However, redundancy in the SPPF module's processing hinders the model's computational efficiency and running speed when handling massive amounts of tomato image data, which impacts real-time detection and makes it challenging to fulfill production demands [28]. The goal of this research is to build reliable and efficient models for tomato ripening detection. Figure 9 illustrates the introduction of LSKA, a large kernel attention mechanism, to achieve this goal. With the help of large-kernel convolution, this mechanism can better comprehend the shape, color, and other characteristics of the detection target while also capturing long-distance relationships in the image [29]. Accordingly, this paper incorporates the LSKA mechanism into the SPPF module in an accelerated design. This strikes a balance between efficiency and model accuracy and meets the requirements of tomato ripening detection in real complex scenarios: LSKA improves the understanding of tomato features while compensating for the computational redundancy of the SPPF. Let C denote the number of input channels of a given feature map, and let H and W denote the feature map's height and width, respectively. The LSKA output is computed as follows [30]:
$\bar{Z}^{C} = W^{C}_{(2d-1)\times 1} * \left( W^{C}_{1\times(2d-1)} * F^{C} \right)$ (1)

$Z^{C} = W^{C}_{\lceil k/d \rceil \times 1} * \left( W^{C}_{1 \times \lceil k/d \rceil} * \bar{Z}^{C} \right)$ (2)

$A^{C} = W_{1\times 1} * Z^{C}$ (3)

$\bar{F}^{C} = A^{C} \otimes F^{C}$ (4)
In these equations, $*$ and $\otimes$ denote convolution and the Hadamard product, respectively. $\bar{Z}^{C}$ is used to construct spatial features with a large receptive field, and $W^{C}_{(2d-1)\times 1}$ is a one-dimensional learnable convolution kernel; the inner depthwise convolution in Equation (1), with kernel size 1 × (2d − 1), captures local spatial information and compensates for the grid effect of the subsequent dilated depthwise convolutions. $Z^{C}$ is the output of the dilated depthwise convolutions applied to $\bar{Z}^{C}$, with kernel sizes $\lceil k/d \rceil \times 1$ and $1 \times \lceil k/d \rceil$, where k represents the receptive field of the kernel W and d represents the dilation rate; this step integrates longer-range spatial information. As shown in Equation (3), $A^{C}$ maps the spatial features $Z^{C}$ to the corresponding attention map, obtained with a 1 × 1 convolution, which characterizes the importance of each spatial position. Finally, Equation (4) takes the Hadamard product of the attention map and the input feature map to produce the output $\bar{F}^{C}$.
By combining the spatial pyramid pooling (fast) module with the large separable kernel attention module, the SPPF-LSKA module enlarges its receptive field and increases its capacity to attend to the target. As a result, the module can adapt to targets of varied sizes without additional computation by pooling feature maps at different scales and extracting richer feature information. The idea is illustrated in Figure 10.
It is noted that the dimensions of the input feature map are (C, H, W). To create a feature map with C/2 channels, the channels are first compressed using a 1 × 1 convolution. After that, three layers of maximum pooling operations are applied to the compressed feature map in order to extract spatial features under various receptive fields. A merged feature map with dimensions (2C, H, W) is then obtained by concatenating the output of the three pooling layers with the original feature map in the channel dimension.
To enhance the model’s ability to understand key semantic information such as the global structure, color, and texture of images, the LSKA module was introduced to the concatenated features. LSKA uses large kernel convolutions to model long-range dependencies in the spatial dimension, effectively expanding the receptive field range and improving the model’s ability to focus attention on the target area. Finally, the channel count is adjusted to the target channel count using 1 × 1 convolutions, completing the feature transformation output of the SPPF-LSKA module.
This module not only inherits the multi-scale feature fusion advantages of the original SPPF, but also introduces a long-range modeling mechanism, effectively alleviating the problems of limited receptive fields and redundant local calculations in complex scenarios that are common in traditional SPPF. The experimental results show that the introduction of the SPPF-LSKA module improves model detection accuracy while maintaining good inference efficiency, providing a structural guarantee for the accurate identification of tomato ripeness. The detailed process is shown in Figure 11.
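To make the data flow concrete, the following is a minimal PyTorch sketch of an SPPF-LSKA block following the description above and Equations (1)–(4); the kernel size k and dilation d are illustrative choices (picked so that padding preserves spatial size), not the exact hyperparameters of the published module.

```python
import torch
import torch.nn as nn

class LSKA(nn.Module):
    """Large Separable Kernel Attention: cascaded 1-D depthwise convolutions
    approximate a large k x k kernel at a fraction of the cost (Eqs. (1)-(4))."""
    def __init__(self, channels, k=15, d=3):
        super().__init__()
        kd = -(-k // d)                     # ceil(k/d); choose k, d so kd is odd
        p0 = (2 * d - 1) // 2               # 'same' padding for the local kernels
        p1 = (kd // 2) * d                  # 'same' padding for the dilated kernels
        self.h0 = nn.Conv2d(channels, channels, (1, 2 * d - 1), padding=(0, p0), groups=channels)
        self.v0 = nn.Conv2d(channels, channels, (2 * d - 1, 1), padding=(p0, 0), groups=channels)
        self.h1 = nn.Conv2d(channels, channels, (1, kd), padding=(0, p1), dilation=(1, d), groups=channels)
        self.v1 = nn.Conv2d(channels, channels, (kd, 1), padding=(p1, 0), dilation=(d, 1), groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)      # 1x1 conv -> attention map

    def forward(self, x):
        a = self.v0(self.h0(x))             # Eq. (1): local spatial context
        a = self.v1(self.h1(a))             # Eq. (2): dilated long-range context
        a = self.pw(a)                      # Eq. (3): attention map A^C
        return a * x                        # Eq. (4): Hadamard product with input

class SPPF_LSKA(nn.Module):
    """SPPF with LSKA applied to the concatenated multi-scale features."""
    def __init__(self, c_in, c_out, pool_k=5):
        super().__init__()
        c_mid = c_in // 2
        self.reduce = nn.Conv2d(c_in, c_mid, 1)         # compress channels to C/2
        self.pool = nn.MaxPool2d(pool_k, stride=1, padding=pool_k // 2)
        self.lska = LSKA(c_mid * 4)                     # acts on the (2C, H, W) concat
        self.project = nn.Conv2d(c_mid * 4, c_out, 1)   # back to the target channels

    def forward(self, x):
        x = self.reduce(x)
        p1 = self.pool(x)                   # three successive max-pooling stages
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        y = torch.cat([x, p1, p2, p3], 1)   # 4 x C/2 = 2C channels
        return self.project(self.lska(y))

# Shape check: (1, 256, 20, 20) -> (1, 256, 20, 20)
print(SPPF_LSKA(256, 256)(torch.randn(1, 256, 20, 20)).shape)
```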

2.5.2. Bidirectional Attention Feature Pyramid Network BiAttFPN

The accurate and effective fusion of multi-scale features is crucial for improving detection performance in tomato ripening detection tasks, particularly when dealing with complicated environmental variables. In real-world application scenarios, tomato fruits typically exhibit a multi-scale distribution and are obstructed by intricate illumination, branch and leaf shade, and other factors; for example, large fruits in the near view may coexist with small fruits in the far view. The conventional feature pyramid network (FPN) uses a top-down topology to produce feature maps with different spatial resolutions but the same number of channels, which partially achieves an initial fusion of multi-scale features. Nevertheless, this structure places restrictions on how well multi-scale features can be fused.
To improve the effect of feature fusion, the path aggregation network PAN is widely used in YOLO detectors. It introduces bottom-up paths on top of the top-down architecture of FPN and propagates fine-grained features from lower to higher levels through iterative down-sampling [31]. However, PAN is less effective for small-target detection, and the increased parameter and computational overheads associated with feature-dimension upgrading affect the generalizability of the model.
Subsequently, BiFPN, a plug-and-play multi-scale feature fusion network, emerged, which further promotes feature interaction and fusion. BiFPN fuses the output structural and texture features, adopts skip connections to prevent semantic corruption during fusion, and facilitates the perception and interaction of texture and structural feature information with the help of context learning, which enhances the correlation between local features of the study object while maintaining overall consistency [32]. However, BiFPN still has room for optimization, such as a possible feature-dilution problem, and its fusion of small-target ripeness features in complex environments needs improvement. The structures of the FPN, PAN, and BiFPN networks are shown in Figure 12.
In light of this, this article proposes a novel bidirectional attention feature pyramid network, BiAttFPN, as shown in Figure 13, for the task of tomato ripening detection in complex environmental conditions. BiAttFPN aims to achieve more efficient and accurate feature fusion to enhance detection performance by improving the feature fusion method, the network structure design, and the related formulas. BiAttFPN improves on BiFPN in the following respects:
(1) Enhancement of the feature fusion technique: BiAttFPN normalizes the fusion features with channel adjustment using a Concat+node_mode combination, which activates denser feature interactions. Multiple Concat operations allow the same layer to receive features from various scales. Simultaneously, redundant information is reduced to guarantee feature-reuse efficiency while lowering computation, in order to prevent the potential feature dilution in BiFPN.
(2) Network structure improvement: a multi-level backtracking connection is used, in which higher-level features interact with bottom-level features multiple times; the number of module repetitions is optimized to increase the fusion paths; and hierarchical progressive fusion is used, in which the P4 feature pyramid is built first and fusion then proceeds gradually upwards. Since P2–P4 progressive fusion preserves more shallow detailed features, this improvement is better suited to identifying the maturity of small, early-stage tomatoes. Reducing the number of repetitions of particular modules while preserving performance through denser connections further maximizes computational efficiency.
Figure 13. Diagram of BiAttFPN network structure.
(3) Improvement of the related formulas: the Concat used by BiAttFPN splices features along the channel dimension as follows:
$O_{\mathrm{concat}} = [F_1, F_2, \ldots, F_n]$ (5)

$O = \sum_{i=1}^{n} \omega_i \cdot \mathrm{Conv}(F_i)$ (6)

$\omega_i = \dfrac{e^{\alpha_i}}{\sum_{j=1}^{n} e^{\alpha_j} + \varepsilon}$ (7)
where $F_i$ denotes the input features, $\mathrm{Conv}(F_i)$ denotes the 1 × 1 convolution that first unifies the channel counts of the features, $\alpha_i$ denotes the learnable weight parameters, and $\varepsilon$ is an extremely small constant that avoids numerical instability.
This network accomplishes effective information fusion and detail retention by including cross-layer connections, attention mechanisms, and a combination of top-down and bottom-up feature propagation channels. The backbone network sends BiAttFPN four scale feature maps, P1, P2, P3, and P4, which represent varying levels of spatial resolution and semantic abstraction. The feature fusion procedure transmits semantic information layer by layer by concatenating lower-level features with high-level features that have first been up-sampled via a top-down path. To maintain semantic continuity, P1's features are first up-sampled and blended with P2, and then the P3 and P4 features are added. To improve the specificity of feature expression, each fusion point simultaneously employs a channel splicing operation that collaborates with the attention module to dynamically modify the importance of each channel.

After top-down fusion is finished, BiAttFPN further creates a bottom-up information feedback channel. Layer by layer, the fused low-level features are down-sampled and returned to the upper levels, where they are fused once more with the higher levels' existing semantic information. By reinforcing the expression of low-level elements such as structural edges and textures in the final output, this procedure greatly enhances the model's capacity to identify distant, tiny tomato fruits.

BiAttFPN also addresses the drawbacks of conventional feature pyramid structures, where the up and down paths are disconnected and information transmission is constrained, by creating multiple cross-scale jump connections to prevent information silos and enable bidirectional interaction between features at various levels. To achieve adaptive weighting and integration of the multi-source fusion features after channel splicing, the entire network implements an attention adjustment mechanism. This effectively mitigates the feature dilution issue that arises in BiFPN and inhibits the diffusion of redundant information. The stronger semantic expression and detail retention of the final multi-scale feature map provide a robust basis for tomato ripeness assessment in challenging settings.
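As a concrete illustration, the following is a minimal PyTorch sketch of a single BiAttFPN fusion point implementing Equations (5)–(7), with a simple channel-attention gate standing in for the attention module; the class and parameter names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BiAttFusionNode(nn.Module):
    """One BiAttFPN fusion point: 1x1 convs unify channels (Conv(F_i) in Eq. (6)),
    softmax-normalized learnable weights combine the scales (Eq. (7)), and a
    lightweight channel-attention gate reweights the fused map."""
    def __init__(self, in_channels_list, out_channels, eps=1e-4):
        super().__init__()
        self.eps = eps
        self.unify = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels_list)
        self.alpha = nn.Parameter(torch.zeros(len(in_channels_list)))  # learnable alpha_i
        self.gate = nn.Sequential(                  # squeeze-and-excitation style gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, out_channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels // 4, out_channels, 1), nn.Sigmoid(),
        )

    def forward(self, feats):
        # feats: feature maps already resampled to a common spatial size
        w = torch.exp(self.alpha)
        w = w / (w.sum() + self.eps)                 # omega_i of Eq. (7)
        fused = sum(wi * conv(f) for wi, conv, f in zip(w, self.unify, feats))  # Eq. (6)
        return fused * self.gate(fused)              # attention reweighting

# Example: fuse an up-sampled P3 with P2 (channel counts are illustrative)
p2, p3 = torch.randn(1, 64, 80, 80), torch.randn(1, 128, 40, 40)
p3_up = nn.functional.interpolate(p3, scale_factor=2, mode="nearest")
print(BiAttFusionNode([64, 128], 128)([p2, p3_up]).shape)  # (1, 128, 80, 80)
```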

2.5.3. Auxiliary Detection Head DetectAux

When determining the maturity of tomatoes in intricate settings, the accurate and comprehensive extraction of tomato image features is a key prerequisite for guaranteeing detection accuracy. However, in the actual feature extraction process, owing to the diverse morphology of tomato fruits and the complexity of the scene (e.g., occlusion and uneven illumination), image features are easily missed, which leads the model to judge tomato ripeness inaccurately and affects the overall detection accuracy. It also makes the model less generalizable to other situations.
Based on this, this paper adds an auxiliary detection head, DetectAux, to the network to meet the tomato ripeness detection needs, as shown in Figure 14.
In order to improve the model’s overall expressive ability while accounting for deployment efficiency, the structure design is based on four dimensions: multi-scale feature completion, model robustness enhancement, training efficiency optimization, and computing resource control.
(1) Systematic addition of multi-scale features: At varying stages of ripeness and from different capture angles, tomatoes show notable variations in size and texture. Small-target features and shallow details may be overlooked because traditional detection networks use only a small number of primary detection heads. To improve the model's perception of multi-scale targets and its fine-grained feature expression, this study constructs six sets of DetectAux heads connected to feature maps of various scales.
(2) Increased robustness in complicated environments: Interference factors such as occlusion and lighting changes, which are frequently present in real-world agricultural contexts, can readily impact the model's perception of important target areas. To obtain discriminative information from various "receptive field perspectives", the DetectAux module introduces multiple feature pathways in parallel. This redundant supervision greatly improves the model's stability and robustness under partial occlusion and complex backgrounds.
(3) Convergence acceleration and supervised reinforcement in the early training phases: DetectAux also offers extra supervised signals during training, creating a multi-path gradient propagation mechanism that reduces gradient vanishing and boosts parameter-update efficiency. In the early phases of training, the independent losses produced by the multiple detection heads guide the backbone network to collect target features more quickly, reducing training time and enhancing overall learning quality.
(4) Resource management and optimization of computational efficiency: The six DetectAux groups maintain a lightweight structural design despite increasing model complexity (a training/inference sketch follows this list). Each detection head uses small convolution kernels and shallow channels, requiring fewer parameters than the primary detection head. Even though the total parameter count increases, it remains appropriate for embedded devices and mid-range GPUs. Furthermore, all six DetectAux groups are activated during the training phase to improve learning capability; during inference, only the primary detection head is kept and the DetectAux outputs are turned off, preventing any extra computational load at deployment. DetectAux's modular architecture allows flexible enabling and disabling, making platform switching easy. In addition, DetectAux is inherently compatible with GPU/TPU concurrency optimization and supports parallel computing. Techniques such as gradient checkpointing or mixed-precision training can be employed to further minimize resource usage in the event of a memory bottleneck.
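A minimal PyTorch sketch of this train-only auxiliary-head pattern, assuming per-scale feature maps from the neck; the head depths and channel widths are illustrative, not the published configuration.

```python
import torch
import torch.nn as nn

class DetectWithAux(nn.Module):
    """Main detection heads plus lightweight DetectAux heads. The aux heads add
    extra supervised outputs (and gradient paths) during training only, so the
    inference graph and deployment cost stay those of the main heads."""
    def __init__(self, channels_list, num_outputs):
        super().__init__()
        self.main_heads = nn.ModuleList(
            nn.Conv2d(c, num_outputs, 1) for c in channels_list)
        self.aux_heads = nn.ModuleList(          # shallower and narrower than main
            nn.Sequential(
                nn.Conv2d(c, c // 2, 3, padding=1), nn.SiLU(),
                nn.Conv2d(c // 2, num_outputs, 1))
            for c in channels_list)

    def forward(self, feats):
        main = [h(f) for h, f in zip(self.main_heads, feats)]
        if self.training:                        # aux branch: training only
            aux = [h(f) for h, f in zip(self.aux_heads, feats)]
            return main, aux                     # both terms enter the loss
        return main                              # inference: main heads only

# Example with two neck scales and 9 outputs (4 ripeness classes + box terms)
head = DetectWithAux([128, 256], num_outputs=9)
feats = [torch.randn(1, 128, 40, 40), torch.randn(1, 256, 20, 20)]
head.train()
main_out, aux_out = head(feats)                  # training: both branches
head.eval()
main_only = head(feats)                          # inference: main branch alone
```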

3. Results and Analysis

The performance of the tomato ripeness detection model is thoroughly examined in this section. By elaborating the experimental environment, setting scientific performance evaluation indexes, deeply analyzing the convergence characteristics of model training, carrying out comparison experiments with mainstream methods, exploring the role of key modules in the ablation experiments, and visually demonstrating the model detection effect, we comprehensively assess the model’s ability to detect tomato ripening in complex environments.

3.1. Experimental Environment and Performance Evaluation Index

The experimental setup of this study was based on deep learning techniques and high-performance computing resources, configured as follows: the PyTorch (2.0.1+cpu) framework was used, with an NVIDIA Tesla V100S-PCIE-32 GB GPU (Volta architecture, 32 GB video memory, support for mixed-precision computation and CUDA acceleration), GPU driver version 545.23.08, and CUDA version 12.3. The CPU was an Intel i5-8250U with a 1.60 GHz base frequency and 1.80 GHz maximum turbo frequency, with four cores and eight threads.
In this paper, precision (P), recall (R), average precision (mAP50), and F1-score were selected as evaluation metrics [33]. P represents the proportion of detections of a given category that are correct, and R represents the proportion of tomatoes of that category in the test set that are correctly detected. mAP50 is used to evaluate the accuracy of target detection: the AP values of the tomato categories at the different maturity levels are summed and then divided by the number of categories, giving the average accuracy over all categories in the tomato dataset. The F1-score is the harmonic mean of P and R; a larger value indicates better model performance. mAP is computed by Equation (8), the F1-score by Equation (9), P by Equation (10), and R by Equation (11).
$\mathrm{mAP} = \dfrac{1}{N} \sum_{i=1}^{N} AP_i$ (8)

$F1\text{-}score = \dfrac{2 \times P \times R}{P + R}$ (9)

$P = \dfrac{TP}{TP + FP}$ (10)

$R = \dfrac{TP}{TP + FN}$ (11)
where N is the number of categories (in this study, N = 4), True Positive (TP) is the number of correctly detected positive samples, False Positive (FP) is the number of negative samples incorrectly detected as positive, and False Negative (FN) is the number of positive samples that were not detected. In this study, TP indicates the number of tomato samples in a category that were accurately identified as belonging to that category, while FP indicates the number of samples mistakenly identified as belonging to that category when they actually belong to other categories or the background [34].
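A minimal Python sketch of Equations (8)–(11), with hypothetical TP/FP/FN counts and per-class AP values used purely for illustration:

```python
def precision(tp, fp):
    return tp / (tp + fp)                          # Eq. (10)

def recall(tp, fn):
    return tp / (tp + fn)                          # Eq. (11)

def f1_score(p, r):
    return 2 * p * r / (p + r)                     # Eq. (9)

def mean_ap(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)   # Eq. (8), N = len(ap_per_class)

# Hypothetical counts for one ripeness class and APs for N = 4 classes:
p, r = precision(tp=835, fp=73), recall(tp=835, fn=165)
print(f"P={p:.3f}  R={r:.3f}  F1={f1_score(p, r):.3f}  "
      f"mAP={mean_ap([0.95, 0.92, 0.90, 0.88]):.3f}")
```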

3.2. Convergence Analysis

In the model training and evaluation work based on the tomato maturity dataset, both the YOLOv11 baseline model and the improved YOLOv11-SLBA model completed 200 rounds of training. To analyze the differences in convergence behavior between the two in depth, this paper presents a detailed comparative analysis of the loss function curves and performance indicators against the number of training rounds; the convergence curves of the main performance indicators are shown in Figure 15.
In the course of the training, the performance of both networks on the tomato ripeness dataset shows that the loss values are maintained at a low level, which indicates that both can fit the dataset well and have some learning ability. However, a closer look at the loss function curves reveals that YOLOv11-SLBA shows superior performance in both category loss and feature point loss. Specifically, during the whole training cycle, the category loss and feature point loss curves of YOLOv11-SLBA are always below the corresponding loss curves of YOLOv11, and this advantage becomes more and more significant as the number of training rounds increases. This fully indicates that YOLOv11-SLBA has effectively improved the model structure or training strategy, which enables it to more accurately capture the information of different ripening categories and key feature points in the tomato images, and thus reduces the prediction error of the model in these aspects.
Combining the comparative analysis of the loss function curves with the dataset results, it can be clearly concluded that YOLOv11-SLBA has better convergence performance. Its reduction in category loss and feature point loss, its faster convergence on the validation set, and its metrics fully prove that the model learns data features more efficiently in the tomato ripeness detection task, quickly reaches a stable and excellent performance state, and provides more reliable technical support for the practical application of tomato ripeness detection.

3.3. Comparison Experiment

3.3.1. Comparison of Model Performance on Self-Built Datasets

A number of comparative tests were carried out in this work to thoroughly assess the YOLOv11-SLBA model's performance on the self-constructed tomato dataset for the tomato ripening detection task. Representative mainstream models in the current target detection field were selected, including Faster R-CNN [35], based on the two-stage detection paradigm; SSD [36], known for fast single-stage detection; RT-DETR [37], optimized for real-time detection; and the YOLO series of models [38], which excel in lightweight and efficient detection, such as YOLOv7, YOLOv8, and YOLOv11, as shown in Table 5.
Under identical experimental conditions, the models' performance on the self-constructed dataset differs considerably. Although the traditional two-stage detector Faster R-CNN has a higher recall (88.4%), its 41.12 M parameters and 657 MB memory footprint make deployment inefficient, and it lags significantly behind the single-stage models in precision (60.5%) and F1-score (71.8%). The SSD model improves inference speed (22.5 ms) by streamlining the parameter count (24.01 M), but its 85.7% mAP50 and 78.4% recall indicate that its ability to discriminate complex features is still limited. RT-DETR, a real-time target detection model built on the Transformer architecture, achieves efficient end-to-end detection with a hybrid encoder and query de-redundancy design; its precision (91.2%) and mAP50 (88.6%) outperform the traditional methods, but with 19.9 M parameters and a 307.4 MB memory occupation, its inference time (96.3 ms) still fails to meet real-time demands.
Among the YOLO series models, YOLOv7 achieved high recall (84.9%) and mAP50 (90.2%) with 36.5 M parameters, but its 148 MB memory footprint and 5.8 ms inference time leave room for optimization. The lightweight models YOLOv8 and YOLOv11 significantly reduce the parameter counts (3 M and 2.8 M) and memory footprints (5.8 MB and 6.9 MB) through structural improvements, but perform similarly in precision (90.1% and 88.6%) and F1-score (87.2% and 85.9%), suggesting that a purely lightweight design may sacrifice some detection stability.
The improved YOLOv11-SLBA model has the best overall performance among all the compared models: it achieves 92.0% precision, 83.5% recall, 91.3% mAP50, 64.6% mAP50-95, and an 87.5% F1-score with only 2.72 M parameters and a 10.9 MB memory footprint; its inference speed (2.3 ms) improves by about 60%, and the F1-score improves by 1.6 percentage points over the baseline YOLOv11. This result verifies that the SLBA modules effectively balance computational efficiency and detection accuracy while enhancing feature extraction capability, providing a better solution for real-time maturity detection on agricultural embedded devices.

3.3.2. Verification of Generalization Capabilities on Public Datasets

The tomato-ripeness1 [39] dataset, which is publicly accessible in Roboflow Universe, was used to assess the YOLOv11-SLBA model’s generalization performance. The dataset contains 3154 images covering tomatoes of different ripeness levels. Compared with the self-constructed dataset in this study, this dataset has differences in image acquisition environment and labeling standards, which can effectively evaluate the model’s adaptability on unknown data, as shown in Table 6.
The performance variations among the various models are further confirmed by experiments on the public dataset. The traditional detector Faster R-CNN shows a serious precision deficit, with 55.6% precision and a 70.6% F1-score, although its recall is as high as 96.9%. SSD and RT-DETR perform moderately well in recall (91.4% and 86.5%) and mAP50 (87.1% and 89.7%) but are limited by large computational overheads (26 ms and 176.9 ms). In the YOLO series, YOLOv7 and YOLOv8 strike different balances between speed and precision, while YOLOv11 achieves 92.7% mAP50 with 2.58 M parameters, demonstrating the potential of lightweight design.
In the end, the improved YOLOv11-SLBA model takes the overall lead with 93.7% mAP50 and 91.5% recall while maintaining efficient inference (2.5 ms). The observed disparity between precision (78.6%) and recall (91.5%) stems from the model's design priorities for tomato ripeness detection scenarios: (1) Agricultural application requirements: in actual harvesting operations, missing ripe tomatoes (false negatives) causes direct economic loss, while false positives can be corrected by subsequent manual sorting, which justifies our focus on recall optimization. (2) Data characteristics: the public dataset contains dense tomato clusters with heavy occlusions (see Figure 16), where conservative detection thresholds naturally increase false positives. The model's stable performance across datasets fully demonstrates the generalization ability and usefulness of the SLBA modules.

3.3.3. Stability Analysis of Improved Models on Heterogeneous Datasets

In order to deeply verify the robustness and stability of the YOLOv11-SLBA model under different data distribution conditions, this section carries out a comparative analysis of the model’s performance on the self-constructed tomato dataset and the publicly available dataset, as shown in Table 7.
The comparative experimental analysis of the YOLOv11-SLBA model on the self-constructed dataset and the public dataset verifies that the model has good generalization ability. The model shows differentiated performance characteristics on the two types of datasets: on the public dataset, it exhibits a higher recall but relatively low precision, indicating that its coverage of the target objects is more comprehensive but accompanied by more false detections; on the self-constructed dataset, precision improves significantly to 92%, but recall drops to 83.5%, reflecting stricter discrimination of positive samples. Notably, the model's mAP50 remains stable on both datasets and the composite F1-score exceeds 84% despite the differences in data distribution, which validates the effectiveness of the improved model design and its stable performance in different environments.

3.4. Ablation Experiment

In order to verify the gains brought by the three optimization strategies (the SPPF-LSKA, BiAttFPN, and DetectAux modules), this paper completed the following ablation experiments in the same experimental environment on the self-constructed tomato ripening dataset; the results are shown in Table 8 below.
As the table shows, the YOLOv11-SLBA model outperforms the baseline YOLOv11 on all indexes. In-depth analysis of the ablation results makes clear that the YOLOv11-SLBA model significantly surpasses the baseline on all key indexes through the organic integration of the three innovative modules: SPPF-LSKA, BiAttFPN, and DetectAux. Specifically, the BiAttFPN module, with its bidirectional attention mechanism, improves precision to 92.5%, an absolute increase of 3.9 percentage points over the baseline; however, recall declines slightly by 2.3% in this process, possibly because the attention mechanism over-suppresses some difficult samples. The SPPF-LSKA module significantly enhances the ability to capture chromaticity features through its large-kernel spatial attention, pushing mAP50 up by 0.6% to 90.4%. The DetectAux auxiliary-training module, through its multi-scale supervision mechanism, improves the F1-score by 1.0%.

When the three modules work in concert, the YOLOv11-SLBA model achieves a comprehensive and balanced improvement: while maintaining the high-precision advantage of BiAttFPN, and with the compensating effects of the other modules, it stabilizes recall at 83.5% and improves mAP50 and F1-score by 1.5% and 1.6%, reaching 91.3% and 87.5%, respectively. Overall, this multi-module collaborative optimization successfully breaks through the performance bottleneck of target detection in complex agricultural environments through the reinforced attention mechanism, enhanced feature capture, and optimized training strategy, providing an innovative and practical technical solution for real-time precision detection.

3.5. Evaluation of Testing Effectiveness

This section focuses on model performance evaluation, comparing model performance before and after the improvement to show the effect of the changes. The improved model is also compared with other models to analyze its performance advantages and disadvantages from different perspectives, highlighting directions for subsequent research.

3.5.1. Comparison of Model Performance Before and After Improvement

The original YOLOv11 model and the improved YOLOv11-SLBA model were evaluated using precision, recall, and mean average precision; the results are shown in Figure 17. Figure 18 shows the confusion matrices of the two models.
According to the findings, the enhanced YOLOv11-SLBA model performs noticeably better in the tomato ripeness detection test. The training curves show that compared with the baseline YOLOv11 model, YOLOv11-SLBA exhibits a steady improvement trend in precision, recall, and mAP50, and the performance advantage of the improved model is more obvious in the middle and late stages of training, which is attributed to the cross-layer attention mechanism introduced by the SLBA modules effectively enhancing the model's ability to extract features at multiple scales. Although there is a short performance adaptation period early in training, the model converges quickly and remains stable as the parameters are optimized, which verifies that the improved scheme significantly improves detection performance while maintaining its lightweight characteristics.

3.5.2. Comparison of Detection Effect of Different Models

To evaluate the detection results of the YOLOv11-SLBA model, it was compared with the Faster R-CNN, SSD, RT-DETR, YOLOv7, YOLOv8, and YOLOv11 models on the tomato ripening detection task; the results are shown in Figure 19. Red circles in the figure indicate fruits that the optimized model correctly recognizes but the baseline model frequently misses, demonstrating how well the improved model detects small targets and finely occluded ones.
The comparison experiments with mainstream detection models such as YOLOv7, YOLOv8, YOLOv11, Faster R-CNN, SSD, and RT-DETR reveal that each model exhibits obvious performance bottlenecks in the tomato ripeness detection task. The YOLO-series models present typical scale-sensitivity defects: YOLOv7 has a high missed-detection rate for small distant targets (<30 pixels in diameter) due to its limited receptive field and is prone to misclassifying morphologically similar branches and leaves as unripe tomatoes; although YOLOv8 and YOLOv11 improve multi-scale detection by increasing the number of feature pyramid tiers, they still miss 15.6% to 18.9% of targets in dense-fruit areas, with confidence scores fluctuating by ±0.12, reflecting unstable discrimination of color-gradient features over long ranges. The two-stage detector Faster R-CNN achieves more complete detection coverage, but its region proposal mechanism produces partially offset detection boxes (IoU < 0.5) in fruit-sticking scenarios, and its sensitivity to key ripeness features is only about 60%, resulting in ripeness misclassification of neighboring fruits. The SSD model mitigates the scale-variance problem through multi-scale feature-map prediction, but under complex background interference (e.g., leaf-shadow mottling and reflections) its shallow feature extraction is insufficient, so its omission rate for small targets is significantly higher than for large ones, showing the limitations of its feature fusion strategy in agricultural scenarios. Although RT-DETR adopts the Transformer architecture for global modeling, its accuracy plummets for fruits with more than 40% occlusion, and its Anchor-Free design suffers localization drift under sudden changes in fruit scale (e.g., large near targets and small far ones), confirming that pure attention methods still have room for optimization in detecting dense agricultural targets.
The YOLOv11-SLBA model can effectively recognize tomato ripening targets in complex environments, including small targets occluded by branches and leaves, transitionally ripe fruits under uneven lighting, and densely stacked tomatoes. Figure 20 shows the detection performance of the improved model under different environments. It accurately detects the vast majority of targets in the scene, significantly reducing the missed and false detections common in traditional methods. By substituting the SPPF-LSKA structure for the SPPF in the backbone network, the model improves feature adaptation to light changes and occlusion interference, allowing the network to concentrate on the key characteristics of tomato targets, as the detection results demonstrate. Meanwhile, the introduced BiAttFPN hierarchical fusion mechanism effectively improves feature retention for small targets under occlusion, and the new DetectAux auxiliary detection head significantly improves the ability to distinguish similar ripeness categories. In summary, the YOLOv11-SLBA model identifies more targets than the other models, detects small targets at greater distances, determines the ripeness of nearby fruits, and copes with branch-and-leaf occlusion, fruit stacking, and other interference, performing better than the other models.

4. Discussion

In this study, the YOLOv11-SLBA model was proposed; by introducing the SPPF-LSKA module, the BiAttFPN feature fusion strategy, and the DetectAux auxiliary detection head, it addresses three key challenges in tomato ripening detection: missed detection of small targets, difficulty in distinguishing transitional ripeness stages, and sensitivity to lighting. The key innovations contribute in different ways:
First, to improve the flow and expression of information between multi-scale features, the BiAttFPN fusion structure developed in this study incorporates a bidirectional attention mechanism. This allows the model to retain a high recall rate even when processing small, densely occluded objects such as tomatoes. Related studies have demonstrated that multi-scale attention mechanisms provide notable performance improvements in agricultural target detection [40]. By effectively preserving the features of small target regions, YOLOv11-SLBA raises recall in occluded scenes and thus substantially mitigates missed detections.
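As a concrete illustration of this design idea, the following minimal PyTorch sketch fuses a top-down and a bottom-up feature map with normalized learnable weights and a channel-attention gate. The class name, layer sizes, and attention form are illustrative assumptions, not the exact BiAttFPN implementation:

```python
import torch
import torch.nn as nn

class BiAttFusion(nn.Module):
    """Minimal sketch of a bidirectional attention-weighted fusion node."""
    def __init__(self, channels: int):
        super().__init__()
        # Channel attention re-weights the fused features before output.
        self.att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Learnable scalar weights for the top-down and bottom-up paths
        # (normalized, BiFPN-style fast fusion).
        self.w = nn.Parameter(torch.ones(2))
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, top_down: torch.Tensor, bottom_up: torch.Tensor) -> torch.Tensor:
        w = torch.relu(self.w)
        w = w / (w.sum() + 1e-4)                       # normalized fusion weights
        fused = w[0] * top_down + w[1] * bottom_up     # weighted sum of both paths
        return self.out_conv(fused * self.att(fused))  # attention-gated output

# Example: fuse two P3-level feature maps of matching shape.
p3_td, p3_bu = torch.randn(1, 64, 80, 80), torch.randn(1, 64, 80, 80)
print(BiAttFusion(64)(p3_td, p3_bu).shape)  # torch.Size([1, 64, 80, 80])
```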
Second, this study added an auxiliary detection head, DetectAux, as a supplemental channel to the main output to assist in identifying the tomato ripeness transition stage. By directing the model to concentrate on the distinctions between highly similar categories in the feature space (such as "half-ripe" and "red-ripe"), this detection head greatly enhances category separability and lowers misclassification at the crucial ripeness stages. Blurred category boundaries are a major problem for deep learning models in multi-stage fruit ripeness identification [41], and the introduction of DetectAux addresses this difficulty well: the auxiliary branches supply additional semantic supervision that strengthens the main detection head's ability to differentiate fine-grained categories.
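This supervision scheme can be summarized with a short, hypothetical PyTorch sketch: an auxiliary classification branch shares the backbone features, contributes a weighted loss term during training, and is skipped at inference. The module names and the 0.25 loss weight are assumptions, not the paper's exact DetectAux configuration:

```python
import torch
import torch.nn as nn

class WithAuxHead(nn.Module):
    """Sketch of attaching an auxiliary head for extra semantic supervision."""
    def __init__(self, backbone: nn.Module, feat_ch: int, n_classes: int):
        super().__init__()
        self.backbone = backbone
        self.main_head = nn.Conv2d(feat_ch, n_classes, 1)  # main per-cell class logits
        self.aux_head = nn.Conv2d(feat_ch, n_classes, 1)   # auxiliary branch, train-time only

    def forward(self, x):
        f = self.backbone(x)
        if self.training:
            return self.main_head(f), self.aux_head(f)
        return self.main_head(f)  # auxiliary branch adds no inference cost

def total_loss(main_logits, aux_logits, target, criterion=nn.CrossEntropyLoss()):
    # The auxiliary term nudges the shared features toward class-separable
    # representations (e.g., half-ripe vs. red-ripe) without dominating training.
    return criterion(main_logits, target) + 0.25 * criterion(aux_logits, target)

# Toy usage with a stand-in backbone.
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
model = WithAuxHead(backbone, feat_ch=16, n_classes=4)
main_out, aux_out = model(torch.randn(2, 3, 64, 64))
```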
Third, to address light sensitivity, the SPPF-LSKA module incorporates a large-kernel attention mechanism, improving the model's capacity to examine local image features under various lighting conditions. Common problems in agricultural production, such as backlighting, highlights, and shadows, can degrade image quality; this module enhances robustness in such situations while preserving the model's lightweight design. Compared with the traditional SPPF structure, the introduction of LSKA enables the model to maintain high feature stability under highly dynamic lighting, which is consistent with the HDR-enhanced detection approach proposed by Purohit et al. [42].
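For orientation, the sketch below shows the general form of Large Separable Kernel Attention (separable depthwise convolutions approximating a large kernel, used as a multiplicative attention map) and its placement after SPPF pooling. The kernel sizes and dilations follow the commonly published LSKA configuration and may differ from the exact settings used in this study:

```python
import torch
import torch.nn as nn

class LSKA(nn.Module):
    """Sketch of Large Separable Kernel Attention."""
    def __init__(self, ch: int):
        super().__init__()
        self.dw_h = nn.Conv2d(ch, ch, (1, 5), padding=(0, 2), groups=ch)
        self.dw_v = nn.Conv2d(ch, ch, (5, 1), padding=(2, 0), groups=ch)
        # Dilated separable pair enlarges the receptive field cheaply.
        self.dwd_h = nn.Conv2d(ch, ch, (1, 7), padding=(0, 9), dilation=3, groups=ch)
        self.dwd_v = nn.Conv2d(ch, ch, (7, 1), padding=(9, 0), dilation=3, groups=ch)
        self.pw = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        attn = self.pw(self.dwd_v(self.dwd_h(self.dw_v(self.dw_h(x)))))
        return x * attn  # large-kernel attention gates the input features

class SPPF_LSKA(nn.Module):
    """SPPF pooling pyramid followed by LSKA on the concatenated features."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        c_hid = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hid, 1)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.lska = LSKA(c_hid * 4)
        self.cv2 = nn.Conv2d(c_hid * 4, c_out, 1)

    def forward(self, x):
        y0 = self.cv1(x)
        y1 = self.pool(y0)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        cat = torch.cat([y0, y1, y2, y3], dim=1)  # multi-scale pooled pyramid
        return self.cv2(self.lska(cat))           # attention before projection

print(SPPF_LSKA(256, 256)(torch.randn(1, 256, 20, 20)).shape)  # (1, 256, 20, 20)
```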
To replicate the difficulties brought on by environmental changes in actual farming, this study also created an augmented dataset containing a variety of interference factors (spray, fog, strong light, and occlusion). The experimental findings show that the YOLOv11-SLBA model outperforms the baseline model on the augmented data in terms of robustness and detection accuracy, particularly in scenes with complicated illumination and ambiguous maturity. This confirms that upgrading the model structure alone is insufficient for real-world agricultural applications; environmental simulation and data diversity are also crucial to performance.
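A minimal example of how such disturbances can be synthesized with OpenCV is given below; the blending strengths, gain value, and file paths are illustrative assumptions, not the exact augmentation pipeline of this study:

```python
import numpy as np
import cv2

def add_fog(img: np.ndarray, strength: float = 0.5) -> np.ndarray:
    """Blend the image toward a white haze layer to simulate fog."""
    fog = np.full_like(img, 255)
    return cv2.addWeighted(img, 1.0 - strength, fog, strength, 0)

def add_bright_light(img: np.ndarray, gain: float = 1.6) -> np.ndarray:
    """Scale pixel intensities (with clipping) to mimic strong illumination."""
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)

img = cv2.imread("tomato.jpg")             # input path is illustrative
aug = add_fog(add_bright_light(img), 0.3)  # stack several disturbances
cv2.imwrite("tomato_fog_light.jpg", aug)
```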
The experimental findings demonstrate the usefulness of the enhanced model. The YOLOv11-SLBA model's mAP50 on the self-built dataset is 91.3%, a 1.5% improvement over the baseline YOLOv11. Meanwhile, the model also shows good generalization on public datasets, with a mAP50 of 93.7%, demonstrating wide applicability to data from different sources.
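As a consistency check on the reported metrics, the F1-score follows directly from the precision and recall on the self-built dataset: F1 = 2PR/(P + R) = (2 × 0.920 × 0.835)/(0.920 + 0.835) ≈ 0.875, matching the reported 87.5%.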
Analyzing the roles of the model components in depth, we find that the BiAttFPN structure significantly enhances the model's ability to adapt to occlusion scenarios by virtue of its bidirectional attention mechanism. In real agricultural production, tomato plants commonly shade each other [43], and this improvement enables the model to recognize tomato ripeness more accurately in such complex situations. The SPPF-LSKA module, in turn, effectively enhances the model's feature discrimination under uneven lighting, which is particularly evident when detecting tomatoes at the color-change stage, where detection accuracy improves over traditional methods. Determining ripeness at this stage is crucial for picking timing, so this advantage of the model provides strong support for actual production.
Compared with existing studies, the YOLOv11-SLBA model achieves a strong balance of accuracy and speed while remaining lightweight: its parameter count is only 2.72 M and its inference speed reaches 2.3 ms/frame, fully meeting the demands of real-time detection in the field. However, we also note that the model's performance fluctuates under extreme backlight, with a drop in detection accuracy, indicating considerable room for improvement in light invariance. Subsequent studies could therefore combine multispectral data or physical imaging models to further optimize the feature extraction module and enhance adaptability to varied lighting conditions [44]. Meanwhile, exploring methods such as knowledge distillation to reduce the model's dependence on labeled data is another important direction for improving its utility [45]. These improvements will help promote the large-scale application of tomato ripening detection technology in actual agricultural production and support the intelligent development of agriculture.
In addition, most current deep learning-based tomato ripeness detection studies are limited to recognition under a single growing environment, ignoring the significant impact of complex and changing environmental factors on detection in real production. To address this, this study constructed an augmented dataset containing multiple disturbance factors such as spray, fog, bright light, and shading. The experimental results show that the YOLOv11-SLBA model trained with environmental enhancement improves detection accuracy under simulated disturbance conditions compared with the baseline model, especially for tomatoes at the color-change stage. This clearly demonstrates that in practical agricultural applications it is not enough to optimize the model structure; the effects of environmental variables on the detection system must also be fully considered.
Building on this study, future research should further expand the coverage of environmental interference factors and establish datasets closer to real production scenes [46]. This is of considerable importance for promoting the practical application of intelligent tomato-picking technology, and this complex-environment-oriented research approach also provides a useful reference for growth monitoring of other crops. We plan to develop more precise and effective intelligent agricultural detection technology and to promote the advancement of modern agriculture by continually refining the model and expanding the dataset.

5. Conclusions

In view of the demand for tomato ripeness detection under complex environmental conditions, this study not only self-constructed a dataset of tomato images with different ripeness levels but also proposed YOLOv11-SLBA, a model specifically optimized for agricultural challenges such as occlusion, subtle color transitions, and lighting variations. The key modifications—replacing SPPF with SPPF-LSKA in the backbone, integrating BiAttFPN in the neck, and adding DetectAux heads—were designed to address three critical limitations in existing methods:
  • SPPF-LSKA enhances spectral sensitivity, particularly in distinguishing fine color differences (e.g., green–yellow and red ripe stages) under uneven lighting. This is evidenced by a 0.6% mAP50 improvement in complex conditions, reducing the misclassification of transitional ripeness levels.
  • BiAttFPN improves feature retention for occluded tomatoes, a common scenario in dense foliage or clustered fruit. The bidirectional attention mechanism raised precision (P) by 3.9% on small targets (<50 × 50 pixels), keeping it above 92% and demonstrating robustness against partial occlusion.
  • DetectAux refines ripeness grading accuracy by leveraging multi-scale feature distillation. The six-head architecture reduced false positives between visually similar stages, critical for harvesting robots requiring stage-specific actions.
Experimental results on both our custom dataset and the public tomato-ripeness1 benchmark validate the model’s superiority:
  • High accuracy under variability: Achieved 91.3% mAP50 and 87.5% F1-score despite occlusion/lighting noise, outperforming YOLOv11 by 1.5% mAP50.
  • Generalization capability: On tomato-ripeness1, the model attained 93.7% mAP50 and 84.6% F1-score, proving adaptability to diverse environments.
  • Real-time readiness: With 10.9 MB of memory and real-time frame rates on embedded hardware, YOLOv11-SLBA balances speed and precision for field deployment.
The YOLOv11-SLBA model makes significant progress in both average accuracy and inference speed for target detection, and its lightweight design is also markedly improved; these advances reflect the potential and promise of the YOLO algorithm and provide methods and approaches for tomato-picking robot applications [47].
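For readers interested in deployment, a model trained in an Ultralytics-style pipeline could be invoked as in the hypothetical sketch below; the weights filename and image path are placeholders, since no weights file is published under these names:

```python
from ultralytics import YOLO  # assumes an Ultralytics-style training pipeline

# "yolov11-slba.pt" is a placeholder for trained weights, not a published file.
model = YOLO("yolov11-slba.pt")
results = model.predict(source="greenhouse_frame.jpg", conf=0.25)
for box in results[0].boxes:
    # class id, confidence score, and bounding box in xyxy format
    print(int(box.cls), float(box.conf), box.xyxy.tolist())
```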

Author Contributions

Conceptualization, Y.H.; methodology, Y.H.; software, Y.H.; validation, L.R.; formal analysis, H.L.; data curation, H.Z.; writing—original draft preparation, Y.H.; writing—review and editing, Y.H. and H.L.; visualization, Y.H.; supervision, X.F. and H.L.; project administration, X.F.; funding acquisition, X.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (61962047), the Inner Mongolia Autonomous Region Science and Technology Major Special Project (2021ZD0005), the Inner Mongolia Autonomous Region Natural Science Foundation (2024MS06002), the Inner Mongolia Autonomous Region Universities and Colleges Innovative Research Team Program (NMGIRT2313), the Basic Research Business Fund for Inner Mongolia Autonomous Region Directly Affiliated Universities (BR22-14-05), Intelligent Knowledge Service Model for Agricultural and Livestock (2025ZD012) and the Collaborative Innovation Projects between Universities and Institutions in Hohhot (XTCX2023-20, XTCX2023-24).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Nie, H.; Yang, X.; Zheng, S.; Hou, L. Gene-Based Developments in Improving Quality of Tomato: Focus on Firmness, Shelf Life, and Pre- and Post-Harvest Stress Adaptations. Horticulturae 2024, 10, 641. [Google Scholar] [CrossRef]
  2. Deribe, H.; Beyene, B. Review on pre and post-harvest management on quality tomato (Lycopersicon esculentum Mill.) production. Food Sci. Qual. Manag. 2016, 54, 72–79. [Google Scholar]
  3. Song, K.; Chen, S.; Wang, G.; Qi, J.; Gao, X.; Xiang, M.; Zhou, Z. Research on High-Precision Target Detection Technology for Tomato-Picking Robots in Sustainable Agriculture. Sustainability 2025, 17, 2885. [Google Scholar] [CrossRef]
  4. Kabir, H.; Tham, M.-L.; Chang, Y.C. Internet of robotic things for mobile robots: Concepts, technologies, challenges, applications, and future directions. Digit. Commun. Netw. 2023, 9, 1265–1290. [Google Scholar] [CrossRef]
  5. Sherafati, A.; Mollazade, K.; Saba, M.K.; Vesali, F. TomatoScan: An Android-based application for quality evaluation and ripening determination of tomato fruit. Comput. Electron. Agric. 2022, 200, 107214. [Google Scholar] [CrossRef]
  6. Halstead, M.; McCool, C.; Denman, S.; Perez, T.; Fookes, C. Fruit quantity and ripeness estimation using a robotic vision system. IEEE Robot. Autom. Lett. 2018, 3, 2995–3002. [Google Scholar] [CrossRef]
  7. Behera, S.K.; Rath, A.K.; Sethy, P.K. Maturity status classification of papaya fruits based on machine learning and transfer learning approach. Inf. Process. Agric. 2021, 8, 244–250. [Google Scholar] [CrossRef]
  8. Zu, L.; Zhao, Y.; Liu, J.; Su, F.; Zhang, Y.; Liu, P. Detection and segmentation of mature green tomatoes based on mask R-CNN with automatic image acquisition approach. Sensors 2021, 21, 7842. [Google Scholar] [CrossRef]
  9. Yu, Y.; Zhang, K.; Yang, L.; Zhang, D. Fruit detection for strawberry harvesting robot in non-structural environment based on Mask-RCNN. Comput. Electron. Agric. 2019, 163, 104846. [Google Scholar] [CrossRef]
  10. Kang, H.; Zhou, H.; Wang, X.; Chen, C. Real-time fruit recognition and grasping estimation for robotic apple harvesting. Sensors 2020, 20, 5670. [Google Scholar] [CrossRef]
  11. Zhang, J.; Xie, J.; Zhang, F.; Gao, J.; Yang, C.; Song, C.; Rao, W.; Zhang, Y. Greenhouse tomato detection and pose classification algorithm based on improved YOLOv5. Comput. Electron. Agric. 2024, 216, 108519. [Google Scholar] [CrossRef]
  12. Wang, C.; Han, Q.; Li, J.; Li, C.; Zou, X. YOLO-BLBE: A Novel Model for Identifying Blueberry Fruits with Different Maturities Using the I-MSRCR Method. Agronomy 2024, 14, 658. [Google Scholar] [CrossRef]
  13. Zhai, S.; Shang, D.; Wang, S.; Dong, S. DF-SSD: An improved SSD object detection algorithm based on DenseNet and feature fusion. IEEE Access 2020, 8, 24344–24357. [Google Scholar] [CrossRef]
  14. Bai, Y.; Yu, J.; Yang, S.; Ning, J. An improved YOLO algorithm for detecting flowers and fruits on strawberry seedlings. Biosyst. Eng. 2024, 237, 1–12. [Google Scholar] [CrossRef]
  15. Seo, D.; Cho, B.-H.; Kim, K.-C. Development of monitoring robot system for tomato fruits in hydroponic greenhouses. Agronomy 2021, 11, 2211. [Google Scholar] [CrossRef]
  16. Sun, H.; Zheng, Q.; Yao, W.; Wang, J.; Liu, C.; Yu, H.; Chen, C. An Improved YOLOv8 Model for Detecting Four Stages of Tomato Ripening and Its Application Deployment in a Greenhouse Environment. Agriculture 2025, 15, 936. [Google Scholar] [CrossRef]
  17. Faseeh, M.; Bibi, M.; Khan, M.A.; Kim, D.-H. Deep learning assisted real-time object recognition and depth estimation for enhancing emergency response in adaptive environment. Results Eng. 2024, 24, 103482. [Google Scholar] [CrossRef]
  18. Wang, L.; Yoon, K.-J. Deep learning for HDR imaging: State-of-the-art and future trends. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8874–8895. [Google Scholar] [CrossRef]
  19. Yang, X. Application of High-Speed Optical Measurement Based on Nanoscale Photoelectric Sensing Technology in the Optimization of Football Shooting Mechanics. J. Nanoelectron. Optoelectron. 2023, 18, 1493–1501. [Google Scholar] [CrossRef]
  20. Ambrus, B.; Teschner, G.; Kovács, A.; Neményi, M.; Helyes, L.; Pék, Z.; Takács, S.; Alahmad, T.; Nyéki, A. Field-grown tomato yield estimation using point cloud segmentation with 3D shaping and RGB pictures from a field robot and digital single lens reflex cameras. Heliyon 2024, 10, e37997. [Google Scholar] [CrossRef]
  21. GH T1193-2021; Standards for Supply and Distribution Cooperation in the People’s Republic of China. All China Federation of Supply and Marketing Cooperatives (ACFSMC): Beijing, China, 2021.
  22. Liang, X.; Jia, H.; Wang, H.; Zhang, L.; Li, D.; Wei, Z.; You, H.; Wan, X.; Li, R.; Li, W.; et al. ASE-YOLOv8n: A Method for Cherry Tomato Ripening Detection. Agronomy 2025, 15, 1088. [Google Scholar] [CrossRef]
  23. Meng, Z.; Du, X.; Xia, J.; Ma, Z.; Zhang, T. Real-time statistical algorithm for cherry tomatoes with different ripeness based on depth information mapping. Comput. Electron. Agric. 2024, 220, 108900. [Google Scholar] [CrossRef]
  24. Akbar, J.U.M.; Kamarulzaman, S.F.; Muzahid, A.J.M.; Rahman, A.; Uddin, M. A Comprehensive review on deep learning assisted computer vision techniques for smart greenhouse agriculture. IEEE Access 2024, 12, 4485–4522. [Google Scholar] [CrossRef]
  25. Montesinos López, O.A.; Montesinos López, A.; Crossa, J. Overfitting, Model Tuning, and Evaluation of Prediction Performance. In Multivariate Statistical Machine Learning Methods for Genomic Prediction; Springer: Cham, Switzerland, 2022; pp. 109–139. [Google Scholar]
  26. He, L.-H.; Zhou, Y.-Z.; Liu, L.; Cao, W.; Ma, J.-H. Research on object detection and recognition in remote sensing images based on YOLOv11. Sci. Rep. 2025, 15, 14032. [Google Scholar] [CrossRef]
  27. Rahman, M.; Khan, S.I.; Babu, H.M.H. BreastMultiNet: A multi-scale feature fusion method using deep neural network to detect breast cancer. Array 2022, 16, 100256. [Google Scholar] [CrossRef]
  28. Yang, Z.; Guan, Q.; Zhao, K.; Yang, J.; Xu, X.; Long, H.; Tang, Y. Multi-branch Auxiliary Fusion YOLO with Re-parameterization Heterogeneous Convolutional for Accurate Object Detection. In Pattern Recognition and Computer Vision, Proceedings of the 7th Chinese Conference, PRCV 2024, Urumqi, China, 18–20 October 2024; Springer: Singapore, 2024; pp. 492–505. [Google Scholar]
  29. Zhang, X.; Wang, Y.; Fang, H. Steel surface defect detection algorithm based on ESI-YOLOv8. Mater. Res. Express 2024, 11, 056509. [Google Scholar] [CrossRef]
  30. Xu, K.; Zhu, D.; Shi, C.; Zhou, C. YOLO-DBL: A multi-dimensional optimized model for detecting surface defects in steel. J. Membr. Comput. 2025, 1–11. [Google Scholar] [CrossRef]
  31. Liu, D.; Liang, J.; Geng, T.; Loui, A.; Zhou, T. Tripartite feature enhanced pyramid network for dense prediction. IEEE Trans. Image Process. 2023, 32, 2678–2692. [Google Scholar] [CrossRef]
  32. Senussi, M.F.; Kang, H.-S. Occlusion Removal in Light-Field Images Using CSPDarknet53 and Bidirectional Feature Pyramid Network: A Multi-Scale Fusion-Based Approach. Appl. Sci. 2024, 14, 9332. [Google Scholar] [CrossRef]
  33. Thapa, N.; Khanal, R.; Bhattarai, B.; Lee, J. Pine Wilt Disease Segmentation with Deep Metric Learning Species Classification for Early-Stage Disease and Potential False Positive Identification. Electronics 2024, 13, 1951. [Google Scholar] [CrossRef]
  34. Anbeek, P.; Vincken, K.L.; van Osch, M.J.; Bisschops, R.H.; van der Grond, J. Probabilistic segmentation of white matter lesions in MR imaging. NeuroImage 2004, 21, 1037–1044. [Google Scholar] [CrossRef] [PubMed]
  35. Li, Z.; Peng, C.; Yu, G.; Zhang, X.; Deng, Y.; Sun, J. Light-head r-cnn: In defense of two-stage object detector. arXiv 2017, arXiv:1711.07264. [Google Scholar]
  36. Zhang, H.; Tian, Y.; Wang, K.; Zhang, W.; Wang, F.-Y. Mask SSD: An effective single-stage approach to object instance segmentation. IEEE Trans. Image Process. 2019, 29, 2078–2093. [Google Scholar] [CrossRef]
  37. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar]
  38. Du, F.-J.; Jiao, S.-J. Improvement of lightweight convolutional neural network model based on YOLO algorithm and its research in pavement defect detection. Sensors 2022, 22, 3537. [Google Scholar] [CrossRef]
  39. Tomato Maturity. Tomato Ripeness1 Dataset. 2022. Available online: https://universe.roboflow.com/tomato-maturity/tomato-ripeness1 (accessed on 5 May 2025).
  40. Du, Z.; Liang, Y. Object detection of remote sensing image based on multi-scale feature fusion and attention mechanism. IEEE Access 2024, 12, 8619–8632. [Google Scholar] [CrossRef]
  41. Ma, J.; Li, M.; Fan, W.; Liu, J. State-of-the-Art Techniques for Fruit Maturity Detection. Agronomy 2024, 14, 2783. [Google Scholar] [CrossRef]
  42. Purohit, M.; Singh, M.; Kumar, A.; Kaushik, B.K. Enhancing the surveillance detection range of image sensors using HDR techniques. IEEE Sens. J. 2021, 21, 19516–19528. [Google Scholar] [CrossRef]
  43. Sattler, C.; Nagel, U.J.; Werner, A.; Zander, P. Integrated assessment of agricultural production practices to enhance sustainable development in agricultural landscapes. Ecol. Indic. 2010, 10, 49–61. [Google Scholar] [CrossRef]
  44. Liu, J.; Wu, G.; Liu, Z.; Wang, D.; Jiang, Z.; Ma, L.; Zhong, W.; Fan, X.; Liu, R. Infrared and visible image fusion: From data compatibility to task adaption. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 2349–2369. [Google Scholar] [CrossRef]
  45. Tian, Y.; Pei, S.; Zhang, X.; Zhang, C.; Chawla, N.V. Knowledge distillation on graphs: A survey. ACM Comput. Surv. 2025, 57, 1–16. [Google Scholar] [CrossRef]
  46. Zhang, K.; Fang, B.; Zhang, Z.; Liu, T.; Liu, K. Exploring future ecosystem service changes and key contributing factors from a “past-future-action” perspective: A case study of the Yellow River Basin. Sci. Total Environ. 2024, 926, 171630. [Google Scholar] [CrossRef] [PubMed]
  47. Miao, Z.; Yu, X.; Li, N.; Zhang, Z.; He, C.; Li, Z.; Deng, C.; Sun, T. Efficient tomato harvesting robot based on image processing and deep learning. Precis. Agric. 2023, 24, 254–287. [Google Scholar] [CrossRef]
Figure 1. Map of different categories of data.
Figure 2. Comparison diagram of spray simulation.
Figure 3. Comparison chart of fog simulation.
Figure 4. Strong light simulation comparison chart.
Figure 5. Comparison plot of masking simulations.
Figure 6. Data annotation interface diagram.
Figure 7. YOLOv11 network architecture diagram.
Figure 9. LSKA module network structure diagram.
Figure 10. SPPF-LSKA module network structure diagram.
Figure 11. Detailed diagram of the SPPF-LSKA module network structure.
Figure 12. FPN, PAN, and BiFPN network structure diagram.
Figure 14. Diagram of the structure of the DetectAux network.
Figure 15. Convergence curves of YOLOv11 and YOLOv11-SLBA on the tomato ripening dataset.
Figure 16. YOLOv11-SLBA detection results on the tomato-ripeness1 dataset.
Figure 17. Comparison of model performance before and after improvement.
Figure 18. Confusion matrices of different models' predictions across categories. Note: (A) YOLOv11; (B) YOLOv11-SLBA. The horizontal axis represents the actual tomato ripeness class and the vertical axis the predicted class. For example, in (B), the YOLOv11-SLBA model correctly predicted 1002 instances of "Yellowing and half ripe" but incorrectly classified 84 instances as background.
Figure 19. Comparison plot of the detection effect of different models.
Figure 20. Detection effect diagram in different complex environments.
Table 1. Data classification comparison table.

GH/T 1193-2021 Stage | Article Category | Color/Size Characteristics | Applicable Scenario
Unripe stage + green ripe stage | Flowering and young fruiting | White-green, diameter < 3 cm | Do not pick
Transition period | Green growth period | Pure green, diameter ≥ 3 cm | Long-term storage
Early ripening stage | Yellowing and half ripe | Colored area accounts for 30–89% of the total area | Short-haul transportation
Mid-to-late stage of ripening | Red maturity stage | Red accounts for more than 70% | Instant sales
Table 2. Number of images in each category.

Categories | Total Images | Training Set | Validation Set | Test Set
Flowering and young fruiting | 600 | 422 | 125 | 53
Green growth period | 4407 | 3075 | 895 | 437
Yellowing and half ripe | 1469 | 1021 | 306 | 142
Red maturity stage | 2044 | 1418 | 416 | 210
Table 3. Number of images in each category after environmental enhancement.

Categories | Total Images | Training Set | Validation Set | Test Set
Flowering and young fruiting | 1470 | 1107 | 241 | 122
Green growth period | 6512 | 4549 | 1298 | 712
Yellowing and half ripe | 3899 | 2721 | 784 | 417
Red maturity stage | 4041 | 2795 | 838 | 400
Table 4. Comparison of the number of images before and after data enhancement.

Datasets | Total Images | Training Set | Validation Set | Test Set
Pre-expansion dataset | 6493 | 4546 | 1298 | 649
Expanded dataset | 10,000 | 7000 | 2000 | 1000
Table 5. Performance comparison of the self-built dataset across different target detection networks under the same conditions.

Model | Parameters/Million | Model Memory/MB | Inference Time/ms | P | R | mAP50 | F1-Score | mAP50-95
Faster R-CNN | 41.12 | 657 | 75 | 60.5% | 88.4% | 87.5% | 71.8% | 61.0%
SSD | 24.01 | 93 | 22.5 | 89.3% | 78.4% | 85.7% | 83.5% | 59.5%
RT-DETR | 19.9 | 307.4 | 96.3 | 91.2% | 82.9% | 88.6% | 86.8% | 61.9%
YOLOv7 | 36.5 | 148 | 5.8 | 89.5% | 84.9% | 90.2% | 87.1% | 63.6%
YOLOv8 | 3 | 5.8 | 1.2 | 90.1% | 84.5% | 90.4% | 87.2% | 63.9%
YOLOv11 | 2.8 | 6.9 | 3.6 | 88.6% | 83.4% | 89.8% | 85.9% | 63.8%
YOLOv11-SLBA | 2.72 | 10.9 | 2.3 | 92.0% | 83.5% | 91.3% | 87.5% | 64.6%
Table 6. Performance comparison of the public dataset across different target detection networks under the same conditions.

Model | Parameters/Million | Model Memory/MB | Inference Time/ms | P | R | mAP50 | F1-Score
Faster R-CNN | 41 | 657 | 77 | 55.6% | 96.9% | 89.02% | 70.6%
SSD | 26.28 | 92 | 26 | 70.5% | 91.4% | 87.1% | 79.7%
RT-DETR | 19.9 | 307.3 | 176.9 | 77.9% | 86.5% | 89.7% | 81.9%
YOLOv7 | 36.48 | 73 | 6.2 | 78.2% | 91.0% | 88.7% | 84.2%
YOLOv8 | 3 | 5.8 | 1.7 | 83.0% | 84.0% | 91.5% | 83.5%
YOLOv11 | 2.58 | 9.8 | 3.2 | 85.9% | 85.3% | 92.7% | 85.6%
YOLOv11-SLBA | 2.72 | 10.9 | 2.5 | 78.6% | 91.5% | 93.7% | 84.6%
Table 7. Performance comparison of different datasets with the YOLOv11-SLBA model.

Datasets | P | R | mAP50 | F1-Score
Public dataset | 78.6% | 91.5% | 93.7% | 84.6%
Self-built dataset | 92.0% | 83.5% | 91.3% | 87.5%
Table 8. Ablation experiments on the model improvements.

Model | SPPF-LSKA | BiAttFPN | Aux | P | R | mAP50 | F1-Score
YOLOv11 | – | – | – | 88.6% | 83.4% | 89.8% | 85.9%
YOLOv11-SPPF-LSKA | ✓ | – | – | 89.3% | 84.0% | 90.4% | 86.6%
YOLOv11-BiAttFPN | – | ✓ | – | 92.5% | 81.1% | 90.8% | 86.4%
YOLOv11-Aux | – | – | ✓ | 90.9% | 83.2% | 90.1% | 86.9%
YOLOv11-SPPF-LSKA-Aux | ✓ | – | ✓ | 90.8% | 84.1% | 90.7% | 87.3%
YOLOv11-SLBA | ✓ | ✓ | ✓ | 92.0% | 83.5% | 91.3% | 87.5%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
