1. Introduction
As the core component of modern agriculture, facility agriculture overcomes the reliance of traditional farming on natural conditions by artificially regulating environmental factors such as light, temperature, and humidity. It has become a critical industry for ensuring global food security and enhancing the efficiency of agricultural product supply [1]. Greenhouse cultivation, one of the most widely adopted forms of facility agriculture, has become a primary platform for off-season vegetable production in northern China during winter, owing to its relatively low construction cost, ease of standardized management, and strong environmental adaptability. Among greenhouse crops, round eggplant is an important commercial vegetable; its firm flesh, good storability and transportability, and competitive price have made it a leading product in northern markets, contributing substantially to farmers’ income and the regional agricultural economy [2,3].
Yield assessment is a core component of greenhouse eggplant production management, and the accuracy of fruit counting directly affects planting strategy optimization, water and fertilizer allocation, supply-chain planning, and the reliability of economic benefit forecasting [4,5,6]. Traditional eggplant counting has relied mainly on manual field inspection, in which workers tally fruits to estimate total yield. This process is subjective and labor-intensive, and its accuracy is easily degraded by fruit occlusion and inconsistent counting efficiency, making it inadequate for monitoring large-scale production. Conventional manual approaches therefore cannot meet the requirements of precision agriculture for real-time, non-destructive, and accurate data, which highlights the need for automated, high-precision counting methods.
Computer vision–based detection provides an effective way to address these challenges. Early studies mainly relied on traditional image processing techniques, such as exploiting color [7,8], shape [9,10], and texture [11] features for fruit segmentation and recognition. These methods can achieve acceptable performance under relatively simple backgrounds and uniform illumination. However, greenhouse scenes are inherently complex, with frequent leaf and stem occlusion, strong specular reflections, background interference from structures, and severe overlap among fruits. These factors limit the generalization and robustness of conventional image processing approaches, making them difficult to apply reliably in practical production.
Recent advances in deep learning provide an effective solution to these challenges [12]. Compared with LiDAR [13,14], near-infrared spectroscopy [15], and thermal imaging [16,17], deep learning–based vision methods are typically lower in cost, easier to deploy, and able to extract rich information from images, showing strong potential for agricultural vision tasks. These methods have been applied to fruit detection [18,19,20], crop disease and pest identification [21,22], fruit maturity grading [23,24], and weed management [25,26]. Pan et al. [27] proposed an improved Faster R-CNN (Faster Region-based Convolutional Neural Network) based detection model that enhanced seedling feature extraction and provided more accurate data support for cultivation management. Islam et al. (2024) improved Mask R-CNN (Mask Region-based Convolutional Neural Network) to achieve accurate detection and segmentation of lettuce seedlings in tray images, reaching an F1-score of 93% and producing leaf-area estimates that were highly consistent with manual measurements [28]. El Akrouchi et al. (2025) improved Mask R-CNN for accurate detection and segmentation of dense quinoa panicles by replacing the backbone with EfficientNet-B7 and adopting the Mish activation function, thereby enhancing representation and segmentation performance in crowded scenes and providing useful insights for smart agriculture research [29].
YOLO (You Only Look Once) uses a one-stage detection framework for object detection, achieving a good trade-off between inference speed and detection accuracy, and is therefore suitable for agricultural scenarios with strict real-time requirements. Su et al. [30] developed an SE-YOLOv3-MobileNetV1 model for greenhouse tomatoes by integrating depthwise separable convolutions with a squeeze-and-excitation (SE) attention mechanism, achieving a mean average precision (mAP) of 97.5% for classifying four maturity stages. Omer et al. [31] redesigned YOLOv5l for cucumber leaf disease identification by incorporating Bottleneck CSP modules, reducing the model size to 13.6 MB. Zhu et al. [32], focusing on camellia fruits in complex orchard settings, developed an SDF-YOLO model by integrating Selective Kernel Convolution (SKConv), a decoupled head, and Focal EIoU loss. The model achieved a comprehensive mAP of 96.65% and exhibited strong adaptability to occlusion and lighting variations. Chen [33] addressed the challenge of detecting ripe wolfberry fruits in natural environments by proposing a lightweight and efficient detection model based on an improved YOLOv8. This model reliably achieves real-time recognition of wolfberry fruits under complex conditions, providing a valuable reference for non-destructive maturity detection in other agricultural products. In the field of jujube fruit recognition, Wang et al. (2025) [34] introduced the Jujube-YOLO model, which incorporates targeted enhancements to the latest YOLOv11 architecture. By integrating a multi-branch channel attention module, their method surpasses several image detection models in accuracy for identifying jujube fruits and their split states in complex environments, demonstrating the significant potential of object detection models in addressing specific agricultural challenges.
This study aims to detect and count eggplants under complex greenhouse conditions using per-row video acquisition. Such a data-collection protocol poses significant challenges, such as severe occlusions, illumination variations, background clutter, motion blur, and duplicate counts caused by repeated appearances of the same fruit in adjacent frames.
However, relying solely on static image detection is insufficient for achieving precise yield estimation and counting. In video sequences, factors such as fruit occlusion, camera motion, and viewpoint variations can lead to repeated counting or missed detections of the same fruit, significantly compromising the accuracy of statistical results. To address this issue, multi-object tracking (MOT) has been introduced into yield estimation systems. This approach aims to assign a unique and persistent ID to each detected target in the video, enabling continuous tracking of fruit trajectories and accurate counting.
DeepSORT (Deep Simple Online and Realtime Tracking) extends SORT (Simple Online and Realtime Tracking) by incorporating appearance features for more reliable association, while using a Kalman filter for motion prediction and the Hungarian algorithm for data association [35,36].
Several studies have explored the integration of YOLO with DeepSORT for agricultural applications, such as the counting of passion fruit [37], apples [38], and cherry tomatoes [39]. However, the detection and tracking of eggplants in greenhouse environments still face unique challenges. These include occlusion by stems and leaves, structural interference, and the similarity in shape between eggplants and overlapping foliage. In addition, uneven greenhouse lighting creates intense highlights and deep shadows that significantly degrade detector performance. In practical scenarios, the high density and severe mutual occlusion of eggplant fruits often lead to frequent ID switches for the same target, resulting in repeated counts and placing higher demands on the discriminative capability of tracking algorithms. The main contributions of this study are as follows:
(1) An integrated detection-and-counting framework for eggplant fruits in complex greenhouse environments is proposed by combining an improved YOLOv5s detector with the DeepSORT tracker, enabling robust video-based detection and counting.
(2) To improve deployment efficiency under greenhouse interference and resource constraints, YOLOv5s is redesigned by replacing the backbone with MobileNetV3, inserting an ECA module, and replacing the neck C3 block with C2f, achieving a better trade-off between detection accuracy and computational cost.
(3) To address duplicate counting caused by frequent ID switches during tracking, a counting-zone strategy is proposed. It replaces traditional ID-accumulation counting with a rule that increments the count when the bounding-box center enters the counting zone, thereby reducing duplicate counts and improving counting accuracy.
2. Materials and Methods
An improved YOLOv5s–DeepSORT-based framework is developed for eggplant detection and counting in greenhouse environments. The workflow is illustrated in Figure 1. First, greenhouse inspection videos are collected and used to construct an eggplant fruit detection dataset through frame extraction, data augmentation, and manual annotation. Second, the improved YOLOv5s detector is trained to generate fruit bounding boxes for each frame. Third, the detections are associated across frames using DeepSORT to maintain target identity continuity. Finally, a counting-zone strategy is applied: the count is incremented only once per target when the bounding-box center enters the predefined counting zone, which reduces duplicate counting caused by short-term occlusion and identity switches. Detection performance and counting performance are evaluated using standard metrics.
2.1. Data Acquisition
The video data for this study were collected from a greenhouse vegetable facility located in Jijialishuang Village, Heguan Town, Qingzhou City, Shandong Province (coordinates: 36°51′2.890′′ N, 118°37′35.516′′ E). The target crop was Tianjin round eggplant. The planting system in the test area employed flat land cultivation, with a row spacing of 1.5 m and alley lengths ranging from 12 to 15 m. The eggplant fruits are mainly distributed between 0.5 and 1.2 m in height. This planting structure and fruit distribution are suitable for the field operations of harvesting and inspection robots, as shown in
Figure 2.
In this study, an Intel RealSense D435i depth camera (Intel Corporation, Santa Clara, CA, USA) mounted on a mobile platform was used for data collection. The camera’s built-in inertial measurement unit provides six-degree-of-freedom motion sensing, with an accelerometer range of ±4 g at sample rates of 62.5 Hz and 250 Hz and a gyroscope range of ±1000 deg/s at 200 Hz and 400 Hz. It also provides a timestamp accuracy of 50 microseconds. This configuration supports precise capture of the relative motion and position of eggplants during inspection, which is essential for accurate detection and tracking in the greenhouse.
The camera lens was positioned 0.8 m above the ground with a pitch angle of approximately −10°, simulating the perspective of an inspection robot. This setup enabled effective monitoring of the eggplants within the dynamic and complex environment of the greenhouse. Data were collected on 12 November 2024 under variable lighting conditions, with challenges such as structural interference, leaf occlusion, and inconsistent illumination. The platform was manually towed at a constant speed of 0.1 m/s along the eggplant rows, simulating the motion of an inspection robot. A total of 10 valid video sequences (2–3 min each) were recorded at 960 × 544 resolution and 30 fps and saved in MP4 format.
2.2. Dataset Construction
In complex greenhouse environments, factors such as fluctuating lighting, structural interference, occlusion among stems, leaves, and fruits, and motion blur from uneven platform movement pose significant challenges to accurate model recognition. Of the 10 recorded videos, 7 were randomly selected for dataset construction, and the remaining 3 were reserved for the subsequent validation of the DeepSORT tracking and counting algorithm.
Frames were extracted from the 7 selected videos at fixed 15-frame intervals to reduce redundancy between adjacent frames. To enhance model robustness and prevent overfitting from limited samples, several data augmentation techniques were applied using OpenCV (version 4.8.0.76), including flipping, cropping, brightness adjustment, Gaussian noise, blur, and vignette processing.
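As a rough illustration of the sampling step, the sketch below lists the frame indices retained when one frame is kept out of every 15, and estimates how many raw frames a 2 to 3 min video at 30 fps would yield before augmentation and selection. The function name is hypothetical, and starting the sampling at frame 0 is an assumption.

```python
from typing import List


def sample_frame_indices(total_frames: int, step: int = 15) -> List[int]:
    """Return the indices of the frames kept when sampling every `step` frames."""
    if total_frames < 0 or step <= 0:
        raise ValueError("total_frames must be >= 0 and step must be > 0")
    return list(range(0, total_frames, step))


FPS = 30  # recording frame rate reported for the videos
for minutes in (2, 3):
    total = minutes * 60 * FPS
    kept = sample_frame_indices(total)
    print(f"{minutes} min video: {total} frames, {len(kept)} kept")
```

Under these assumptions, each video contributes roughly 240 to 360 raw frames.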
After augmentation and careful selection, a final dataset of 4601 images was obtained. Figure 3 illustrates a sample collection of eggplant fruit images from the augmented greenhouse dataset. The dataset was divided into training, validation, and test sets in a ratio of 8:1:1, comprising 3681 images for training and 460 images each for validation and testing. All eggplant instances were annotated using the LabelMe tool (version 5.5.0), with a rectangular bounding box drawn tightly around each fruit. The category was labeled as “eggplant,” and the annotation results were saved in the TXT format required for YOLO training. Each annotation file contains the object category and location information.
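The reported subset sizes follow from the 8:1:1 ratio applied to 4601 images. The sketch below shows one way to reproduce them; the function name, the fixed random seed, and the rounding rule (round the training and validation sizes, assign the remainder to the test set) are assumptions rather than the procedure actually used.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(items: Sequence[str],
                  ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
                  seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle items and split them into train/val/test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical file names standing in for the 4601 dataset images.
images = [f"img_{i:04d}.jpg" for i in range(4601)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))  # 3681 460 460
```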
2.3. The Improved YOLOv5 Detection Model
2.3.1. Network Architecture of the YOLOv5s Model
In recent years, the YOLO family has been widely adopted for real-time object detection due to its end-to-end, one-stage design. Several versions, including YOLOv5, YOLOv8, YOLOv9, and YOLOv11, have been developed through continuous iteration. For this study, YOLOv5s was selected as the base model for improvement due to its well-established and efficient architecture. Although newer versions of YOLO demonstrate superior performance on general datasets, this advantage does not necessarily carry over to specific environments such as complex greenhouse settings. In such conditions, targets are often densely packed, severely occluded by stems and branches, and affected by structural interference and fluctuating lighting. These factors make detection more difficult and place higher demands on model robustness. Additionally, the pursuit of higher accuracy often leads to increased computational complexity, which challenges practical deployment in resource-limited environments.
The network architecture of the improved YOLOv5s is shown in Figure 4. The main modifications are summarized as follows:
(1) The backbone is replaced with MobileNetV3 to reduce parameters and FLOPs while maintaining feature extraction capability.
(2) An Efficient Channel Attention (ECA) module is inserted to enhance discriminative channel responses under occlusion and clutter.
(3) The neck C3 blocks are replaced with C2f blocks to strengthen multi-scale feature fusion and improve gradient flow for small and dense targets.
2.3.2. MobileNetV3 Network
In this work, the original YOLOv5s backbone (CSPDarknet53) is replaced with MobileNetV3 to reduce computation. CSPDarknet53 is accurate but relatively heavy, which limits real-time performance on edge devices. MobileNetV3 is a lightweight CNN designed for mobile deployment. It combines depthwise separable convolutions, NAS-optimized architecture, and the h-swish activation to improve efficiency. As a result, the network can extract useful features with lower FLOPs and latency, making it more suitable for real-time object detection on resource-constrained hardware.
The core idea of depthwise separable convolution is to factorise a standard convolution into two lightweight operations: a depthwise convolution that performs spatial feature extraction on each channel separately, followed by a pointwise (1 × 1) convolution that fuses the channels. Compared with the standard convolutions in the YOLOv5s backbone, this factorisation substantially reduces computational cost and parameter count.
For a convolutional layer with an input feature map of size $D_F \times D_F \times M$ and an output feature map of size $D_F \times D_F \times N$, where $M$ and $N$ denote the numbers of input and output channels, respectively, and the kernel size is $D_K \times D_K$, the computational cost of standard convolution $C_{std}$ can be expressed as:
$$C_{std} = D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$$
The depthwise convolution performs spatial filtering separately for each input channel, and its computational cost $C_D$ is given by:
$$C_D = D_K \cdot D_K \cdot M \cdot D_F \cdot D_F$$
The pointwise convolution employs 1 × 1 kernels to fuse the $M$ channels output by the depthwise convolution, thereby constructing $N$ new output channels. Its computational cost $C_P$ is expressed as:
$$C_P = M \cdot N \cdot D_F \cdot D_F$$
The total computational cost $C_{dsc}$ of the depthwise separable convolution is the sum of the two components:
$$C_{dsc} = C_D + C_P = D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$$
Compared to standard convolution, the reduction in computational load achieved by using depthwise separable convolution can be calculated as follows:
$$\frac{C_{dsc}}{C_{std}} = \frac{1}{N} + \frac{1}{D_K^2}$$
With a 3 × 3 kernel ($D_K = 3$) and a large number of output channels $N$, the term $1/N$ becomes negligible and the ratio approaches 1/9. This decomposition therefore reduces the FLOPs of standard convolution by about ninefold, while maintaining comparable feature extraction capacity.
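The cost comparison above can be checked numerically. The following is a minimal sketch, assuming the usual multiply-accumulate counts for a standard convolution (kernel area × input channels × output channels × output positions) and for its depthwise and pointwise parts. The function names and the layer dimensions in the example are illustrative and not taken from the model.

```python
def standard_conv_cost(d_k: int, m: int, n: int, d_f: int) -> int:
    """Multiply-accumulates of a standard D_K x D_K convolution."""
    return d_k * d_k * m * n * d_f * d_f


def depthwise_separable_cost(d_k: int, m: int, n: int, d_f: int) -> int:
    """Multiply-accumulates of a depthwise plus pointwise (1 x 1) convolution."""
    depthwise = d_k * d_k * m * d_f * d_f  # one spatial filter per input channel
    pointwise = m * n * d_f * d_f          # 1 x 1 convolution fusing M channels into N
    return depthwise + pointwise


# Illustrative layer: 3 x 3 kernel, 128 -> 256 channels, 40 x 40 feature map.
d_k, m, n, d_f = 3, 128, 256, 40
ratio = depthwise_separable_cost(d_k, m, n, d_f) / standard_conv_cost(d_k, m, n, d_f)
print(f"cost ratio = {ratio:.4f}")
print(f"1/N + 1/D_K^2 = {1 / n + 1 / d_k ** 2:.4f}")
```

Both printed values agree, and for large N they approach 1/9 ≈ 0.111.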
As shown in Figure 5, the MobileNetV3 block starts with a 1 × 1 convolution to expand channel dimensions and improve feature representation. A 3 × 3 depthwise convolution then captures spatial information. Next, a lightweight Squeeze-and-Excitation (SE) module recalibrates channel responses to emphasize informative features. A final 1 × 1 convolution projects the features back to the original channel dimension, forming a bottleneck. A residual connection links the input and output, improving gradient flow and enabling feature reuse.
2.3.3. ECA Attention
To improve robustness under occlusion and background clutter in greenhouse scenes, an ECA module is integrated into YOLOv5s to enhance channel-wise feature discrimination with minimal computational overhead. This integration strengthens feature discrimination and improves eggplant fruit detection in challenging scenes.
The ECA module is illustrated in Figure 6. Given an input feature map χ, global average pooling over the spatial dimensions H × W produces a channel descriptor Z. Unlike the Squeeze-and-Excitation (SE) module, ECA removes fully connected layers and uses a lightweight 1D convolution to model cross-channel interactions. The convolution kernel size K is adaptively determined by a nonlinear function K = ψ(C), where C is the number of channels. The descriptor is then passed through the 1D convolution and a Sigmoid function to obtain channel attention weights A ∈ [0, 1]. Finally, A is reshaped to [B, C, 1, 1] and multiplied with χ to generate the recalibrated output feature map.
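The ECA computation described above can be sketched in plain Python for a single sample (the batch dimension is omitted). The form of ψ(C) used here, the nearest odd integer to (log2(C) + b)/γ with γ = 2 and b = 1, is the one commonly given for ECA-Net and is an assumption, since the text only states K = ψ(C). The 1D convolution weights are learned in practice; the values below are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Adaptive odd kernel size K = psi(C) (assumed ECA-Net form)."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def eca(feature_map: FeatureMap,
        weights: Sequence[float]) -> Tuple[FeatureMap, List[float]]:
    """Apply channel attention; `weights` is an odd-length 1D kernel."""
    channels = len(feature_map)
    # Global average pooling over H x W gives one descriptor value per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # 1D convolution across neighbouring channels with zero padding.
    pad = len(weights) // 2
    padded = [0.0] * pad + z + [0.0] * pad
    conv = [sum(w * padded[i + j] for j, w in enumerate(weights))
            for i in range(channels)]
    # Sigmoid maps the responses to attention weights in (0, 1).
    attention = [1.0 / (1.0 + math.exp(-v)) for v in conv]
    # Rescale every spatial position of each channel by its attention weight.
    out = [[[v * a for v in row] for row in ch]
           for ch, a in zip(feature_map, attention)]
    return out, attention


print(eca_kernel_size(64), eca_kernel_size(256))  # 3 5

# Toy feature map: 4 channels of size 2 x 2, channel c filled with the value c.
toy = [[[float(c)] * 2 for _ in range(2)] for c in range(4)]
_, attn = eca(toy, weights=[0.25, 0.5, 0.25])
print([round(a, 3) for a in attn])
```

Because the kernel spans only K neighbouring channels, the module adds K parameters per insertion point, which is why its overhead is small compared with the fully connected layers of SE.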
2.3.4. C2f Module
To strengthen multi-scale feature fusion, the YOLOv5s neck is modified by replacing C3 blocks with C2f blocks, which strengthens feature integration across the feature pyramid. In the original YOLOv5s, the C3 module follows the CSPNet design to reduce computation by splitting feature maps and using partial cross-stage connections. However, this split can weaken feature propagation and reduce information retention, especially in complex scenes and small-object detection. The C2f block introduces multiple bottleneck branches and concatenates their outputs for subsequent fusion, which improves feature reuse and gradient flow.
As shown in Figure 7a, the C3 module splits the input feature map into two branches: one is processed by a sequence of Bottleneck blocks, while the other is forwarded through a shortcut. The two branches are then concatenated. In contrast, the C2f module in Figure 7b preserves the full feature flow, applies multiple parallel Bottleneck blocks for multi-scale feature extraction, concatenates all Bottleneck outputs with the original input, and finally fuses them using a convolutional layer.
2.4. Methods for Counting Greenhouse Eggplants
2.4.1. Designated Counting-Area Method
The overall pipeline consists of three stages: object detection, inter-frame tracking, and region-triggered counting.
Accurate yield estimation in greenhouse videos requires not only reliable per-frame detection but also consistent target identities across frames. Therefore, we combine the improved YOLOv5s detector with the DeepSORT tracker and a counting-zone strategy to achieve robust online eggplant counting.
Figure 8 illustrates the DeepSORT architecture adopted in this study.
First, the improved YOLOv5s detector is applied to each video frame. Frames are resized to 640 × 640 using a letterbox strategy, normalized, and fed into the network. The raw predictions are filtered by non-maximum suppression (confidence threshold = 0.6, IoU threshold = 0.5), and only detections of the eggplant class are retained.
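The post-processing step can be illustrated with a greedy non-maximum suppression routine. This is a simplified sketch using the thresholds stated above (confidence 0.6, IoU 0.5), not the YOLOv5 implementation itself. The function names and the example boxes are hypothetical, and boxes are assumed to be in (x1, y1, x2, y2) pixel format.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Box, float]            # (box, confidence)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(dets: List[Detection], conf_thres: float = 0.6,
        iou_thres: float = 0.5) -> List[Detection]:
    """Drop low-confidence boxes, then suppress overlapping lower-scored ones."""
    candidates = sorted((d for d in dets if d[1] >= conf_thres),
                        key=lambda d: d[1], reverse=True)
    kept: List[Detection] = []
    for box, score in candidates:
        if all(iou(box, k[0]) <= iou_thres for k in kept):
            kept.append((box, score))
    return kept


detections = [
    ((0, 0, 10, 10), 0.90),
    ((1, 1, 11, 11), 0.80),    # overlaps the first box (IoU about 0.68)
    ((20, 20, 30, 30), 0.70),
    ((40, 40, 50, 50), 0.50),  # below the confidence threshold
]
print(nms(detections))
```

In this example only the first and third boxes survive: the second is suppressed as a duplicate and the fourth is discarded for low confidence.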
Second, detections are associated across frames using DeepSORT to maintain target identity continuity. DeepSORT predicts the state of each track using a Kalman filter and extracts an appearance embedding for each detection using a lightweight ReID network. Association is performed by a matching cascade: confirmed tracks are first matched to current detections by minimizing the cosine distance between embeddings, with Mahalanobis-distance gating based on the Kalman prediction; remaining and unconfirmed tracks are then matched using IoU. In the IoU matching stage, an IoU threshold of 0.3 is used. Tracks are kept for up to max_age (max_life) frames without a match to bridge short-term occlusions. Unmatched detections initialize new tracks, and tracks that exceed max_age are removed.
As shown in Figure 9, targets near the image boundaries are prone to partial occlusion, unstable detections, or incomplete bounding boxes when they first enter the field of view. Therefore, a counting zone is defined along the horizontal axis of the frame: the left boundary is set at one-fifth of the image width, and the right boundary is set at four-fifths of the image width, forming the valid counting region in between. For each tracked target, the center coordinate of its bounding box is computed. When the target enters the counting region and its ID has not been counted, the system triggers a single counting event and adds the ID to the counted-ID set. The workflow of the proposed counting method is shown in Figure 10.
Counting is triggered when the following conditions are satisfied:
(1) The target is successfully matched to an existing track in the current frame (IoU > 0.3).
(2) The bounding-box center enters the predefined counting zone.
(3) The target ID is not in the counted-ID set.
Only when all the above conditions are satisfied simultaneously does the system increment the total count by 1 and insert the corresponding ID into the set. The key motivation is that, once a target enters the counting zone, it typically remains within the zone for multiple consecutive frames. Without a de-duplication constraint, the same target would be repeatedly accumulated across frames, leading to overestimation of the final count. By maintaining a counted-ID set, each track ID is counted at most once throughout the entire video, thereby suppressing duplicate counting.
The proposed region-triggered counting scheme with de-duplication reduces repeated accumulation of the same target across consecutive frames within the counting zone. In addition, the IoU matching threshold and short-term track retention (max_life) can partially reduce the probability of ID switches, further improving counting stability. Finally, the system outputs the cumulative count and generates a visualization video with bounding boxes, IDs, and counting-zone annotations.
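The counting rule can be sketched as a small stateful class. It assumes the tracker supplies, for each frame, the tracks matched in that frame as (track_id, box) pairs with boxes in (x1, y1, x2, y2) pixel coordinates. The class and method names are hypothetical, and treating the zone boundaries as inclusive is an assumption.

```python
from typing import Iterable, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


class ZoneCounter:
    """Count each track ID once, when its box center lies in the counting zone."""

    def __init__(self, frame_width: float) -> None:
        self.left = frame_width / 5        # left boundary at one-fifth of the width
        self.right = 4 * frame_width / 5   # right boundary at four-fifths of the width
        self.counted_ids: Set[int] = set()
        self.total = 0

    def update(self, tracks: Iterable[Tuple[int, Box]]) -> int:
        """Process the matched tracks of one frame and return the running count."""
        for track_id, (x1, _y1, x2, _y2) in tracks:
            center_x = (x1 + x2) / 2
            in_zone = self.left <= center_x <= self.right
            if in_zone and track_id not in self.counted_ids:
                self.counted_ids.add(track_id)
                self.total += 1
        return self.total


counter = ZoneCounter(frame_width=960)  # zone spans x = 192 to x = 768
frames = [
    [(1, (80, 100, 120, 160))],   # ID 1 still outside the zone
    [(1, (280, 100, 320, 160))],  # ID 1 enters the zone: counted
    [(1, (380, 100, 420, 160)), (2, (780, 90, 820, 150))],  # ID 1 already counted
    [(2, (480, 90, 520, 150))],   # ID 2 enters the zone: counted
]
for tracks in frames:
    print(counter.update(tracks))  # prints 0, 1, 1, 2
```

Note that this rule removes repeated accumulation of the same ID, but a fruit that receives a new ID while inside the zone would still be counted again, which is why stable association remains important.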
2.4.2. Model Training Configuration
The experimental study was conducted using a software environment based on Windows 11 Professional, Python 3.8.20, CUDA 11.3, and PyTorch 1.9.0 for image data processing and model development. The hardware setup consisted of a 12th Gen Intel Core i5-12400F CPU, an NVIDIA GeForce RTX 4060 Ti GPU, and 32 GB of RAM.
The network was configured with an input image size of 640 × 640 pixels and a batch size of 16. Model optimization was performed using stochastic gradient descent with an initial learning rate of 0.01 and weight decay of 0.0005. To prevent overfitting, training was conducted for 100 epochs, with Mosaic data augmentation disabled during the final 10 epochs.
2.4.3. Evaluation Metrics
To comprehensively evaluate the detection performance of eggplant fruits in complex greenhouse environments, this study adopts Precision (P), Recall (R), and mean Average Precision at an IoU threshold of 0.5 (mAP@0.5) to assess the improved YOLOv5s model. The metrics are calculated as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$AP = \int_{0}^{1} P(R)\,dR$$
$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i$$
Here, TP denotes eggplant fruits correctly detected by the model, whereas FP refers to background regions or non-target objects mistakenly classified as eggplant fruits, which may lead to overestimation in counting. Precision therefore measures the proportion of predicted fruits that are true fruits. FN indicates missed detections of real eggplant fruits, which result in yield underestimation; Recall thus reflects the model’s capability to detect real fruits. Higher P and R values indicate better detection performance. In addition, mAP@0.5 is a standard comprehensive metric in object detection, defined as the mean of the average precision $AP_i$ across the $n$ classes; a higher value corresponds to superior performance. In this work, an IoU threshold of 0.5 is used because yield counting primarily requires correct fruit detection rather than extremely precise bounding-box localization; a reasonable overlap between predicted and ground-truth boxes is sufficient for subsequent counting.
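To make the roles of TP, FP, and FN concrete, the sketch below matches predicted boxes to ground-truth boxes at an IoU threshold of 0.5 and derives Precision and Recall from the resulting counts. The greedy, confidence-ordered matching is a common convention and is assumed here. All function names and example boxes are hypothetical.

```python
from typing import List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(preds: List[Tuple[Box, float]], gts: List[Box],
                  iou_thres: float = 0.5) -> Tuple[int, int, int]:
    """Return (TP, FP, FN); each ground-truth box is matched at most once."""
    matched: Set[int] = set()
    tp = fp = 0
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if j not in matched and iou(box, gt) > best_iou:
                best_iou, best_j = iou(box, gt), j
        if best_iou >= iou_thres:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
    return tp, fp, len(gts) - len(matched)


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


gts = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
preds = [((0, 0, 10, 10), 0.9), ((21, 21, 31, 31), 0.8), ((60, 60, 70, 70), 0.7)]
tp, fp, fn = count_matches(preds, gts)
print(tp, fp, fn)  # 2 1 1
print(precision_recall(tp, fp, fn))
```

Here one prediction falls on background (a false positive that would inflate the count) and one fruit is missed (a false negative that would deflate it), giving Precision and Recall of 2/3 each.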
For multi-object tracking performance, Counting Accuracy serves as the core metric for evaluating the system’s overall yield estimation capability. It is calculated as follows:
$$\text{Counting Accuracy} = \left(1 - \frac{|N_P - N_G|}{N_G}\right) \times 100\%$$
where $N_P$ represents the number of eggplant fruits counted by the system in the video and $N_G$ denotes the number of eggplant fruits obtained through manual counting. This metric directly reflects the model’s accuracy in yield estimation and counting. Provided the value remains positive, a result closer to 100% indicates better counting performance.
Relative Error serves as a supplementary metric for evaluating the degree of counting deviation. It expresses the overall deviation of the counting result as a percentage of the manual count, with lower values indicating higher counting accuracy. It is calculated as follows:
$$\text{Relative Error} = \frac{|N_P - N_G|}{N_G} \times 100\%$$
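A short numerical sketch of the two counting metrics follows. It assumes that Relative Error is |NP − NG| / NG and that Counting Accuracy is its complement, which is consistent with the remark that the accuracy is only meaningful while it stays positive. The ground-truth count of 60 is a hypothetical value chosen for illustration.

```python
def relative_error(n_pred: int, n_gt: int) -> float:
    """Relative counting error in percent (assumed form: |NP - NG| / NG)."""
    return abs(n_pred - n_gt) / n_gt * 100.0


def counting_accuracy(n_pred: int, n_gt: int) -> float:
    """Counting accuracy in percent (assumed form: 100 minus the relative error)."""
    return 100.0 - relative_error(n_pred, n_gt)


N_GT = 60  # hypothetical manual count, for illustration only
for n_pred in (50, 132):
    print(n_pred,
          round(relative_error(n_pred, N_GT), 1),
          round(counting_accuracy(n_pred, N_GT), 1))
```

With these assumed values, a count of 50 gives a relative error of 16.7% and an accuracy of 83.3%, whereas a count of 132 overshoots by more than 100% and drives the accuracy negative.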
3. Results and Analysis
3.1. Lightweight Backbone Module Evaluation
To identify the optimal lightweight backbone network for eggplant detection in complex greenhouse environments, this study conducted comprehensive comparative experiments under identical conditions with several advanced lightweight networks. As summarized in Table 1, based on the original YOLOv5s model, the backbone was replaced with GhostNet, ShuffleNetV2, EfficientNetV2, and MobileNetV3.
The results show that replacing the backbone significantly reduces the parameter count, computational cost, and model size compared with the original YOLOv5s. Among all variants, the MobileNetV3-based model delivers the best overall efficiency, reducing parameters, FLOPs, and model size by 52.4%, 63.6%, and 51.4%, respectively. GhostNet achieves comparable accuracy but requires 45.0% more parameters and 31.7% higher FLOPs than MobileNetV3. ShuffleNetV2 is the most lightweight, but slightly reduces detection precision. EfficientNetV2 is computationally efficient, yet its accuracy remains marginally lower than MobileNetV3. Overall, MobileNetV3 offers the best trade-off between accuracy and complexity, making it a suitable backbone for subsequent improvements.
3.2. Ablation Study
To assess the effect of adding the ECA attention module and C2f blocks after lightweighting, we conduct a series of ablation experiments comparing different module combinations. The results are summarized in Table 2.
As shown in Table 2, replacing the backbone with MobileNetV3 reduces parameters from 7.06 M to 3.36 M and FLOPs from 16.5 GFLOPs to 6.0 GFLOPs, achieving a substantially lighter model. These results demonstrate that MobileNetV3 effectively reduces model complexity.
Adding ECA to the original YOLOv5s improves detection accuracy: Precision increases from 95.5% to 98.3%, and mAP@0.5 rises from 97.0% to 99.6%. This indicates that ECA strengthens channel-wise feature discrimination and helps the model focus on small eggplant targets. Replacing C3 with C2f also yields clear gains, reaching 98.3% Precision, 99.5% Recall, and 99.6% mAP@0.5. However, this modification increases model complexity, with parameters and FLOPs rising to 8.09 M and 18.6 GFLOPs, respectively, exceeding the baseline YOLOv5s. The results suggest that C2f improves multi-scale feature fusion through richer shortcut paths and multi-branch bottlenecks, which benefits detection in dense fruit distributions.
When both C2f and ECA are added to the original YOLOv5s, Precision, Recall, and mAP@0.5 increase to 98.7%, 99.4%, and 99.7%, respectively, suggesting a synergistic improvement. ECA enhances attention to small targets, while C2f strengthens multi-scale feature fusion via multi-branch bottlenecks and denser shortcut connections. However, this configuration substantially increases parameters and FLOPs, limiting its suitability for lightweight deployment.
In this work, we integrate ECA and C2f into a MobileNetV3-based lightweight model. Although Precision and Recall decrease slightly, mAP@0.5 remains high at 99.2%. Meanwhile, parameters and FLOPs are limited to 4.45 M and 8.1 GFLOPs, respectively, which are substantially lower than the original YOLOv5s. These results confirm that MobileNetV3 provides an efficient backbone for lightweight feature extraction, allowing ECA and C2f to improve feature representation without excessive computational overhead.
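The relative savings implied by the parameter and FLOP figures quoted in this section can be verified with a few lines of arithmetic. The helper name is hypothetical, and the input values are the ones stated in the text.

```python
def reduction_pct(baseline: float, new: float) -> float:
    """Percentage reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline * 100.0


# Original YOLOv5s: 7.06 M parameters, 16.5 GFLOPs.
# MobileNetV3 backbone only: 3.36 M parameters, 6.0 GFLOPs.
print(round(reduction_pct(7.06, 3.36), 1), round(reduction_pct(16.5, 6.0), 1))  # 52.4 63.6

# Final model (MobileNetV3 + ECA + C2f): 4.45 M parameters, 8.1 GFLOPs.
print(round(reduction_pct(7.06, 4.45), 1), round(reduction_pct(16.5, 8.1), 1))  # 37.0 50.9
```

The first line reproduces the 52.4% and 63.6% reductions reported for the MobileNetV3 backbone. The second shows that the final model still uses about 37% fewer parameters and about 51% fewer FLOPs than the original YOLOv5s.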
Overall, the ablation study shows that MobileNetV3 is the primary contributor to model lightweighting, while ECA and C2f provide complementary gains in feature discrimination and multi-scale fusion, particularly for dense and small eggplant targets. Since the counting module uses detected bounding boxes as counting instances, maintaining high detection accuracy is essential for reliable counting. The final design achieves a good trade-off between accuracy and efficiency, supporting deployment in resource-constrained environments.
As shown in Figure 11a, the improved YOLOv5s achieves higher precision than the baseline shortly after training begins and converges to a higher, more stable level. Figure 11b shows that the improved model maintains consistently higher mAP@0.5 throughout training and converges faster. These results indicate that the proposed YOLOv5s not only improves detection accuracy but also provides a more stable input for subsequent DeepSORT tracking, supporting reliable counting in video.
3.3. Comparative Experiments of Different Models
To evaluate overall performance, we compare the proposed model with several recent detectors, including YOLOv5m, YOLOv8s, YOLOv9s, YOLOv10n, and YOLOv11n. As shown in Table 3, the improved YOLOv5s achieves 97.8% precision, outperforming YOLOv5m, YOLOv8s, YOLOv9s, YOLOv10n, YOLOv11n, and the original YOLOv5s by 1.7, 0.9, 0.8, 2.2, 1.8, and 2.3 percentage points, respectively. In addition, the improved YOLOv5s is highly efficient. Compared with YOLOv5m, YOLOv8s, and YOLOv9s, it reduces parameters by 78.9%, 59.9%, and 38.2%, and lowers FLOPs by 84.0%, 70.3%, and 69.9%, respectively. Its mAP@0.5 is comparable to that of YOLOv8s and 0.2 percentage points higher than that of YOLOv9s, while requiring substantially fewer resources. Compared with YOLOv10n and YOLOv11n, it uses slightly more parameters and FLOPs but achieves higher precision, making it better suited for greenhouse fruit detection where accuracy is critical.
3.4. Qualitative Analysis
In the visualization experiments, all models were tested under the same settings, with the confidence threshold fixed at 0.1. To evaluate robustness in real greenhouse scenes, we selected five representative images for qualitative comparison, covering five typical conditions: Sunny, Dark, Motion Blur, Small Object, and Branch and Leaf Occlusion. The selected images include small, medium, and large fruits, enabling a comprehensive assessment under complex visual disturbances.
As shown in Figure 12, in the Sunny scenario, the ground-truth number of eggplant fruits in the image is three. The improved YOLOv5s detects all three fruits with higher confidence than the other models. In the Dark scenario, the ground-truth number of fruits is five. The other models miss targets to varying degrees, whereas the improved YOLOv5s detects all five fruits, demonstrating stronger robustness under low-light conditions.
In the Motion Blur scenario, blur is introduced to simulate non-uniform movement of the inspection platform. Except for the improved YOLOv5s, all models exhibit missed detections. In the Small Object scenario, the baseline models struggle to detect small-scale fruits, whereas the improved YOLOv5s achieves more complete detections with higher confidence, benefiting from the ECA attention mechanism.
In the Branch and Leaf Occlusion scenario, severe occlusion and background clutter increase confusion between fruits and non-target regions. YOLOv10n, YOLOv11n, and YOLOv5s produce false positives, and YOLOv9s misses targets, while YOLOv5m, YOLOv8s, and the improved YOLOv5s correctly detect all fruits. Overall, results across the five scenarios show that the improved YOLOv5s provides stronger robustness and generalization in complex greenhouse environments, confirming the effectiveness of the proposed improvements.
3.5. Comparative Analysis of Multi-Object Tracking Algorithms for Eggplant Fruits
To evaluate the proposed counting-zone strategy, we selected three independent test videos, V01–V03, for counting experiments. Each video lasts approximately 2–3 min and includes different lighting conditions, occlusion levels, and fruit densities.
As shown in Table 4, we compare the proposed tracking-based counting method with SORT, the original DeepSORT, and ByteTrack to evaluate its robustness in complex greenhouse environments. For a fair comparison, all methods use the same detections produced by the improved YOLOv5s as input. The final count for each video is defined as the ID of the last tracked eggplant fruit and is displayed in the upper left corner of the frame. For V01, V02, and V03, the proposed method produces counts of 50, 56, and 55, which are the closest to the ground-truth values among all methods. This indicates that the proposed approach maintains more stable trajectories and reduces frequent ID switches in challenging greenhouse scenes.
In contrast, SORT, DeepSORT, and ByteTrack generate many redundant IDs. As shown in Figure 13, in V01, the proposed method counts 50 fruits, whereas SORT, DeepSORT, and ByteTrack output 113, 132, and 211, respectively. The proposed method is closest to the ground truth and yields the lowest relative error of 16.7%. In V02, ByteTrack produces 223 IDs, which is 3.3 times the true number of fruits, while DeepSORT and SORT output 145 and 113 IDs, corresponding to 2.1 and 1.7 times the true number, respectively. In V03, the proposed method counts 55 fruits and achieves 94.8% counting accuracy, substantially outperforming the three baseline trackers. These results indicate that conventional trackers struggle to preserve identities under leaf occlusion and highly similar fruit appearances, leading to severe ID inflation in complex greenhouse scenes.