Lightweight SCD-YOLOv5s: The Detection of Small Defects on Passion Fruit with Improved YOLOv5s

Zhou, Yu; Li, Zhenye; Xue, Sheng; Wu, Min; Zhu, Tingting; Ni, Chao

doi:10.3390/agriculture15101111

by Yu Zhou, Zhenye Li, Sheng Xue, Min Wu, Tingting Zhu and Chao Ni *

College of Mechanical and Electronic Engineering, Nanjing Forestry University, Nanjing 210037, China

* Author to whom correspondence should be addressed.

Agriculture 2025, 15(10), 1111; https://doi.org/10.3390/agriculture15101111

Submission received: 21 April 2025 / Revised: 19 May 2025 / Accepted: 20 May 2025 / Published: 21 May 2025

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

Accurate detection of surface defects on passion fruits is crucial for maintaining market competitiveness. Numerous small defects present significant challenges for manual inspection. Recently, deep learning (DL) has been widely applied to object detection. In this study, a lightweight neural network, StarC3SE-CBAM-DIoU-YOLOv5s (SCD-YOLOv5s), is proposed based on YOLOv5s for real-time detection of tiny surface defects on passion fruits. Key improvements are introduced as follows: the original C3 module in the backbone is replaced by the enhanced StarC3SE module to achieve a more efficient network structure; the CBAM module is integrated into the neck to improve the extraction of small defect features; and the CIoU loss function is substituted with DIoU-NMS to accelerate convergence and enhance detection accuracy. Experimental results show that SCD-YOLOv5s performs better than YOLOv5s, with precision increased by 13.2%, recall by 1.6%, and F₁-score by 17.0%. Additionally, improvements of 6.7% in mAP@0.5 and 5.5% in mAP@0.95 are observed. Compared with manual detection, the proposed model enhances detection efficiency by reducing errors caused by subjective judgment. It also achieves faster inference speed (26.66 FPS), and reductions of 9.6% in parameters and 8.6% in weight size, while maintaining high detection performance. These results indicate that SCD-YOLOv5s is effective for defect detection in agricultural applications.

Keywords: passion fruit; defect detection; YOLOv5s; attention mechanism

1. Introduction

Passion fruit is a tropical fruit mainly cultivated in the southeastern regions of China. Due to its unique nutritional value [1,2,3], it has become one of the most popular healthy fruits today [4]. However, factors such as the growing environment and pests and diseases [5,6,7] cause numerous small defects on the surface of passion fruits, such as decay and cracks. The presence of decayed areas on the surface of passion fruits not only affects sales but can also impact people’s health [8,9,10]. Therefore, defect detection on passion fruit surfaces is essential. Traditionally, these passion fruits with surface defects are inspected manually. Manual inspection methods are not only inefficient but also heavily influenced by subjective judgment, which leads to high rates of false detections and missed detections, ultimately resulting in decreased passion fruit sales [11,12,13,14,15,16].

In recent years, RGB detection methods have utilized traditional cameras and computer vision algorithms, offering a more cost-effective and flexible approach that has been adopted by researchers for practical fruit inspection tasks. Henila and Chithra [17] proposed a fuzzy clustering-based threshold segmentation (FCBT) method to detect defective areas on the surface of apples. Siridhara et al. [18] proposed using the K-means clustering algorithm combined with Otsu’s thresholding method to detect surface defects in fruits and vegetables. Chithra and Henila [19] proposed a global thresholding algorithm to extract defective regions in apples, followed by K-means clustering and median filtering to detect defects in the samples.

Through an investigation of RGB detection methods, we found that these approaches rely on manual feature extraction, making it challenging to handle complex shapes and textures. Additionally, they are significantly affected by environmental factors, leading to unstable detection accuracy. In contrast, defect detection based on DL algorithms demonstrates greater adaptability, making it more applicable in practice. Shruthi and Nagaveni [20] proposed TomVIINet for the accurate and efficient diagnosis of tomato diseases with varying severity levels, and their model demonstrated an impressive 96.91% accuracy in identifying tomato diseases. He et al. [21] integrated a dynamic selective attention mechanism into the backbone of a convolutional neural network (CNN), achieving a precision of 93% in the detection of passion fruit diseases and pests. Nithya et al. [22] employed a CNN to detect defects on mango surfaces. Zhou et al. [23] developed a model based on Wide Residual Network (WideResNet) for the detection of surface defects in green plums. Zhou et al. [24] proposed an improved model based on VGG16 by incorporating the SWA optimizer and the W-Softmax loss function to detect various surface defects of green plums. Fan et al. [25] employed a CNN approach to detect defects in apples. Da Costa et al. [26] used a ResNet classifier to detect surface defects in tomatoes. Nur Alam et al. [27] proposed a method based on deep convolutional neural networks (DCNN) for the effective detection of surface imperfections in apples. Xie et al. [28] proposed an efficient carrot crack segmentation model based on ResNet-50 and U-Net using deep learning algorithms. Additionally, they employed a multi-objective genetic algorithm to fit the crack regions, achieving a segmentation accuracy of 98.48%. Han et al. [29] proposed a neural network based on Mask Region-based Convolutional Neural Network (Mask R-CNN) for identifying apple fruits infected by pests. Wang et al. [30] used Faster Region-based Convolutional Neural Network (Faster R-CNN) to detect young tomatoes affected by pests and diseases.

Our analysis indicates that conventional neural network approaches suffer from several limitations, including slow processing and response times [23,30], insufficient and unbalanced defect sample distributions [22,25], limited detection accuracy [29], and high computational resource demands. These drawbacks hinder their suitability for routine fruit surface defect monitoring in practical agricultural applications. In contrast, the You Only Look Once (YOLO) series [31] has been recognized for its innovative single-stage detection network design, which combines both efficiency and accuracy in object detection. This architecture has gained widespread adoption in real-time fruit inspection applications, owing to its ability to detect small targets with high precision and operational efficiency. Liu et al. [32] proposed an improved YOLOv5s-based network model to perform quality grading of passion fruits, achieving an accuracy of 95.36%. Lu et al. [33] proposed an improved YOLO model, enabling the detection of tiny defects on citrus fruits, achieving a mean Average Precision (mAP) of 98.7%. Yao et al. [34] proposed an improved YOLOv5 network, enabling accurate and rapid detection of surface defects in kiwifruit, achieving a mAP of 94.7%. Huang et al. [35] developed an improved YOLOv5m network model to detect surface defects on tomatoes, achieving a detection accuracy of 89.93%. Obsie et al. [36] improved the YOLOv5 network to identify disease-infected blueberries, achieving an accuracy of 96.30%. Wang et al. [37] proposed an improved YOLOv5 network for postharvest defects detection of Pyrus pyrifolia Nakai pears, with a mAP@0.5 of 93.9%, a detection speed of 454.5 frames per second (FPS). Yang et al. [38] introduced an enhanced YOLOv5s model for the automatic detection of defects such as blue spots, blemishes, and peeling on finished cigars, with an improvement of 2.69% at the mAP@0.5 compared with YOLOv5. 
Although YOLO-based networks have demonstrated high accuracy in detecting surface defects on fruits, the complex architecture [34] results in a relatively slow detection speed, requiring approximately 0.1 s per image. For real-world production line applications, the inference speed of the detection model is a critical performance indicator. To address this limitation, ongoing research has focused on optimizing the network architecture by reducing parameters and the model’s weight size, while maintaining detection precision after optimization.

For instance, Li et al. [39] proposed a lightweight YOLOv5-based network model for the detection of surface diseases and pests on passion fruit, achieving a detection accuracy of 96.51%. Chen et al. [40] proposed to compress the YOLOv8 model by Slimming Pruning and Class Weight Distillation (CWD); this approach not only makes the network lightweight but also enhances the detection capability for small-scale diseases in passion fruit. Lv et al. [41] integrated the lightweight EfficientNetV2 module into the YOLOv5 architecture for the detection of surface defects in apples, resulting in a 20% reduction in model size while maintaining effective detection performance. Xiao et al. [42] incorporated the lightweight network GhostNetV1 into the YOLOv7 network to detect surface diseases in lychees, reducing the network parameters from 36.5 million (M) to 7.8 M. Liu and Yasenjiang [43] proposed a method by replacing the YOLOv8 backbone with MobileOneNet to simplify the YOLOv8 network model for detecting surface defects in apples. Xu and Su [44] enhanced the YOLOv7-tiny network for detecting surface defects in kiwifruit by replacing certain Efficient Layer Aggregation Network (ELAN) modules in the backbone with an optimized C3-Ghost lightweight network structure. Yu and Fu [45] proposed integrating an optimized MobileNetV3 architecture into the backbone of the YOLOv5s model, which reduced the model’s number of parameters and computational complexity while improving the detection of multiple defect types in apples. Zhang et al. [46] introduced a coordinate attention mechanism into the lightweight YOLOv5 network, enabling accurate localization and recognition of densely packed dragon fruits. Wu et al. [47] introduced the Context Anchor Attention (CAA) module into the lightweight YOLOv8 network, resulting in a detection accuracy of 92%. Sekharamantry et al. [48] introduced a novel loss function into the YOLOv5 network, achieving accurate bounding boxes and significantly improving the detection accuracy of apples. Chen et al. [49] replaced the Complete Intersection over Union (CIoU) loss function with the SCYLLA-IoU (SIoU) loss function in the YOLOv7 network, improving the model’s accuracy in recognizing cherry tomatoes and achieving a recognition precision of 86.6%. Qiu et al. [50] introduced Wise-IoU (WIoU) as a replacement for the standard loss function in YOLOv8n, achieving an 85.2% detection accuracy in identifying defects in dragon fruit.

Although several studies have demonstrated notable success in detecting fruit surface defects using various improved YOLO-based models, limitations remain in terms of model lightweighting and feature extraction capabilities. For instance, ref. [42] effectively reduces model complexity through pruning techniques; however, this comes at the cost of reduced recall accuracy. While ref. [45] achieves a degree of model lightweighting, the improvement is relatively limited. Moreover, ref. [46] increases computational overhead and fails to capture long-range dependencies effectively. Additionally, the detection accuracy of the model proposed in [50] is lower than that of other YOLO-based architectures.

To address these issues and further improve the performance of lightweight models for small-object detection, particularly for routine on-farm inspection of passion fruit surface defects, this paper proposes an improved YOLOv5s network. This study introduces the following key innovations:

(1) The C3 module in the YOLOv5s backbone network was replaced with the StarC3SE module to reduce parameters and accelerate detection speed, simultaneously enhancing feature extraction for small defects on passion fruit surfaces.

(2) The Convolutional Block Attention Module (CBAM) was incorporated into the neck of the network to enhance the model’s ability to extract features of small defects on the surface of passion fruit, mitigating the risk of missing such defects due to their small size.

(3) The loss function in YOLOv5 was replaced with the Distance-IoU non-maximum suppression (DIoU-NMS) function at the prediction stage, which optimizes the prediction results and accelerates the model’s convergence while suppressing duplicate bounding boxes during inference, so that only the best detection results are retained.

2. Materials and Methods

2.1. Data Acquisition

This study utilized three Hikvision industrial cameras (model MV-CE120–10GC, Hikvision, Hangzhou, China) equipped with MVL-HF0824–10 MP (Hikvision, Hangzhou, China) lenses featuring 8 mm focal lengths and 10 MP resolution to capture images of passion fruit samples for detecting minor surface defects. The cameras were installed on a U-shaped conveyor belt at a farm in Quanzhou, Fujian Province, where samples were collected in 2023. All three cameras were enclosed within a dark box to weaken the influence of external lighting on image clarity, as shown in Figure 1a. After harvesting, the passion fruits passed through a designated section, where the cameras captured images to determine surface defects. As shown in Figure 1b,c, the cameras were positioned at the top of the conveyor belt and on both sides of the conveyor’s movement direction. To ensure uniform and soft illumination across the fruit surfaces, two diffuse LED lights were installed on either side of the dark box. The image acquisition process employed an optoelectronic triggering method, resulting in a dataset of 1157 RGB images. Among these, 464 images corresponded to passion fruits with minor surface defects, while the remaining 693 images represented defect-free samples. Figure 2 provides examples of the acquired images, where Figure 2a shows a sample without defects, and Figure 2b,c depicts samples with biological decay and structural cracks, respectively.

2.2. Dataset Preparation and Augmentation

The size and quality of the training set play a critical role in determining the final performance of the model. Increasing the quantity of training samples, while simultaneously enhancing the quality of the dataset, contributes significantly to improvements in model accuracy and generalization. To simulate diverse environmental conditions and account for rare or complex cases encountered in real-world scenarios, various data augmentation techniques were applied to the 1157 original images as summarized in Table 1, ultimately resulting in a total of 2314 passion fruit images. The dataset was annotated using Roboflow, as shown in Figure 3. Two annotation categories were defined: ‘PF’ (representing the passion fruit) and ‘defect’ (representing surface defects on the fruit). Following the annotation process, 927 images were identified as containing defects, while the remaining 1387 images represented normal passion fruits. To ensure a balanced and reliable evaluation, the labeled dataset was split into training, validation, and testing sets according to a 7:2:1 ratio with random grouping. These steps enhanced the dataset’s diversity and quality, ensuring its suitability for training a robust model. Consequently, the resulting dataset effectively captures a wide range of conditions, enabling the model to achieve improved robustness and accuracy in defect detection tasks.
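The random 7:2:1 grouping described above can be illustrated with a minimal sketch. The file names, the fixed seed, and the choice to assign the rounding remainder to the test set are assumptions made for illustration; the exact split sizes used in the study are not stated in the text.

```python
import random


def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle items and split them into train/val/test sets by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # rounding remainder goes to the test set
    return train, val, test


# Hypothetical file names standing in for the 2314 augmented images.
images = [f"pf_{i:04d}.jpg" for i in range(2314)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))
```

With integer rounding, the three subsets only approximate the 7:2:1 ratio; a stratified split by class would be a reasonable alternative if the defect and normal images need to be balanced in every subset.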

2.3. Overview of the Experimental Workflow

In this study, RGB images of harvested passion fruits were captured using industrial cameras. The acquired dataset was then augmented through various data enhancement techniques to improve its diversity and robustness. Following the augmentation process, the annotated dataset was partitioned into training, validation, and test sets according to a predefined ratio. Subsequently, a modified YOLOv5s network was employed to perform training and real-time detection of small surface defects. The overall workflow of the proposed method is illustrated in Figure 4.

2.4. The Proposed SCD-YOLOv5s Network

2.4.1. Overview of the YOLOv5 Model

YOLOv5 is a widely recognized single-stage object detection algorithm, known for its lightweight architecture and efficient inference speed. The YOLOv5 network comprises three primary components: the backbone, responsible for feature extraction; the neck, which facilitates feature fusion; and the head, which generates bounding boxes and classifies objects. YOLOv5 is available in five variants (YOLOv5n to YOLOv5x), differentiated by increasing network depth and corresponding parameters. As the network depth increases, the number of parameters and computational complexity also rise. Considering the constraints of cost and the laboratory environment, this study selects the YOLOv5s variant due to its balance between inference speed and accuracy, making it well-suited for resource-constrained applications. However, the original YOLOv5s architecture may face challenges in detecting minute defects on the surface of passion fruit due to limited feature extraction capabilities. To overcome this limitation, an enhanced YOLOv5s model is proposed to address the detection of small surface defects. The architecture of the modified YOLOv5s network, which incorporates these enhancements, is illustrated in Figure 5.

2.4.2. Enhanced Backbone with StarC3SE Module

SENet Channel Attention Mechanism

Squeeze-and-Excitation Network (SENet) [51] consists of three main components: Transformation, Squeeze, and Excitation. This module, through self-learning, adaptively assigns weights to different channels. Due to its lightweight design, SENet is an excellent choice to incorporate into the StarBlock module as an attention mechanism. It enables the capture of fine-grained details and enhances the detection of rotten regions in passion fruit. The structure of the Squeeze-and-Excitation Network (SENet) is illustrated in Figure 6.

In Figure 6, the input feature map $FM_{X} \in R^{H' \times W' \times C'}$ undergoes a standard convolution operation, denoted as $F_{tr}$, to produce the feature map $FM_{U_c} \in R^{H \times W \times C}$. This feature map is then processed through global average pooling and adaptive recalibration, resulting in the channel weights $s \in R^{1 \times 1 \times C}$; subsequently, $FM_{U_c}$ and $s$ undergo channel-wise multiplication to generate the final output feature map $FM_{\tilde{X}}$.

The first part, the Transformation operation, is a standard convolution that convolves the input feature map $FM_{X}$ with the parameters $V_c$, where $V_c$ represents the c-th filter in the set of filter kernels $V = [V_1, V_2, \dots, V_C]$. This operation produces the feature map $FM_{U_c}$. The equation corresponding to the Transformation operation is shown below.

FM_{U_c} = V_c * FM_{X}

(1)

V_c * FM_{X} = \sum_{s = 1}^{C'} V_c^{s} * FM_{X^{s}}

(2)

where $FM_{X^{s}}$ represents the s-th channel of the input feature map and $V_c^{s}$ is a 2D spatial kernel applied to the corresponding channel of the input feature map $FM_{X}$ in the convolution operation.

The Squeeze operation, which involves global average pooling, compresses $FM_{U_c} \in R^{H \times W \times C}$, containing global information, into the channel descriptor $z \in R^{1 \times 1 \times C}$, whose c-th element is $z_c$. The corresponding equation for this operation is shown below.

z_c = F_{sq}(FM_{U_c}) = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} FM_{U_c}(i, j)

(3)

The Excitation part employs a gating mechanism composed of two fully connected layers. The first fully connected layer reduces the number of channels from C to $\frac{C}{r}$, minimizing computational effort; a ReLU activation function is then applied, and the second fully connected layer restores the number of channels to C. The weights $s \in R^{1 \times 1 \times C}$ are then obtained through Sigmoid activation. The corresponding equation is shown below.

s = F_{ex}(z, W) = Θ(g(z, W)) = Θ(W_{2} δ(W_{1} z))

(4)

where $Θ$ is the Sigmoid function, $δ$ is the ReLU function, $W_{1} \in R^{\frac{C}{r} \times C}$ and $W_{2} \in R^{C \times \frac{C}{r}}$ are the weights of the two fully connected layers, and r is the dimensionality reduction ratio.

The final step is the Scale operation, which applies the previously computed attention weights to the features channel by channel. This introduces attention in the channel dimension by multiplying the original feature map $FM_{U_c}$ by the weights $s$, resulting in the final feature map $FM_{\tilde{X}}$. The corresponding equation is shown below.

{F M}_{\tilde{X}} = F_{s c a l e} ({F M}_{U_{c}}, s)

(5)

where $F_{scale}$ denotes the channel-wise multiplication of each channel of the feature map $FM_{U_c}$ by its corresponding weight.
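The Squeeze, Excitation, and Scale steps can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: the feature map is a nested list of C channels, the two fully connected layers are given as hypothetical weight matrices `w1` (C/r rows, C columns) and `w2` (C rows, C/r columns) without biases, and the preceding convolution is omitted.

```python
import math


def se_block(fm, w1, w2):
    """Apply squeeze-and-excitation to fm, a list of C channels (each H x W)."""
    # Squeeze: global average pooling gives one descriptor value per channel.
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fm]
    # Excitation, first FC layer (C -> C/r) followed by ReLU.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    # Excitation, second FC layer (C/r -> C) followed by sigmoid.
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w2]
    # Scale: multiply every value of channel c by its weight s[c].
    out = [[[s_c * v for v in row] for row in ch] for ch, s_c in zip(fm, s)]
    return out, s


# Toy input: C = 2 channels of size 2 x 2, reduction ratio r = 2.
fm = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
w1 = [[1.0, 1.0]]        # hypothetical weights, shape (C/r) x C
w2 = [[1.0], [-1.0]]     # hypothetical weights, shape C x (C/r)
out, s = se_block(fm, w1, w2)
print(s)
```

In this toy case the first channel receives a weight close to 1 and the second a weight close to 0, which shows how the block can emphasize informative channels and suppress others.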

At present, lightweight YOLO network models often encounter challenges such as insufficient feature extraction for small defects and unstable detection accuracy. To address these limitations, we propose the StarC3SE module, which simultaneously reduces network parameters and enhances feature extraction capabilities for detecting small defects on passion fruits.

StarBlock

The StarBlock module, shown in Figure 7a, is designed to enhance feature representation in deep neural networks, particularly for object detection tasks. This module employs a multi-branch architecture to process input features through multiple parallel convolutional layers, each with different kernel sizes, enabling the capture of spatial information at various scales. Specifically, the input feature map (FM), denoted as $FM_{in}$, is first convolved using a kernel size of 7 to generate an intermediate feature map $FM_{inter_1}$. Subsequently, $FM_{inter_1}$ is processed through two parallel 2D convolutional layers, resulting in $FM_{inter_{2-1}}$ and $FM_{inter_{2-2}}$. The $FM_{inter_{2-1}}$ output is then activated using the ReLU6 function and undergoes an element-wise multiplication with $FM_{inter_{2-2}}$, producing $FM_{inter_3}$. Finally, $FM_{in}$ and $FM_{inter_3}$ are regularized and summed element-wise, generating the output feature map for the subsequent convolution operation. This multi-scale processing design allows the StarBlock to effectively capture and integrate features at varying spatial resolutions.
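The core of the StarBlock, the element-wise multiplication of a ReLU6-activated branch with a second branch followed by a residual sum, can be illustrated for a single pixel's channel vector. This is a deliberately simplified sketch: the initial 7 × 7 convolution and the normalization step are omitted, the two parallel convolutions are assumed to act as 1 × 1 channel-mixing matrices, and the weights `w_a` and `w_b` are hypothetical.

```python
def relu6(x):
    """ReLU6 activation: clamp x to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def star_operation(x, w_a, w_b):
    """Compute x + relu6(w_a x) * (w_b x) for one channel vector x."""
    branch_a = [relu6(v) for v in matvec(w_a, x)]   # activated branch
    branch_b = matvec(w_b, x)                        # linear branch
    star = [a * b for a, b in zip(branch_a, branch_b)]
    return [x_i + s_i for x_i, s_i in zip(x, star)]  # residual sum


x = [1.0, -2.0]
w_a = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical branch weights
w_b = [[2.0, 0.0], [0.0, 2.0]]
print(star_operation(x, w_a, w_b))
```

Because the activated branch gates the linear branch multiplicatively, channels whose activated response is zero pass through unchanged via the residual connection.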

StarC3SE Module

In the passion fruit dataset, many small decayed regions were observed, posing significant challenges for effective defect detection. To overcome these challenges, we integrate the SENet module into the StarBlock module, creating the StarSE block, as illustrated in Figure 7b. This enhancement allows the network to focus more on essential features critical for defect detection. The improved StarC3SE module, shown in Figure 7c, builds upon the StarSE module by embedding it within the C3 framework of YOLOv5, further enhancing feature extraction for small target defects.

2.4.3. Enhanced Neck with CBAM Module

In the detection of small defects on passion fruits, the defect regions are relatively small. The Convolutional Block Attention Module (CBAM) [52] is capable of extracting useful features from images in both spatial and channel dimensions while discarding irrelevant information. Therefore, the CBAM module was incorporated into the YOLOv5s network to enhance the detection performance of passion fruit defects.

The CBAM module, a novel attention mechanism, comprises a Channel Attention Module (CAM) and a Spatial Attention Module (SAM). The CBAM module aims to enhance the representational capacity of neural networks by simultaneously focusing on both channel and spatial dimensions, thereby capturing interdependencies between feature maps. This dual focus enables the model to better capture relationships between features, thereby improving feature selection and fusion. In the YOLOv5s network, CBAM enhances the network’s ability to detect complex patterns in input images by dynamically adjusting the importance of different feature channels. The structure of CBAM is illustrated in Figure 8.

As illustrated in Figure 8, the feature map $F \in R^{H \times W \times C}$ is first passed through the CAM module to produce the channel attention map $FM_{CAM}(F) \in R^{1 \times 1 \times C}$. The modified feature map $F' \in R^{H \times W \times C}$ is obtained by combining $FM_{CAM}(F)$ with the original feature map $F$, as defined in Equation (6). Next, $F'$ is processed by the SAM module, generating the spatial attention map $FM_{SAM}(F') \in R^{H \times W \times 1}$, which is again combined with $F'$, as defined in Equation (7), to derive the final feature map $F'' \in R^{H \times W \times C}$.

F^{'} = F \otimes F M_{C A M} (F)

(6)

F^{″} = F^{'} \otimes F M_{S A M} (F^{'})

(7)

In the CAM module, the input $F \in R^{H \times W \times C}$ undergoes global maximum pooling and global average pooling operations to obtain the feature maps $F_{Max}^{C} \in R^{1 \times 1 \times C}$ and $F_{Avg}^{C} \in R^{1 \times 1 \times C}$, respectively. These feature maps are then passed through the same multilayer perceptron (MLP) to generate $F_{Max-MLP}^{C}$ and $F_{Avg-MLP}^{C}$. Next, ${F'}_{Max}^{C}$ and ${F'}_{Avg}^{C}$ are derived from the outputs of the MLP. Finally, $FM_{CAM}(F)$ is computed by element-wise summation of these maps, followed by activation through a sigmoid function. The related functions of the CAM module are shown below.

F_{M a x}^{C} = F_{M a x} (F)

(8)

F_{A v g}^{C} = F_{A v g} (F)

(9)

F_{M a x - M L P}^{C} = F_{M a x}^{C} \times F_{M L P}

(10)

F_{A v g - M L P}^{C} = F_{A v g}^{C} \times F_{M L P}

(11)

{F^{'}}_{M a x}^{C} = F_{M a x} (F_{M a x - M L P}^{C})

(12)

{F^{'}}_{A v g}^{C} = F_{A v g} (F_{A v g - M L P}^{C})

(13)

F M_{C A M} (F) = Θ ({F^{'}}_{M a x}^{C} + {F^{'}}_{A v g}^{C})

(14)

where $F_{Max}$ represents maximum pooling, $F_{Avg}$ represents average pooling, $F_{MLP}$ is the output matrix of the MLP, and $Θ$ is the Sigmoid activation function.
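The channel attention computation can be sketched in plain Python. The sketch assumes the standard CBAM formulation, in which the max-pooled and average-pooled descriptors pass through one shared two-layer MLP and the two outputs are summed before the sigmoid; since those outputs are already of size 1 × 1 × C, no further pooling is applied here. The weight matrices `w1` and `w2` are hypothetical and biases are omitted.

```python
import math


def shared_mlp(v, w1, w2):
    """Two-layer MLP with ReLU in between, shared by both pooled descriptors."""
    hidden = [max(0.0, sum(a * b for a, b in zip(row, v))) for row in w1]
    return [sum(a * b for a, b in zip(row, hidden)) for row in w2]


def channel_attention(fm, w1, w2):
    """Return one attention weight per channel of fm (list of C channels, H x W)."""
    max_desc = [max(max(row) for row in ch) for ch in fm]                 # global max pooling
    avg_desc = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fm]  # global average pooling
    m = shared_mlp(max_desc, w1, w2)
    a = shared_mlp(avg_desc, w1, w2)
    # Element-wise sum of the two MLP outputs, then sigmoid.
    return [1.0 / (1.0 + math.exp(-(x + y))) for x, y in zip(m, a)]


fm = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
w1 = [[1.0, 0.0]]       # hypothetical weights, shape (C/r) x C
w2 = [[1.0], [0.0]]     # hypothetical weights, shape C x (C/r)
weights = channel_attention(fm, w1, w2)
print(weights)
```

The resulting weights would then multiply the corresponding channels of the input feature map, as in Equation (6).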

In the SAM module, the FM $F' \in R^{H \times W \times C}$ undergoes maximum pooling and average pooling along the channel axis to produce ${F'}_{Max}^{s}$ and ${F'}_{Avg}^{s}$, respectively. The combined feature map ${F'}_{Max+Avg}^{s} \in R^{H \times W \times 2}$ is then obtained by concatenating the two maps along the channel dimension (denoted by $\oplus$). The combined feature map is processed with a convolution operation using a kernel size of 7, followed by activation with the Sigmoid function to generate $FM_{SAM}(F')$. The corresponding equations for the SAM module are shown below.

{F^{'}}_{M a x + A v g}^{s} = {F^{'}}_{M a x}^{s} \oplus {F^{'}}_{A v g}^{s}

(15)

F M_{S A M} (F^{'}) = Θ (F_{C o n v} ({F^{'}}_{M a x + A v g}^{s}))

(16)

where $F_{Conv}$ denotes a convolutional operation with a convolutional kernel size of 7 × 7.
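A plain-Python sketch of the spatial attention step is given below. It assumes the standard CBAM formulation: the maximum and the mean are taken across channels at every spatial position, the two resulting maps are stacked into an H × W × 2 tensor, and a single zero-padded k × k convolution (k = 7 in the text) reduces them to one H × W map before the sigmoid. The kernel values are hypothetical and the bias is omitted.

```python
import math


def spatial_attention(fm, kernel):
    """fm: list of C channels (H x W). kernel: [k_max, k_avg], each k x k with odd k."""
    c, h, w = len(fm), len(fm[0]), len(fm[0][0])
    # Channel-wise max and mean at each spatial position.
    max_map = [[max(fm[ch][i][j] for ch in range(c)) for j in range(w)] for i in range(h)]
    avg_map = [[sum(fm[ch][i][j] for ch in range(c)) / c for j in range(w)] for i in range(h)]
    k = len(kernel[0])
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            # Convolve the two stacked maps with their kernels (zero padding).
            for plane, ker in zip((max_map, avg_map), kernel):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= x < w:
                            acc += ker[di][dj] * plane[y][x]
            out[i][j] = 1.0 / (1.0 + math.exp(-acc))  # sigmoid
    return out


fm = [[[0.0, 1.0], [2.0, 3.0]], [[4.0, 0.0], [0.0, 0.0]]]
zero = [[0.0] * 7 for _ in range(7)]
center = [row[:] for row in zero]
center[3][3] = 1.0  # hypothetical kernel that passes the max map through unchanged
attention = spatial_attention(fm, [center, zero])
print(attention)
```

Each value of the resulting map would then scale all channels at the corresponding spatial position, as in Equation (7).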

2.4.4. Enhanced Head with DIoU-NMS Loss Function

The loss function is a critical component in evaluating the performance of object detection models, as it directly influences how well a model makes predictions. A well-designed bounding box loss function can significantly improve model accuracy by effectively guiding the learning process. In the YOLOv5 network, the overall loss function comprises three main components: classification loss ($L_{cls}$), objectness loss ($L_{obj}$), and localization loss ($L_{loc}$), which are formulated as follows:

L = L_{c l s} + L_{o b j} + L_{l o c}

(17)

where $L_{cls}$ is the classification loss, $L_{obj}$ represents the objectness loss, and $L_{loc}$ is the localization loss.

Given the spherical shape of passion fruits and the presence of numerous decayed regions along their edges, we propose using the Distance-Intersection over Union Non-Maximum Suppression (DIoU-NMS) instead of the standard loss function to enhance detection accuracy.

IoU measures the overlap between the predicted bounding box and the ground truth box by calculating the ratio of their intersection area to their union area. The formula for calculating IoU is shown below.

IoU = \frac{\text{Area of Intersection}}{\text{Area of Union}} = \frac{|A \cap B|}{|A \cup B|} \in [0, 1]

(18)

where A is the area of the prediction bounding box, and B denotes the area of the ground truth bounding box.
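As a worked illustration of Equation (18), the following sketch computes the IoU of two axis-aligned boxes. The corner format `(x1, y1, x2, y2)` is an assumption made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 2 x 2 boxes overlapping in a 1 x 1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```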

DIoU [53] is an improvement over the standard IoU loss, specifically designed to enhance the localization accuracy of bounding boxes in object detection tasks. Unlike the IoU loss, DIoU incorporates the normalized distance between the center points of the predicted and ground truth boxes. This approach addresses not only the overlap between boxes but also encourages the predicted box to move closer to the center of the ground truth box, resulting in faster and more accurate convergence during training. The DIoU loss is defined below.

L_{DIoU} = 1 - IoU + \frac{σ^{2}(b, b^{gt})}{c^{2}} \in [0, 2)

(19)

where $L_{DIoU}$ represents the DIoU loss function, IoU is the intersection-over-union value between the predicted box and the ground truth box, $σ$ is the distance between the center of the predicted box and the center of the ground truth box, and c is the diagonal length of the smallest enclosing box that can contain both the predicted and ground truth boxes. The expression for $σ$ is shown below.

σ (b, b^{g t}) = \sqrt{{(x_{p r e d i c t} - x_{t r u e})}^{2} + {(y_{p r e d i c t} - y_{t r u e})}^{2}}

(20)

where $(x_{predict}, y_{predict})$ represents the coordinates of the center point of the predicted box, and $(x_{true}, y_{true})$ are the coordinates of the center point of the ground truth box.
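Equations (19) and (20) can be combined into a short worked example. The sketch below assumes boxes in `(x1, y1, x2, y2)` corner format and computes the loss for a single pair of boxes; it is an illustration of the formula rather than the training code used in this study.

```python
def diou_loss(pred, gt):
    """DIoU loss 1 - IoU + d^2 / c^2 for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    # IoU term.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union if union > 0 else 0.0
    # Squared distance between the two box centers.
    d2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2
    return 1.0 - iou + (d2 / c2 if c2 > 0 else 0.0)


# IoU = 1/7, center distance squared = 2, enclosing diagonal squared = 18.
print(diou_loss((0, 0, 2, 2), (1, 1, 3, 3)))
```

For this pair the loss is 1 − 1/7 + 2/18; for two identical boxes both penalty terms vanish and the loss is zero.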

Despite DIoU considering the center point distance in bounding box regression, it does not directly improve the post-processing of detection results.

Non-maximum suppression (NMS) is typically the final step in most object detection algorithms. NMS enhances detection accuracy by retaining the most relevant predicted boxes and eliminating redundant ones—any redundant detection boxes that overlap with the highest-scoring box beyond a certain threshold are discarded. The calculation function of NMS is as follows:

s_{i} = \begin{cases} s_{i}, & IoU(M, B_{i}) < δ \\ 0, & IoU(M, B_{i}) \geq δ \end{cases}

(21)

where $s_{i}$ represents the confidence score of the i-th detection box, $M$ denotes the detection box with the highest confidence score, $B_{i}$ represents the i-th candidate bounding box, which is compared against $M$, and $δ$ is the predefined threshold.

However, traditional NMS only considers the IoU and ignores the distance between the center points of bounding boxes. This limitation can lead to missed detections, especially when objects are dense or adjacent.

To overcome this issue, we introduce the distance measure from DIoU into the NMS algorithm, forming DIoU-NMS, to more effectively remove redundant detection boxes. Since defects on the surface of passion fruits may be close to each other, traditional NMS might mistakenly treat these adjacent defects as a single object, resulting in missed detections. By incorporating the center point distance, DIoU-NMS can better distinguish between these adjacent defects, thereby improving detection accuracy. The calculation function of the DIoU-NMS algorithm is as follows:

s_{i} = \begin{cases} s_{i}, & IoU - R_{DIoU}(M, B_{i}) < δ \\ 0, & IoU - R_{DIoU}(M, B_{i}) \geq δ \end{cases}

(22)

In this formula, the term R_{DIoU}(M, B_{i}) represents the normalized distance between the center points of the two boxes. Subtracting this term from the IoU lowers the suppression criterion for boxes whose centers lie farther from M, so overlapping boxes with clearly separated centers, which may represent separate defects, are less likely to be suppressed. This refinement is particularly beneficial for detecting small and closely spaced defects on the surfaces of passion fruits.
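The effect of the distance term in Equation (22) can be illustrated with a small sketch. It assumes the usual Distance-IoU definition of R_{DIoU} as the squared center distance divided by the squared diagonal of the smallest box enclosing both boxes. The box coordinates, the function names, and the `use_distance` switch are hypothetical and serve only to contrast the two criteria.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2); the corner format is an assumption


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center_distance_penalty(a: Box, b: Box) -> float:
    """R_DIoU: squared center distance over squared enclosing-box diagonal."""
    dx = (a[0] + a[2]) / 2 - (b[0] + b[2]) / 2
    dy = (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2
    cw = max(a[2], b[2]) - min(a[0], b[0])  # enclosing box width
    ch = max(a[3], b[3]) - min(a[1], b[1])  # enclosing box height
    diag2 = cw * cw + ch * ch
    return (dx * dx + dy * dy) / diag2 if diag2 > 0 else 0.0


def diou_nms(boxes: List[Box], scores: List[float],
             delta: float = 0.5, use_distance: bool = True) -> List[int]:
    """Greedy suppression with the criterion of Equation (22)."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)  # M: the highest-scoring remaining box
        keep.append(m)
        survivors = []
        for i in order:
            criterion = iou(boxes[m], boxes[i])
            if use_distance:
                criterion -= center_distance_penalty(boxes[m], boxes[i])
            if criterion < delta:  # B_i keeps its score
                survivors.append(i)
        order = survivors
    return keep


# Hypothetical case: box 1 is an adjacent defect, box 2 a near-duplicate of box 0.
boxes = [(0, 0, 50, 50), (16, 0, 66, 50), (1, 1, 51, 51)]
scores = [0.9, 0.8, 0.7]
print(diou_nms(boxes, scores, use_distance=False))  # [0]    (IoU only)
print(diou_nms(boxes, scores))                      # [0, 1] (IoU - R_DIoU)
```

In this example the adjacent box has an IoU of about 0.515 with the top-scoring box, so the IoU-only rule discards it. After subtracting the distance term (about 0.037) the criterion falls below 0.5 and the box is retained, whereas the near-duplicate is still suppressed.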

By integrating DIoU-NMS into our model, we enhance both the localization accuracy and the post-processing of detection results. This integration leads to improved performance in detecting surface defects on passion fruits, addressing the limitations of traditional NMS and standard loss functions. Our approach ensures that adjacent defects are accurately identified as separate entities, thereby significantly improving the overall detection accuracy.

3. Experiments and Results

3.1. Experimental Environment Configuration and Training Parameters Setting

The experimental environment configuration for detecting small defects in passion fruits is detailed in Table 2, which includes hardware and software settings. The key training parameters of the model, such as learning rate, batch size, and optimizer, are summarized in Table 3. The dataset was split according to the ratio described in Section 2.2. Because the collected images vary in size, all images were resized to 640 × 640 during training to ensure consistency and enhance experimental reliability. The number of training epochs was set to 400 to provide sufficient training iterations, considering the relatively small dataset size and the need to avoid undertraining. The defect detection model was implemented in Python 3.10 using the PyTorch framework, selected for its flexibility and efficiency in developing deep learning models.
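The resizing step can be sketched as follows. The text does not state how images are brought to 640 × 640, so this example assumes an aspect-preserving "letterbox" resize with padding, which is common in YOLO-style pipelines. The function name `letterbox_params` and the sample image size are hypothetical.

```python
from typing import Tuple


def letterbox_params(width: int, height: int, target: int = 640
                     ) -> Tuple[float, Tuple[int, int], Tuple[int, int, int, int]]:
    """Return (scale, resized (w, h), padding (left, top, right, bottom)).

    Assumption: the image is scaled uniformly so that its longer side equals
    `target`, then padded symmetrically to a target x target square.
    """
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x, pad_y = target - new_w, target - new_h
    left, top = pad_x // 2, pad_y // 2
    return scale, (new_w, new_h), (left, top, pad_x - left, pad_y - top)


# A hypothetical 1280 x 960 capture is halved and padded top and bottom.
print(letterbox_params(1280, 960))  # (0.5, (640, 480), (0, 80, 0, 80))
```

Keeping the aspect ratio avoids distorting the shape of small defects. A plain stretch to 640 × 640 would be the alternative if distortion is acceptable.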

3.2. Evaluation Metrics of YOLOv5

This study adopts five key evaluation metrics to evaluate the proposed network’s performance in detecting surface defects on passion fruits: precision, recall, mAP, F_{1}-score, and frames per second (FPS). The corresponding calculation formulas for these metrics are provided as follows:

Precision = \frac{\text{True Positive (TP)}}{\text{True Positive (TP)} + \text{False Positive (FP)}}

(23)

Recall = \frac{\text{True Positive (TP)}}{\text{True Positive (TP)} + \text{False Negative (FN)}}

(24)

AP = \int_{0}^{1} P_{r} \, dr

(25)

mAP = \frac{1}{n} \sum_{i = 1}^{n} AP_{i}

(26)

F_{1}\text{-score} = \frac{2 \times Precision \times Recall}{Precision + Recall}

(27)

The above evaluation metrics are derived based on a region-level assessment framework, consistent with standard object detection protocols. In this context, TP (True Positive) denotes the number of correctly predicted bounding boxes that sufficiently overlap with ground truth annotations, FP (False Positive) refers to predicted boxes that either do not correspond to any annotated object or have insufficient overlap, and FN (False Negative) represents missed detections where no prediction is associated with a labeled object.
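Equations (23) to (27) translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and the integral in Equation (25) is approximated with a simple trapezoidal rule over sampled precision-recall points, whereas evaluation toolkits typically apply an interpolated precision envelope.

```python
from typing import List, Sequence


def precision(tp: int, fp: int) -> float:
    """Equation (23)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Equation (24)."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p: float, r: float) -> float:
    """Equation (27): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def average_precision(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """Equation (25), approximated by the trapezoidal rule (an assumption)."""
    pts = sorted(zip(recalls, precisions))
    ap = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        ap += (r1 - r0) * (p0 + p1) / 2.0
    return ap


def mean_average_precision(aps: List[float]) -> float:
    """Equation (26): mean of the per-category AP values."""
    return sum(aps) / len(aps)


# Hypothetical counts and curve samples, not values from the experiments.
p, r = precision(tp=8, fp=2), recall(tp=8, fn=2)
print(p, r, round(f1_score(p, r), 3))                      # 0.8 0.8 0.8
print(average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5]))  # 0.875
print(mean_average_precision([0.875, 0.625]))               # 0.75
```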

In Equation (26), n denotes the number of defect categories, and AP_{i} represents the average precision for category i in multi-class settings. Specifically, mAP@0.5 indicates that a predicted bounding box is counted as a true positive when its Intersection over Union (IoU) with a corresponding ground truth box exceeds the threshold of 0.5.
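The region-level counting of TP, FP, and FN at an IoU threshold can be sketched as below. The greedy one-to-one matching in descending confidence order is an assumption about the protocol, as are the function names and the sample boxes. Whether a match requires IoU strictly greater than the threshold or greater than or equal to it varies between toolkits, and the sketch uses the latter.

```python
from typing import List, Sequence, Tuple

Box = Sequence[float]  # (x1, y1, x2, y2); the corner format is an assumption


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_detections(preds: List[Tuple[Box, float]], gts: List[Box],
                     iou_thr: float = 0.5) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) for one image and one category."""
    matched = set()  # indices of ground-truth boxes already claimed
    tp = fp = 0
    for box, _conf in sorted(preds, key=lambda p: p[1], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched.add(best_j)  # sufficient overlap: true positive
            tp += 1
        else:
            fp += 1              # no ground truth or insufficient overlap
    fn = len(gts) - len(matched)  # labeled defects with no matching prediction
    return tp, fp, fn


# Hypothetical image with two labeled defects and two predictions.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((1, 1, 11, 11), 0.9), ((50, 50, 60, 60), 0.6)]
print(count_detections(preds, gts))  # (1, 1, 1)
```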

3.3. Results

3.3.1. Comparison of Different Target Detection Models

To further validate the advantages of the YOLO network, the proposed SCD-YOLOv5s model was compared with two-stage object detection algorithms, including R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN. As a single-stage detection algorithm, YOLO is generally faster than two-stage detection algorithms. To ensure a comprehensive evaluation, this comparison therefore focuses on precision, recall, and mAP, as well as model size and giga floating-point operations (GFLOPs). The results of these comparisons are summarized in Table 4. As shown in Table 4, two-stage detection algorithms are capable of detecting surface defects in passion fruits, but their performance is limited compared to the proposed SCD-YOLOv5s model. Specifically, among the two-stage models, Faster R-CNN achieves the highest precision (64.0%), recall (72.12%), and mAP@0.5 (75.67%). Despite these strengths, the overall performance of these algorithms is significantly inferior to that of the SCD-YOLOv5s model. This comparison highlights the superior detection accuracy and efficiency of the proposed network for detecting small surface defects in passion fruits.

3.3.2. Comparison of Different YOLO Network Versions

The defect detection experiments for passion fruits were conducted using the augmented dataset described in Section 2.2. Our proposed SCD-YOLOv5s network was compared with YOLOv7, YOLOv8 (in various sizes), and the latest YOLO11s network. The results are summarized in Table 5. As illustrated in Table 5, the SCD-YOLOv5s model outperforms all compared models in detecting small defects on the surface of passion fruits. Specifically, SCD-YOLOv5s achieves a precision of 95.9%, which is 10.1% higher than YOLOv7x, 7.5% higher than YOLOv8s, and 9.2% higher than YOLO11s. For recall, SCD-YOLOv5s attains 84.7%, representing a 3.8% improvement over YOLOv7x, 8% over YOLOv8n, and 5.8% over YOLO11s. Regarding mAP@0.5, SCD-YOLOv5s achieves 88.4%, surpassing YOLOv7x by 4.2%, YOLOv8n by 9%, and YOLO11s by 4.7%. For the more stringent mAP@0.95 metric, SCD-YOLOv5s reaches 71.2%, outperforming YOLOv7x, YOLOv8s, and YOLO11s by 6.33%, 4.57%, and 1.3%, respectively. Lastly, for the F_{1}-score, SCD-YOLOv5s achieves 93%, which is 12% higher than YOLOv7, 14% higher than YOLOv8n, and 11% higher than YOLO11s. These results demonstrate that the SCD-YOLOv5s network effectively detects small surface defects on passion fruits and achieves superior performance across various metrics compared to the state-of-the-art YOLO-based networks. This validates the robustness and accuracy of SCD-YOLOv5s for agricultural defect detection tasks.

3.3.3. Comparison of Different YOLOv5 Sizes

The performance of the proposed SCD-YOLOv5s model was further evaluated by comparing it with YOLOv5 models of various sizes. The experimental results are summarized in Table 6. As shown in Table 6, the lightweight SCD-YOLOv5s model achieves superior performance across all key metrics in detecting small defects on passion fruits. Specifically, it achieves a precision of 95.9%, which is 13.2%, 12.6%, 10.6%, and 8% higher than YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, respectively. For recall, the SCD-YOLOv5s model attains a value of 84.7%, surpassing YOLOv5s by 1.6%, YOLOv5m by 6.9%, YOLOv5l by 10.2%, and YOLOv5x by 8.1%. The proposed model also achieves the highest values for mAP@0.5 and mAP@0.95, demonstrating its effectiveness in accurately detecting small surface defects. In terms of F_{1}-score, the SCD-YOLOv5s model achieves 93%, outperforming YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x by 17%, 12%, 11%, and 11%, respectively.

Figure 9a compares the detection results between YOLOv5 models with different sizes and SCD-YOLOv5s. When using the YOLOv5s model, one annotated defect was not detected in a randomly selected image, whereas the YOLOv5m, YOLOv5l, and YOLOv5x models successfully detected all annotated defects. However, these models demonstrated lower detection accuracy compared to the proposed SCD-YOLOv5s model. Figure 9b presents two representative cases of misdetection made by the proposed SCD-YOLOv5s model on the passion fruit dataset. In Group 1, the defective region is present in the sample, but the model fails to detect it, resulting in a false negative. In Group 2, the model incorrectly identifies a non-defective area as a defect, producing a false positive. Upon further examination of the dataset and model structure, it was observed that such misclassifications are primarily caused by the unclear or ambiguous definition of defect regions during the annotation process, which introduces subjectivity and inconsistency in the training data, thereby affecting detection accuracy in edge cases.

As illustrated in Figure 10a, the proposed SCD-YOLOv5s model converges noticeably faster during training than the other YOLOv5 variants. This indicates that the model reaches its best performance in fewer epochs, thereby improving training efficiency. Furthermore, as shown in Figure 10b, SCD-YOLOv5s outperforms the other models in terms of mAP@0.5, achieving the highest detection accuracy. Together, the results presented in Figure 10a,b highlight the superior performance of the proposed model in detecting defects on passion fruit.

The superior performance of the SCD-YOLOv5s model is attributed to key architectural improvements. Specifically, the C3 module in the backbone was replaced with the enhanced StarC3SE module, the CBAM module was incorporated into the neck, and DIoU-NMS was introduced in the head, as detailed in Table 6. These modifications not only improved detection performance but also reduced the network’s parameters and model size. The SCD-YOLOv5s model is 1.2 MB smaller than YOLOv5s and reaches 26.66 FPS (f \cdot s^{-1}), which is 2.21 FPS higher than YOLOv5s. This increase in FPS enables the model to process more images within the same time frame, further enhancing its practicality for real-time applications in agricultural defect detection.
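The reported speed figures can be put in perspective with a short calculation. The sketch below uses only the two numbers stated above (26.66 FPS and a gain of 2.21 FPS). The function name and the derived quantities, namely per-image latency and relative speed-up, are illustrative additions.

```python
def fps_summary(fps_new: float, fps_gain: float) -> dict:
    """Derive baseline FPS, per-image latency (ms), and relative speed-up."""
    fps_base = fps_new - fps_gain
    return {
        "baseline_fps": fps_base,
        "latency_new_ms": 1000.0 / fps_new,
        "latency_base_ms": 1000.0 / fps_base,
        "relative_gain_pct": 100.0 * fps_gain / fps_base,
    }


summary = fps_summary(fps_new=26.66, fps_gain=2.21)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
# baseline_fps: 24.45
# latency_new_ms: 37.51
# latency_base_ms: 40.90
# relative_gain_pct: 9.04
```

Under these figures, the improved network saves roughly 3.4 ms per image, a relative speed-up of about 9% over YOLOv5s.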

3.3.4. Comparative Experiments of Different Modules

To assess the effectiveness of the enhanced StarC3SE module integrated into the SCD-YOLOv5s network, this study first substituted the C3 module in the backbone part with the StarC3SE module, without incorporating any attention mechanisms or additional loss functions. Subsequently, a series of experiments were conducted, where the StarC3SE module was replaced with modules [54,55,56] for detecting surface defects on passion fruit. The results are summarized in Table 7.

As indicated in Table 7, the alternative modules evaluated [54,55,56] did not improve the precision of the baseline YOLOv5s network. In contrast, integrating the StarC3SE module resulted in a notable 9.3% improvement in precision. Regarding recall, none of the four evaluated modules enhanced performance; however, the StarC3SE module achieved the highest recall rate among them. For the mAP@0.5 metric, only the StarC3SE module demonstrated an improvement. In the mAP@0.95 evaluation, while the method [54] exhibited a decrease, the remaining three methods, including the StarC3SE module, achieved improvements, with the StarC3SE module showing the largest enhancement. In terms of the F_{1}-score, all four enhanced modules outperformed the baseline YOLOv5s network. Among them, the method [55] and this study demonstrated the greatest improvements, yielding identical final results. The experimental findings suggest that the superior performance of the StarC3SE module may be attributed to its enhanced structural design, which better captures spatial features relevant to passion fruit surface defects. Based on the results presented in Table 7, it can be concluded that the introduction of the StarC3SE module into the baseline YOLOv5s network significantly improves its performance for passion fruit surface defect detection, particularly in precision, mAP@0.5 and mAP@0.95 metrics, and F_{1}-score.

3.3.5. Comparative Experiments of Different Attention Mechanisms

To validate the effectiveness of the CBAM module employed in the SCD-YOLOv5s network, we replaced the CBAM module with SimAM [57], Coordinate Attention (CA) [58], and Efficient Channel Attention (ECA) [59] modules and conducted four sets of comparative experiments based on the baseline network StarC3SE-YOLOv5s. The experimental results are presented in Table 8.

As indicated in Table 8, incorporating ECA, SimAM, and CA modules into the StarC3SE-YOLOv5s network failed to enhance precision. Instead, precision decreased compared to the baseline network. In contrast, integrating the CBAM module improved precision to 94.2%, representing a 2.2% increase over the baseline StarC3SE-YOLOv5s network.

In terms of recall, only the CA module and CBAM module achieved a slight improvement, while both the ECA and SimAM modules led to decreases.

For mAP@0.5, incorporating ECA, SimAM, and CA modules resulted in reductions compared to the baseline network, whereas the CBAM module improved this metric. For mAP@0.95, all attention mechanism modules resulted in reductions compared to the baseline network.

In terms of the F_{1}-score, all evaluated attention mechanisms, including ECA, SimAM, and CA, achieved improvements over the baseline network. However, the CBAM module delivered the most significant enhancement, with the F_{1}-score reaching 85%, representing a 2% increase over the baseline StarC3SE-YOLOv5s network.

To better visualize the effectiveness of the CBAM attention mechanism in focusing on small defects on the surface of passion fruits, heatmaps were generated to compare the attention distributions of four different mechanisms. Figure 11 presents the heatmaps for four sets of images under each attention mechanism.

As indicated in Figure 11a, the CBAM module accurately identifies defective areas on the fruit surface. In contrast, the ECA mechanism, while capable of detecting defects, exhibits lower confidence and, in some cases, incorrectly focuses on non-defective regions. The CA mechanism fails to detect certain defects, such as the one in the lower-right corner, and similarly misidentifies normal areas as defective. SimAM, though effective in detecting obvious defects, misses smaller ones, demonstrating limitations in its focus on finer details.

In Figure 11b, CBAM again effectively identifies surface defects with high confidence. By comparison, ECA detects defects but with reduced confidence levels. The CA mechanism continues to exhibit a false focus on non-defective areas and fails to detect prominent defects. SimAM shows similar issues, with lower detection confidence relative to CBAM.

Figure 11c highlights further limitations of the other mechanisms. The CA mechanism fails to detect a defect in the lower-right corner of the fruit and identifies a more obvious defect with only 82% confidence. Similarly, ECA identifies the defect but with lower confidence than CBAM, while SimAM misses smaller defects.

In Figure 11d, CBAM effectively identifies all defective areas, whereas the CA mechanism fails to detect a defect in the upper-right region. SimAM exhibits similar limitations, and although ECA identifies the defect, its performance is inferior to CBAM in terms of confidence.

In summary, while the ECA, CA, and SimAM mechanisms demonstrate varying degrees of effectiveness, they either misidentify normal areas as defective or detect defects with lower confidence. In contrast, the CBAM attention mechanism consistently outperforms the others, accurately identifying defects with higher confidence and focusing more effectively on the defective regions.

3.3.6. Comparative Experiments Using Different Bounding Box Loss Functions

To evaluate the impact of different bounding box loss functions on detecting small defects on passion fruit surfaces, we conducted a series of comparative experiments. These experiments aimed to validate the effectiveness of the DIoU-NMS bounding box loss function proposed in this study. Specifically, we compared the DIoU-NMS function with EIoU [60], WIoU [61], SIoU [62], and the baseline CIoU loss functions, with detection results summarized in Table 9 and the corresponding loss curves presented in Figure 12.

In the YOLOv5s 6.0 version, the CIoU loss function is employed as the default bounding box loss function. Therefore, we selected the StarC3SE-CBAM-CIoU-YOLOv5s network as the baseline and replaced the loss function with EIoU, WIoU, SIoU, and DIoU-NMS for comparison. All bounding box loss functions, with the exception of WIoU, exhibit a clear convergence trend. In contrast, WIoU demonstrates divergent behavior, indicating instability and poor convergence characteristics. Consequently, the loss curve corresponding to the WIoU loss function is omitted from Figure 12 to maintain clarity. To further evaluate the effectiveness of the proposed DIoU-NMS loss function, additional comparative experiments were conducted using the standard DIoU loss. As shown in Figure 12, the DIoU-NMS loss function not only accelerates the convergence rate but also prolongs the effective training period by approximately 20 epochs. This extended convergence window allows the network to achieve superior performance and enhanced stability, underscoring the advantages of integrating DIoU-NMS into the proposed detection framework.

The quantitative results in Table 9 reveal that replacing the standard loss function with EIoU, WIoU, or SIoU led to declines in all performance metrics, including precision, recall, mAP@0.5, mAP@0.95, and F_{1}-score. In contrast, incorporating the DIoU-NMS loss function significantly enhanced detection performance. Specifically, the DIoU-NMS function improved precision by 1.7%, recall by 4.2%, mAP@0.5 by 2.9%, mAP@0.95 by 2.8%, and the F_{1}-score by 8% compared to the baseline network. When compared to the DIoU loss function, the DIoU-NMS further increased precision by 1.7%, recall by 6.2%, and achieved additional improvements in mAP@0.5 and mAP@0.95.

These results demonstrate that the DIoU-NMS loss function outperforms EIoU, WIoU, SIoU, DIoU, and CIoU loss functions in detecting subtle surface defects on passion fruits. Its superior performance across multiple metrics highlights its suitability for tasks requiring the precise detection of small and intricate features.

3.3.7. Ablation Experiment

Based on the results of the comparative experiments presented in Table 7, Table 8 and Table 9, each of the proposed modules demonstrates effectiveness in detecting small defects on the surface of passion fruits. To further assess their contributions, ablation experiments were conducted using the baseline SCD-YOLOv5s network, as shown in Table 10. The results of experimental groups indexed 6 to 9 in Table 10 reveal that removing the DIoU-NMS bounding box loss function had a relatively minor impact on detection performance. Specifically, this removal resulted in a 1.7% reduction in precision, a 4.2% decrease in recall, and declines of 2.9% and 2.7% in mAP@0.5 and mAP@0.95, respectively, as well as an 8% drop in the F_{1}-score. In contrast, removing the CBAM module had a more pronounced effect on detection accuracy, leading to an 8.6% decrease in precision, a 9% reduction in recall, and declines of 5.6% and 3.2% in mAP@0.5 and mAP@0.95, respectively, along with a 12% drop in the F_{1}-score.

In conclusion, while the DIoU-NMS loss function provides incremental improvements, the CBAM and StarC3SE modules contribute more significantly to the enhanced detection performance of the SCD-YOLOv5s network. These findings highlight the critical role of attention mechanisms and structural design in improving the accuracy and robustness of small defect detection in passion fruits.

4. Discussion

This study proposes a series of methods to enhance the accuracy of passion fruit surface defect detection. The effectiveness of the proposed improved network is validated through extensive comparative and ablation experiments. Compared to other models, the proposed approach demonstrates superior applicability for passion fruit defect detection tasks.

In comparison with the model reported in [63], both approaches achieve comparable accuracy, with results exceeding 95%. However, the method employed in [63] involves higher operational costs and complexity, making it unsuitable for real-time detection applications in agricultural settings. Relative to the model in [64], which is also based on the YOLO architecture, the method proposed in this study offers advantages in terms of model size and parameter count, resulting in faster inference and response times, thereby enhancing its suitability for real-time on-farm detection tasks.

Moreover, existing models [65,66,67,68] primarily focus on detecting passion fruits within complex backgrounds, emphasizing fruit identification rather than surface defect detection. In contrast, the proposed method not only targets surface-level defects but also demonstrates faster inference than [67]: the proposed model attains a detection speed of 26.66 FPS, outperforming [67] by a margin of 20.94 FPS.

However, this study is limited to detecting surface decay and cracks on passion fruits. Future research should address additional types of defects, such as shrinkage resulting from prolonged storage, to further enhance the applicability of the proposed approach.

5. Conclusions

In this paper, we proposed an improved target detection network named SCD-YOLOv5s to address the limitations of manual detection of passion fruit surface defects. Our approach enhances detection accuracy by replacing the C3 module in the YOLOv5s backbone with a lightweight StarC3 module integrated with the SENet attention mechanism. We further introduced the CBAM module into the neck to focus on tiny surface defects and substituted the original loss function with the DIoU-NMS bounding box loss function to accelerate convergence and improve localization precision. Experimental results demonstrate that the SCD-YOLOv5s network achieves a precision of 95.9% in detecting small defects on passion fruit surfaces, which is 13.2% higher than the original YOLOv5s network. The recall rate improved by 1.6%, and mAP@0.5 and mAP@0.95 increased by 6.7% and 5.5%, respectively. Additionally, the SCD-YOLOv5s network is more lightweight, with a 9.6% reduction in model parameters and an 8.6% decrease in model size. Furthermore, the inference speed increased by 2.21 FPS, reflecting the reduced computational complexity. For future work, we plan to extend our method to detect defects in other types of fruits and vegetables, aiming to generalize the network for broader agricultural applications. We also intend to optimize the network architecture further for deployment on edge devices, enhancing its practicality for on-site usage. Integrating multispectral imaging techniques could be another avenue to improve defect detection under varying environmental conditions.

Author Contributions

Conceptualization, Y.Z.; data curation, Y.Z.; funding acquisition, C.N.; investigation, Y.Z., S.X. and M.W.; methodology, Y.Z.; project administration, C.N.; software, Y.Z., S.X. and Z.L.; supervision, T.Z.; visualization, Y.Z.; writing—original draft, Y.Z.; writing—review and editing, Y.Z., S.X., Z.L. and T.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the 2023 Fujian Provincial Key Technology Innovation and Industrialization Project (Grant No. 2023G015).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author due to the data being part of an ongoing study.

Acknowledgments

This research was supported by the 2023 Jiangsu Province Industry-University-Research Cooperation Project (BY20230943).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:

YOLOv5	You only look once version 5
SCD-YOLOv5s	StarC3SE-CBAM-DIoUNMS-YOLOv5s
DL	Deep learning
FPS	frames per second
CNN	Convolutional Neural Network
ResNet	Residual Network
DCNN	Deep Convolutional Neural Networks
R-CNN	Region-based Convolutional Neural Network
mAP	mean Average Precision
CWD	Class Weight Distillation
M	Million
ELAN	Efficient Layer Aggregation Network
CAA	Context Anchor Attention
DIoU	Distance-IoU
NMS	Non-maximum suppression
PF	Passion fruit
SENet	Squeeze-and-Excitation Network
FM	Feature map
CBAM	Convolutional Block Attention Module
CAM	Channel Attention Module
SAM	Spatial Attention Module
IoU	Intersection over Union
CIoU	Complete-IoU
EIoU	Efficient-IoU
WIoU	Wise-IoU
SIoU	SCYLLA-IoU
GFLOPs	Giga floating-point operations
DSConv	Dynamic Snake Convolution
SAConv	Switchable Atrous Convolution
SPDConv	Space-to-Depth Convolution
CA	Coordinate Attention
ECA	Efficient Channel Attention

References

Corrêa, R.C.G.; Peralta, R.M.; Haminiuk, C.W.I.; Maciel, G.M.; Bracht, A.; Ferreira, I.C.F.R. The Past Decade Findings Related with Nutritional Composition, Bioactive Molecules and Biotechnological Applications of Passiflora Spp. (Passion Fruit). Trends Food Sci. Technol. 2016, 58, 79–95.
Duarte, I.d.A.E.; Milenkovic, D.; Borges, T.K.; Oliveira, L.d.L.d.; Costa, A.M. Brazilian Passion Fruit as a New Healthy Food: From Its Composition to Health Properties and Mechanisms of Action. Food Funct. 2021, 12, 11106–11120.
Fonseca, A.M.; Geraldi, M.V.; Junior, M.R.M.; Silvestre, A.J.; Rocha, S.M. Purple Passion Fruit (Passiflora Edulis f. Edulis): A Comprehensive Review on the Nutritional Value, Phytochemical Profile and Associated Health Effects. Food Res. Int. 2022, 160, 111665.
Kawakami, S.; Morinaga, M.; Tsukamoto-Sen, S.; Mori, S.; Matsui, Y.; Kawama, T. Constituent Characteristics and Functional Properties of Passion Fruit Seed Extract. Life 2022, 12, 38.
Chebet, D.; Savini, I.; Rimberia, F.K. Passion Fruits Resilience to Global Warming and Climate Change. In Cultivation for Climate Change Resilience, Volume 1; CRC Press: Boca Raton, FL, USA, 2023; pp. 146–162.
Lo, P.-H.; Huang, J.-H.; Chang, C.-C.; Namisy, A.; Chen, C.Y.; Chung, W.-H. Diversity and Characteristics of Fusarium solani Species Complex (FSSC) Isolates Causing Collar Rot and Fruit Rot of Passion Fruit in Taiwan. Plant Dis. 2024, 109, 170–182.
Lima, L.K.S.; de Jesus, O.N.; Soares, T.L.; de Oliveira, S.A.S.; Haddad, F.; Girardi, E.A. Water Deficit Increases the Susceptibility of Yellow Passion Fruit Seedlings to Fusarium Wilt in Controlled Conditions. Sci. Hortic. 2019, 243, 609–621.
Anaruma, N.D.; Schmidt, F.L.; Duarte, M.C.T.; Figueira, G.M.; Delarmelina, C.; Benato, E.A.; Sartoratto, A. Control of Colletotrichum Gloeosporioides (Penz.) Sacc. in Yellow Passion Fruit Using Cymbopogon Citratus Essential Oil. Braz. J. Microbiol. 2010, 41, 66–73.
Bano, A.; Gupta, A.; Prusty, M.R.; Kumar, M. Elicitation of Fruit Fungi Infection and Its Protective Response to Improve the Postharvest Quality of Fruits. Stresses 2023, 3, 231–255.
Pereira, Z.C.; dos Anjos Cruz, J.M.; Corrêa, R.F.; Sanches, E.A.; Campelo, P.H.; de Araújo Bezerra, J. Passion Fruit (Passiflora Spp.) Pulp: A Review on Bioactive Properties, Health Benefits and Technological Potential. Food Res. Int. 2023, 166, 112626.
Costa, A.P.; Peixoto, J.R.; Blum, L.E.B.; Pires, M.d.C. Standard Area Diagram Set for Scab Evaluation in Fruits of Sour Passion Fruit. J. Agric. Sci. 2019, 11, 298–305.
Cubero, S.; Lee, W.S.; Aleixos, N.; Albert, F.; Blasco, J. Automated Systems Based on Machine Vision for Inspecting Citrus Fruits from the Field to Postharvest—A Review. Food Bioprocess Technol. 2016, 9, 1623–1639.
Li, W.; Ran, F.; Long, Y.; Mo, F.; Shu, R.; Yin, X. Evidences of Colletotrichum Fructicola Causing Anthracnose on Passiflora edulis Sims in China. Pathogens 2021, 11, 6.
Lin, F.; Chen, D.; Liu, C.; He, J. Non-Destructive Detection of Golden Passion Fruit Quality Based on Dielectric Characteristics. Appl. Sci. 2024, 14, 2200.
Riascos, D.; Quiroga, I.; Gómez, R.; Hoyos-Carvajal, L. Cladosporium: Causal Agent of Scab in Purple Passion Fruit or Gulupa (Passiflora edulis Sims.). Agric. Sci. 2012, 3, 299–305.
Wang, Y.; Teng, Y.; Zhang, J.; Zhang, Z.; Wang, C.; Wu, X.; Long, X. Passion Fruit Plants Alter the Soil Microbial Community with Continuous Cropping and Improve Plant Disease Resistance by Recruiting Beneficial Microorganisms. PLoS ONE 2023, 18, e0281854.
Henila, M.; Chithra, P. Segmentation Using Fuzzy Cluster-based Thresholding Method for Apple Fruit Sorting. IET Image Process. 2020, 14, 4178–4187.
Siridhara, A.L.; Manikanta, K.V.; Yadav, D.; Varun, P.; Saragada, J. Defect Detection in Fruits and Vegetables Using K Means Segmentation and Otsu’s Thresholding. In Proceedings of the 2023 International Conference on Networking and Communications (ICNWC), Chennai, India, 5–6 April 2023; pp. 1–5.
Chithra, P.; Henila, M. Apple Fruit Sorting Using Novel Thresholding and Area Calculation Algorithms. Soft Comput. 2021, 25, 431–445.
Shruthi, U.; Nagaveni, V. TomSevNet: A Hybrid CNN Model for Accurate Tomato Disease Identification with Severity Level Assessment. Neural Comput. Applic 2024, 36, 5165–5181.
He, Y.; Zhang, N.; Ge, X.; Li, S.; Yang, L.; Kong, M.; Guo, Y.; Lv, C. Passion Fruit Disease Detection Using Sparse Parallel Attention Mechanism and Optical Sensing. Agriculture 2025, 15, 733.
Nithya, R.; Santhi, B.; Manikandan, R.; Rahimi, M.; Gandomi, A.H. Computer Vision System for Mango Fruit Defect Detection Using Deep Convolutional Neural Network. Foods 2022, 11, 3483.
Zhou, C.; Wang, H.; Liu, Y.; Ni, X.; Liu, Y. Green Plums Surface Defect Detection Based on Deep Learning Methods. IEEE Access 2022, 10, 100397–100407.
Zhou, H.; Zhuang, Z.; Liu, Y.; Liu, Y.; Zhang, X. Defect Classification of Green Plums Based on Deep Learning. Sensors 2020, 20, 6993.
Fan, S.; Li, J.; Zhang, Y.; Tian, X.; Wang, Q.; He, X.; Zhang, C.; Huang, W. On Line Detection of Defective Apples Using Computer Vision System Combined with Deep Learning Methods. J. Food Eng. 2020, 286, 110102.
Da Costa, A.Z.; Figueroa, H.E.; Fracarolli, J.A. Computer Vision Based Detection of External Defects on Tomatoes Using Deep Learning. Biosyst. Eng. 2020, 190, 131–144.
Nur Alam, M.; Saugat, S.; Santosh, D.; Sarkar, M.I.; Al-Absi, A.A. Apple Defect Detection Based on Deep Convolutional Neural Network. In Proceedings of International Conference on Smart Computing and Cyber Security; Pattnaik, P.K., Sain, M., Al-Absi, A.A., Kumar, P., Eds.; Lecture Notes in Networks and Systems; Springer: Singapore, 2021; Volume 149, pp. 215–223. ISBN 978-981-15-7989-9.
Xie, W.; Huang, K.; Wei, S.; Yang, D. Extraction and Modeling of Carrot Crack for Crack Removal with a 3D Vision. Comput. Electron. Agric. 2024, 224, 109192.
Han, C.H.; Kim, E.; Doan, T.N.N.; Han, D.; Yoo, S.J.; Kwak, J.T. Region-Aggregated Attention CNN for Disease Detection in Fruit Images. PLoS ONE 2021, 16, e0258880.
Wang, P.; Niu, T.; He, D. Tomato Young Fruits Detection Method under near Color Background Based on Improved Faster R-CNN with Attention Mechanism. Agriculture 2021, 11, 1059.
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
Liu, C.; Lin, W.; Feng, Y.; Guo, Z.; Xie, Z. ATC-YOLOv5: Fruit Appearance Quality Classification Algorithm Based on the Improved YOLOv5 Model for Passion Fruits. Mathematics 2023, 11, 3615.
Lu, J.; Chen, W.; Lan, Y.; Qiu, X.; Huang, J.; Luo, H. Design of Citrus Peel Defect and Fruit Morphology Detection Method Based on Machine Vision. Comput. Electron. Agric. 2024, 219, 108721.
Yao, J.; Qi, J.; Zhang, J.; Shao, H.; Yang, J.; Li, X. A Real-Time Detection Algorithm for Kiwifruit Defects Based on YOLOv5. Electronics 2021, 10, 1711.
Huang, Y.; Xiong, J.; Yao, Z.; Huang, Q.; Tang, K.; Jiang, D.; Yang, Z. A Fluorescence Detection Method for Postharvest Tomato Epidermal Defects Based on Improved YOLOv5m. J. Sci. Food Agric 2024, 104, 6615–6625.
Obsie, E.Y.; Qu, H.; Zhang, Y.-J.; Annis, S.; Drummond, F. Yolov5s-CA: An Improved Yolov5 Based on the Attention Mechanism for Mummy Berry Disease Detection. Agriculture 2022, 13, 78.
Wang, B.; Hua, J.; Xia, L.; Lu, F.; Sun, X.; Guo, Y.; Su, D. A Defect Detection Method for Akidzuki Pears Based on Computer Vision and Deep Learning. Postharvest Biol. Technol. 2024, 218, 113157.
Yang, X.; Gao, S.; Xia, C.; Zhang, B.; Chen, R.; Gao, J.; Zhu, W. Detection of Cigar Defect Based on the Improved YOLOv5 Algorithm. In Proceedings of the 2024 IEEE 4th International Conference on Software Engineering and Artificial Intelligence (SEAI), Xiamen, China, 21–23 June 2024; pp. 99–106.
Li, K.; Wang, J.; Jalil, H.; Wang, H. A Fast and Lightweight Detection Algorithm for Passion Fruit Pests Based on Improved YOLOv5. Comput. Electron. Agric. 2023, 204, 107534.
Chen, D.; Lin, F.; Lu, C.; Zhuang, J.; Su, H.; Zhang, D.; He, J. YOLOv8-MDN-Tiny: A Lightweight Model for Multi-Scale Disease Detection of Postharvest Golden Passion Fruit. Postharvest Biol. Technol. 2025, 219, 113281. [Google Scholar] [CrossRef]
Lv, L.; Yilihamu, Y.; Ye, Y. Apple Surface Defect Detection Based on Lightweight Improved YOLOv5s. IJICT 2024, 24, 113–128. [Google Scholar] [CrossRef]
Xiao, J.; Kang, G.; Wang, L.; Lin, Y.; Zeng, F.; Zheng, J.; Zhang, R.; Yue, X. Real-Time Lightweight Detection of Lychee Diseases with Enhanced YOLOv7 and Edge Computing. Agronomy 2023, 13, 2866. [Google Scholar] [CrossRef]
Liu, P.; Yasenjiang, M. Defect Apple Detection Method Based on Lightweight YOLOv8. In Proceedings of the 2024 6th International Conference on Communications, Information System and Computer Engineering (CISCE), Guangzhou, China, 10–12 May 2024; pp. 1193–1196. [Google Scholar]
Xu, C.; Su, J. Research on Kiwi Surface Defect Detection Algorithm Based on Improved YOLOv7-Tiny. In Proceedings of the 2024 6th International Conference on Communications, Information System and Computer Engineering (CISCE), Guangzhou, China, 10–12 May 2024; pp. 1022–1029. [Google Scholar]
Yu, J.; Fu, R. Lightweight Yolov5s-Super Algorithm for Multi-Defect Detection in Apples. Eng. Agríc. 2024, 44, e20230175. [Google Scholar] [CrossRef]
Zhang, B.; Wang, R.; Zhang, H.; Yin, C.; Xia, Y.; Fu, M.; Fu, W. Dragon Fruit Detection in Natural Orchard Environment by Integrating Lightweight Network and Attention Mechanism. Front. Plant Sci. 2022, 13, 1040923. [Google Scholar] [CrossRef]
Wu, M.; Lin, H.; Shi, X.; Zhu, S.; Zheng, B. MTS-YOLO: A Multi-Task Lightweight and Efficient Model for Tomato Fruit Bunch Maturity and Stem Detection. Horticulturae 2024, 10, 1006. [Google Scholar] [CrossRef]
Sekharamantry, P.K.; Melgani, F.; Malacarne, J. Deep Learning-Based Apple Detection with Attention Module and Improved Loss Function in YOLO. Remote Sens. 2023, 15, 1516. [Google Scholar] [CrossRef]
Chen, W.; Liu, M.; Zhao, C.; Li, X.; Wang, Y. MTD-YOLO: Multi-Task Deep Convolutional Neural Network for Cherry Tomato Fruit Bunch Maturity Detection. Comput. Electron. Agric. 2024, 216, 108533. [Google Scholar] [CrossRef]
Qiu, Z.; Huang, Z.; Mo, D.; Tian, X.; Tian, X. GSE-YOLO: A Lightweight and High-Precision Model for Identifying the Ripeness of Pitaya (Dragon Fruit) Based on the YOLOv8n Improvement. Horticulturae 2024, 10, 852. [Google Scholar] [CrossRef]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Proceedings of the AAAI conference on artificial intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic Snake Convolution Based on Topological Geometric Constraints for Tubular Structure Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6070–6079. [Google Scholar]
Qiao, S.; Chen, L.-C.; Yuille, A. Detectors: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10213–10224. [Google Scholar]
Sunkara, R.; Luo, T. No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects. In Machine Learning and Knowledge Discovery in Databases; Amini, M.-R., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2023; Volume 13715, pp. 443–459. ISBN 978-3-031-26408-5. [Google Scholar]
Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. Simam: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
Zhang, Y.-F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and Efficient IOU Loss for Accurate Bounding Box Regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism 2023. arXiv 2023, arXiv:2301.10051. [Google Scholar]
Gevorgyan, Z. SIoU Loss: More Powerful Learning for Bounding Box Regression 2022. arXiv 2022, arXiv:2205.12740. [Google Scholar]
Lu, Y.; Wang, R.; Hu, T.; He, Q.; Chen, Z.S.; Wang, J.; Liu, L.; Fang, C.; Luo, J.; Fu, L. Nondestructive 3D Phenotyping Method of Passion Fruit Based on X-Ray Micro-Computed Tomography and Deep Learning. Front. Plant Sci. 2023, 13, 1087904. [Google Scholar] [CrossRef]
Ou, J.; Zhang, R.; Li, X.; Lin, G. Research and Explainable Analysis of a Real-Time Passion Fruit Detection Model Based on FSOne-YOLOv7. Agronomy 2023, 13, 1993. [Google Scholar] [CrossRef]
Abdo, A.; Hong, C.J.; Kuan, L.M.; Pauzi, M.M.; Sumari, P.; Abualigah, L.; Zitar, R.A.; Oliva, D. Markisa/Passion Fruit Image Classification Based Improved Deep Learning Approach Using Transfer Learning. In Classification Applications with Deep Learning and Machine Learning Technologies; Abualigah, L., Ed.; Studies in Computational Intelligence; Springer International Publishing: Cham, Switzerland, 2023; Volume 1071, pp. 143–189. ISBN 978-3-031-17575-6. [Google Scholar]
Sidehabi, S.W.; Suyuti, A.; Areni, I.S.; Nurtanio, I. Classification on Passion Fruit’s Ripeness Using K-Means Clustering and Artificial Neural Network. In Proceedings of the 2018 International Conference on Information and Communications Technology (ICOIACT), Online, 6–7 March 2018; pp. 304–309. [Google Scholar]
Tu, S.; Pang, J.; Liu, H.; Zhuang, N.; Chen, Y.; Zheng, C.; Wan, H.; Xue, Y. Passion Fruit Detection and Counting Based on Multiple Scale Faster R-CNN Using RGB-D Images. Precis. Agric 2020, 21, 1072–1091. [Google Scholar] [CrossRef]
Sun, Q.; Li, P.; He, C.; Song, Q.; Chen, J.; Kong, X.; Luo, Z. A Lightweight and High-Precision Passion Fruit YOLO Detection Model for Deployment in Embedded Devices. Sensors 2024, 24, 4942. [Google Scholar] [CrossRef]

Figure 1. Detection equipment. (a) Detection equipment, (b) industrial cameras, (c) positions of the light source, camera, and passion fruit.

Figure 2. Passion fruit defect categories. (a) Normal, (b) decay, (c) crack.

Figure 3. Annotated passion fruit. (a) Original image, (b) annotated image.

Figure 4. Overall workflow of the proposed method.

Figure 5. Improved YOLOv5s structure.

Figure 6. SENet module structure.

Figure 7. Improved C3 module structure. (a) Starblock module structure, (b) StarSE module structure, (c) StarC3SE module structure.

Figure 8. CBAM module structure.

Figure 9. Detection results of the YOLOv5 and SCD-YOLOv5s networks. (a) Detection results of YOLOv5 models of different sizes and of SCD-YOLOv5s. (b) Examples of erroneous identification.

Figure 10. Loss value and mAP@0.5 of different YOLOv5 networks. (a) Value of loss; (b) Value of mAP@0.5.

Figure 11. Heatmaps obtained with different attention mechanisms. (a–d) Four randomly selected passion fruit samples with surface defects.

Figure 12. Loss curves obtained with four bounding box loss functions.

Table 1. Parameters of the data augmentation.

Method	Parameters
Flipping	Horizontal, vertical
Scaling	0–16%
Rotation	0–30°
Brightness	−21% to +21%
Exposure	−11% to +11%
Cropping	0–10°

Table 2. Computer configuration used in the experiment.

Name	Configuration
Random Access Memory	30 GB
CPU	7 vCPU Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40 GHz
Graphics Card	NVIDIA RTX 3080 × 2 (20 GB)
System	Ubuntu 20.04
PyTorch Version	1.10.0
CUDA Version	11.3

Table 3. Training parameters used in the experiment.

Training Parameters	Values
Image size	640 × 640
Epochs	400
Batch-size	16
Initial learning rate	0.01
Optimizer	SGD

Table 4. Experimental results of different target detection models.

Models	Precision (%)	Recall (%)	mAP@0.5 (%)	GFLOPs	Model Size (MB)
R-CNN	55.6	60.45	66.56	80.2	224.3
Fast-RCNN	59.8	58.7	70.12	52.3	125.7
Faster-RCNN	64.0	68.75	75.67	40.6	108.2
Mask-RCNN	60.5	72.12	73.88	70.2	187.5
SCD-YOLOv5s	95.9	84.7	88.4	14.3	12.6

Table 5. Results of different versions of the YOLO network with different depths.

Model	Precision (%)	Recall (%)	mAP@0.5 (%)	mAP@0.95 (%)	F₁-score (%)	Parameters	Model Size (MB)
YOLOv7tiny	83.3	77.9	81.6	64.62	80.0	6,017,694	12.3
YOLOv7	85.5	77.9	80.4	64.84	81.0	37,201,950	74.8
YOLOv7x	85.8	80.9	84.2	64.87	78.0	70,821,830	142.1
YOLOv8n	80.3	76.7	79.4	66.59	79.0	3,011,238	6.3
YOLOv8s	88.4	71.2	79.3	66.63	78.0	11,136,374	22.5
YOLO11s	86.7	78.9	83.7	69.9	82.0	9,428,566	19.2
SCD-YOLOv5s (Ours)	95.9	84.7	88.4	71.2	93.0	6,408,833	12.6

Table 6. Results of SCD-YOLOv5s and different YOLOv5 models with different sizes.

Model	Precision (%)	Recall (%)	mAP@0.5 (%)	mAP@0.95 (%)	F₁-score (%)	Parameters	Model Size (MB)	FPS (f·s⁻¹)
YOLOv5s	82.7	83.1	81.7	65.7	76.0	7,025,023	13.8	24.45
YOLOv5m	83.3	77.8	80.7	68.2	81.0	20,875,359	40.3	25.7
YOLOv5l	83.3	74.5	81.0	67.1	82.0	46,563,709	88.6	25.97
YOLOv5x	87.9	76.6	78.7	61.9	82.0	46,563,709	165.0	17.92
SCD-YOLOv5s (Ours)	95.9	84.7	88.4	71.2	93.0	6,408,833	12.6	26.66
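
The improvements quoted in the abstract can be reproduced from the YOLOv5s and SCD-YOLOv5s rows of Table 6. The following is a minimal sketch of that arithmetic; the dictionary keys and helper names are illustrative and do not come from the paper. The accuracy gains are differences in percentage points. For the parameter count, the sketch shows both possible reference choices, because the 9.6% reduction stated in the abstract appears to correspond to dividing by the SCD-YOLOv5s count rather than by the YOLOv5s count.

```python
# Values copied from Table 6 (YOLOv5s baseline vs. SCD-YOLOv5s).
# Key names are illustrative, not taken from the paper.
baseline = {"precision": 82.7, "recall": 83.1, "map50": 81.7, "map95": 65.7,
            "f1": 76.0, "params": 7_025_023}
proposed = {"precision": 95.9, "recall": 84.7, "map50": 88.4, "map95": 71.2,
            "f1": 93.0, "params": 6_408_833}


def point_gain(key):
    """Absolute gain of the proposed model, in percentage points."""
    return round(proposed[key] - baseline[key], 1)


def relative_cut(key, reference):
    """Reduction of `key`, expressed in percent of the reference model's value."""
    return 100.0 * (baseline[key] - proposed[key]) / reference[key]


gains = {k: point_gain(k) for k in ("precision", "recall", "map50", "map95", "f1")}
param_cut_vs_baseline = relative_cut("params", baseline)   # about 8.8 %
param_cut_vs_proposed = relative_cut("params", proposed)   # about 9.6 %

print(gains)
print(f"{param_cut_vs_baseline:.2f} % / {param_cut_vs_proposed:.2f} %")
```
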

Table 7. Experimental results of different modules.

Index	Model	Precision (%)	Recall (%)	mAP@0.5 (%)	mAP@0.95 (%)	F₁-score (%)
1	YOLOv5s	82.7	83.1	81.7	65.7	76.0
2	YOLOv5s + DSConv [54]	81.7	78.1	79.8	64.9	82.0
3	YOLOv5s + SAConv [55]	77.9	76.3	78.6	67.1	83.0
4	YOLOv5s + SPDConv [56]	82.6	78.8	79.7	68.5	81.0
5	YOLOv5s + StarC3SE	92.0	78.9	83.2	68.8	83.0

Table 8. Experimental results of different attention mechanisms.

Index	StarC3-YOLOv5s	ECA	SimAM	CA	CBAM	Precision (%)	Recall (%)	mAP@0.5 (%)	mAP@0.95 (%)	F₁-score (%)
1	✔					92.0	78.9	83.2	68.8	83.0
2	✔	✔				87.3	78.9	72.1	66.0	83.0
3	✔		✔			85.7	77.4	77.0	61.2	84.0
4	✔			✔		80.4	82.7	82.5	66.6	82.0
5	✔				✔	94.2	80.5	85.5	68.5	85.0

Table 9. Experimental results of different bounding box loss functions.

Index	StarC3-CBAM-CIoU-YOLOv5s	EIoU	WIoU	SIoU	DIoU	DIoU-NMS	Precision (%)	Recall (%)	mAP@0.5 (%)	mAP@0.95 (%)	F₁-score (%)
1	✔						94.2	80.5	85.5	68.5	85.0
2	✔	✔					82.2	74.3	82.3	66.7	81.0
3	✔		✔				68.3	44.6	25.2	66.8	26.0
4	✔			✔			82.6	76.6	79.4	66.0	80.0
5	✔				✔		94.2	78.5	81.7	64.7	86.0
6	✔					✔	95.9	84.7	88.4	71.2	93.0
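
For readers unfamiliar with the DIoU-NMS option in row 6, the sketch below illustrates the general idea in plain Python, following the Distance-IoU formulation of Zheng et al. listed in the references: DIoU subtracts the normalized squared distance between box centres from the IoU, and greedy non-maximum suppression then uses DIoU in place of IoU as the overlap criterion. The box format `(x1, y1, x2, y2)`, the function names, and the threshold of 0.5 are assumptions made for illustration; this is not the authors' implementation or their settings.

```python
def diou(a, b):
    """Distance-IoU of two boxes given as (x1, y1, x2, y2)."""
    # Intersection and union areas.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    iou = inter / union if union > 0 else 0.0
    # Squared distance between the two box centres.
    d2 = ((a[0] + a[2]) - (b[0] + b[2])) ** 2 / 4 + ((a[1] + a[3]) - (b[1] + b[3])) ** 2 / 4
    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    c2 = cw ** 2 + ch ** 2
    return iou - d2 / c2 if c2 > 0 else iou


def diou_nms(boxes, scores, threshold=0.5):
    """Greedy NMS that suppresses a box when its DIoU with a kept box >= threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        order = [i for i in order if diou(boxes[best], boxes[i]) < threshold]
    return keep


# Hypothetical example: two heavily overlapping boxes and one distant box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = diou_nms(boxes, scores)
print(kept)
```

Because the centre-distance penalty lowers the overlap score of boxes whose centres are far apart, two overlapping boxes that belong to different nearby defects are less likely to be merged than under plain IoU-based suppression.
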

Table 10. Experimental results of the ablation experiments.

Index	YOLOv5s	StarC3SE	DIoU-NMS	CBAM	Precision (%)	Recall (%)	mAP@0.5 (%)	mAP@0.95 (%)	F₁-score (%)
1	✔				82.7	83.1	81.7	65.7	76.0
2	✔	✔			92.0	78.9	83.2	68.8	83.0
4	✔		✔		86.6	77.9	81.1	64.8	79.0
5	✔			✔	89.4	78.7	85.1	68.6	83.0
6	✔		✔	✔	89.1	79.9	84.8	67.7	84.0
7	✔	✔		✔	94.2	80.5	85.5	68.5	85.0
8	✔	✔	✔		87.3	75.7	82.8	68.0	81.0
9	✔	✔	✔	✔	95.9	84.7	88.4	71.2	93.0
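
As a reading aid for the F₁-score column, the sketch below computes the standard harmonic mean of precision and recall, F₁ = 2PR/(P + R), for rows 1 and 9 of Table 10. The function name and the row dictionary are illustrative. The harmonic means of the listed precision and recall values come out at roughly 82.9 and 90.0, whereas the table reports 76.0 and 93.0; one possible explanation, which is an assumption and is not stated in this part of the article, is that the tabulated F₁-scores were measured at a different confidence threshold than the tabulated precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) copied from rows 1 and 9 of Table 10.
rows = {1: (82.7, 83.1), 9: (95.9, 84.7)}
f1 = {idx: round(f1_score(p, r), 1) for idx, (p, r) in rows.items()}
print(f1)
```
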

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhou, Y.; Li, Z.; Xue, S.; Wu, M.; Zhu, T.; Ni, C. Lightweight SCD-YOLOv5s: The Detection of Small Defects on Passion Fruit with Improved YOLOv5s. Agriculture 2025, 15, 1111. https://doi.org/10.3390/agriculture15101111
