Article

Apple Defect Detection in Complex Environments

1 School of Physics and Electronic Information, Huaibei Normal University, Huaibei 235000, China
2 Anhui Key Laboratory of Intelligent Computing and Applications, Huaibei 235000, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(23), 4844; https://doi.org/10.3390/electronics13234844
Submission received: 25 October 2024 / Revised: 6 December 2024 / Accepted: 6 December 2024 / Published: 9 December 2024

Abstract:
Aiming at the high false-detection and missed-detection rates for apple surface defects in complex environments, a new apple surface defect detection network, SMC-YOLOv8n (space-to-depth convolution, Multi-Scale Dilated Attention, and Context Guided Feature Pyramid Network on You Only Look Once version 8 nano), is designed. Firstly, space-to-depth convolution (SPD-Conv) is introduced before each C2f (Faster Implementation of CSP Bottleneck with 2 convolutions) module in the backbone as a preprocessing step to improve the quality of the input features. Secondly, the Bottleneck in the neck's C2f is removed and Multi-Scale Dilated Attention (MSDA) is introduced to enhance feature extraction. Finally, the Context Guided Feature Pyramid Network (CGFPN) replaces the neck's Concat operation for feature fusion, improving the expressiveness of the fused features. Compared with the YOLOv8n baseline, mean Average Precision at IoU 0.5 (mAP50) increased by 2.7% and 1.1%, and mAP50-95 increased by 4.1% and 2.7%, on the self-made visible-light apple surface defect dataset in complex environments and on a public dataset, respectively. The experimental results show that SMC-YOLOv8n achieves higher efficiency in apple defect detection, laying a solid foundation for intelligent apple picking and grading.

1. Introduction

As living standards rise and health awareness grows, market demand for high-quality fruit keeps expanding. Apples, as a representative healthy fruit, are in strong demand. Apple defect detection in complex environments (ADDCE) has therefore become a research hotspot. Complex environments here refer to the many interference factors present during cultivation, including uneven illumination, cluttered backgrounds, occlusion, climate change, and soil quality differences. Manual inspection for ADDCE faces many problems: low efficiency, unstable accuracy, high labor cost, strong subjectivity, and unquantifiable criteria. Automatic detection technologies such as machine vision open a new way to solve these problems. Over time, Artificial Intelligence (AI) technology has steadily matured and is now widely used in agriculture [1], autonomous driving [2], industrial automation [3], smart homes [4] and medicine [5]. As AI technology matures and is deployed at scale, the agricultural industry will undergo tremendous changes. Fast, accurate quality inspection, including appearance, ripeness and defect detection through image recognition and machine learning, can help reduce labor costs and improve apple quality and market competitiveness. Combining artificial intelligence with smart agriculture can make agricultural production more accurate, efficient and sustainable.
ADDCE research falls into two main categories: techniques operating in the invisible part of the electromagnetic spectrum, and machine vision techniques using the visible part [6], as shown in Figure 1. Techniques based on the invisible spectrum are non-destructive, precise and accurate, fast, widely applicable, safe and environmentally friendly, and the academic community has explored them extensively. Els Herremans et al. (2014) [7] used X-ray computed tomography to identify healthy apples by detecting watercore. Omid Doosti-Irani et al. (2015) [8] used thermal imaging to estimate the severity of apple bruising. Yuzhen Lu et al. (2016) [9] built a structured illumination reflectance imaging (SIRI) system to detect fresh wounds such as bruises on apples. Sanaz Jarolmasjed et al. (2017) [10] used near-infrared (NIR) spectroscopy to classify healthy and diseased apples. Baohua Zhang et al. (2018) [11] studied the technological leap from hyperspectral imaging (HSI) to multispectral imaging (MSI) and demonstrated the stability and portability of common apple surface defect detection algorithms. On the whole, techniques based on the invisible spectrum usually depend heavily on specialized equipment, are complex to operate, demand skilled personnel and strict site and facility conditions, and yield results that are affected by many factors. With the advancement of agricultural modernization and the continued progress of smart agricultural technology, machine vision has seen a surge of application in agriculture, with steadily growing participation from researchers.
By simulating human visual function, a machine can acquire, process and interpret image data to realize quality inspection of agricultural products. Machine-vision-driven ADDCE techniques fall into two categories: traditional methods with a long history and cutting-edge deep learning methods. Early researchers studied traditional algorithms in depth. For example, Akira Mizushima et al. (2012) [12] used a linear support vector machine (SVM) to compute the optimal separating hyperplane, converted the color image to grayscale, and obtained an optimal threshold near the apple with the Otsu method to sort and grade it. Based on the color and defect characteristics of apples, Ji, YH et al. (2018) [13] used the Otsu algorithm to segment the hue channel and the Canny edge detector to extract defect features; finally, the defect area was segmented morphologically to detect defective apples effectively. Ann Nosseir et al. (2019) [14] focused on the color richness and texture complexity of fruit images, using the k-nearest neighbor algorithm to classify apples, strawberries, mangoes and bananas, and an SVM to classify their quality (fresh or rotten). Practice shows that traditional feature extraction relies mainly on manual design and suffers from high complexity, vulnerability to noise and interference, and difficult feature selection [15]. These drawbacks greatly reduce the practicality of traditional feature extraction in ADDCE. With the development of deep learning, more and more researchers have explored data-driven automatic feature extraction to overcome these shortcomings. Deep learning detectors are mainly two-stage or single-stage. The core idea of the former is to decompose detection into two consecutive, interdependent stages.
The first stage performs preliminary screening, generating candidate regions that may contain the target object; the second classifies and precisely localizes those candidates [16]. Such algorithms usually excel in accuracy but are relatively slow. Single-stage detectors omit candidate region generation, complete detection faster, and usually achieve higher detection speed, which has won them wide favor. For example, Shuxiang Fan et al. (2022) [17] proposed a real-time apple defect detection technique based on the YOLOv4 architecture, but it targets only fruit defect recognition on commercial packing lines and offers no technical support for ADDCE. Yu, JA et al. (2024) [18] proposed an improved lightweight algorithm based on YOLOv5s for multi-defect apple detection, but the research focuses on the growth characteristics of apple defects in indoor environments. Bo Han et al. (2024) [19] proposed an improved VEW-YOLOv8n for classifying diseased apples, providing a novel and effective method for non-destructive apple grading; however, the dataset used is partly drawn from single-sample images collected online rather than the natural environment.
ADDCE faces random defect locations, low saliency, difficult image acquisition, high data labeling costs, and low accuracy. Focusing on these problems, an improved ADDCE method, SMC-YOLOv8n, is proposed together with an apple defect dataset collected in the natural environment. Firstly, SPD-Conv is introduced before each C2f of the backbone to enhance the model's ability to capture features of low-resolution and defective apples. MSDA is embedded in the neck's C2f to suppress noise and redundant information effectively, further improving detection accuracy and computational efficiency to better meet the requirements of the ADDCE task. CGFPN is designed to exploit context information to assist apple defect detection, giving higher robustness in complex scenes with occlusion, deformation, and illumination changes.

2. Overview of YOLOv8 Target Detection

YOLOv8 [20] is the major successor to YOLOv5, released by Ultralytics in 2023. Its architecture, shown in Figure 2, consists of three key components: Backbone, Neck, and Head. The backbone follows the CSP design of YOLOv5 but replaces the CSP Bottleneck with 3 convolutions (C3) with C2f. The Neck still follows the PAN-FPN idea. The detection head decouples the classification and regression branches: classification uses BCE Loss, while regression combines DFL Loss and CIoU Loss to further improve detection performance.

3. Materials and Methods

3.1. Tools for Low-Definition and Small Object Recognition

In ADDCE, detecting apples that occupy only a small portion of the field of view has always been challenging: when image fineness is lacking and apple defect features are not obvious, the performance of YOLOv8 degrades rapidly. The reason is that strided convolution and pooling easily cause excessive information compression, weakening the model's ability to capture fine structure. Inspired by Jian Zhang et al. [21], who built a ship detection network for complex water surface environments on YOLOv8 and introduced SPD-Conv and focal modulation modules in place of the spatial pyramid pooling module with FPN, with experiments showing improved detection accuracy and stronger generalization, this paper introduces SPD-Conv [22] before each C2f of the Backbone; the ship detection task shares ADDCE's complex environments, unclear target features, and technical requirements. The structure for scale = 2 is shown in Figure 3. It consists of an SPD layer and a non-strided convolution layer. The main idea of the SPD layer is to split the original feature map into four sub-maps, each with half the original spatial size and unchanged depth, by sampling alternate rows and columns; the four sub-maps are then stacked along the channel axis. In other words, spatial information is transformed into depth: after rearrangement, the depth is 4 times the original, the spatial size is halved, and as much spatial information as possible is retained.
A convolutional layer without downsampling is then applied to the output of the SPD layer, performing feature learning while maintaining spatial resolution. This further preserves the fine-grained information of apple defects and keeps the feature map intact through convolution and pooling. The combination of the SPD layer and the non-strided convolution layer improves context capture, model efficiency and performance, and adapts to complex scenes. Introducing SPD-Conv as preprocessing before C2f in YOLOv8 has significant advantages over traditional methods [23]: it extracts the subtle features of low-definition and small objects more effectively, improves understanding of complex scenes, shows stronger robustness under varying illumination and backgrounds, and handles interference effectively. In addition, SPD-Conv optimizes the convolution operation, reduces computational complexity, and improves processing speed, making it suitable for real-time applications. It provides an effective information recovery strategy for YOLOv8's weaknesses in low-resolution and small-object detection and improves the model's performance and robustness in ADDCE scenarios.
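The space-to-depth rearrangement plus non-strided convolution described above can be sketched in PyTorch as follows; this is a minimal illustration for scale = 2, and the channel sizes are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth followed by a stride-1 (non-strided) convolution.

    For scale = 2, H and W are halved and the channel depth grows 4x,
    so downsampling loses no pixel information; a stride-1 conv then
    learns features at the reduced resolution.
    """
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(in_ch * scale * scale, out_ch,
                              kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        s = self.scale
        # Split the map into s*s interleaved sub-maps and stack them
        # along the channel axis (the space-to-depth rearrangement).
        parts = [x[..., i::s, j::s] for i in range(s) for j in range(s)]
        return self.conv(torch.cat(parts, dim=1))
```

For a 16-channel 64x64 input with scale = 2, the SPD layer yields a 64-channel 32x32 tensor before the stride-1 convolution maps it to the desired output depth.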

3.2. C2f-MSDA

The C2f module [20] (shown in Figure 4a) first receives an input feature map, which is processed by a built-in convolution block. The result is split into two parts: one is passed directly to the subsequent Concat to be merged with the processed features; the other is passed through multiple Bottlenecks for further processing. The Bottleneck structure is shown in Figure 4b: the input feature map undergoes convolution, normalization and activation, which refines and enhances the features so that they carry more information related to apple defects. After Bottleneck processing, the output feature map is merged with the directly transmitted part along the channel axis via Concat, achieving multi-level feature fusion that helps the model exploit multi-level information about apple surface defects and improves the reliability of apple defect detection. After concatenation, the feature map passes through a final convolution block, which adapts the channel count and spatial size of the output to the needs of subsequent detection tasks. However, C2f increases the parameter count: when there are many Bottleneck blocks, Concat stitches together many similar feature maps or residual connections, so the parameters in C2f may be redundant and parameter utilization low. Inspired by Jin Zhu et al. [24], the recognition ability of an underwater garbage detection model was improved by introducing EMA into C2f with YOLOv8n as the baseline.
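The C2f data flow just described (expand, split, pass one branch through stacked Bottlenecks while keeping every intermediate map, concatenate, fuse) can be sketched in PyTorch as follows; batch normalization and SiLU activations of the real YOLOv8 blocks are omitted for brevity, so this is a structural sketch rather than the exact Ultralytics implementation.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with an optional residual shortcut
    (BN/SiLU omitted for brevity)."""
    def __init__(self, ch, shortcut=True):
        super().__init__()
        self.cv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.cv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    """Structural sketch of C2f: expand, split, run one part through
    n Bottlenecks keeping every intermediate output, concatenate all
    parts, then fuse with a final 1x1 convolution."""
    def __init__(self, in_ch, out_ch, n=2):
        super().__init__()
        hidden = out_ch // 2
        self.cv1 = nn.Conv2d(in_ch, 2 * hidden, 1)
        self.m = nn.ModuleList(Bottleneck(hidden) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * hidden, out_ch, 1)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # split into two parts
        for m in self.m:
            y.append(m(y[-1]))                  # keep each Bottleneck output
        return self.cv2(torch.cat(y, dim=1))    # multi-level fusion
```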
Given the similarity between underwater garbage detection and ADDCE in terms of complex environments, inconspicuous target features, and technical requirements, this paper incorporates the MSDA attention mechanism into C2f to further improve performance. With MSDA, the model focuses more on apple defect features when processing data; by ignoring irrelevant information, it pays more attention to the local regions and features of apples against complex backgrounds during feature fusion and enhancement, improving both detection accuracy and efficiency.
MSDA [25] is a novel multi-scale dilated attention proposed by Jiayu Jiao et al. in 2023 and is a core component of DilateFormer. Its main idea is to reduce the redundant computation of the global attention mechanism by modeling interactions between local, sparsely sampled image patches within a small range, while retaining multi-scale feature extraction. The design stems from an in-depth analysis of the global attention in vision transformers (ViTs), which found redundancy in global dependency modeling on shallow features that can be removed with local, sparse attention mechanisms. MSDA is mainly used for basic vision tasks, optimizing the extraction of multi-scale semantic information and reducing wasted computation in self-attention. Figure 5 depicts the structure of MSDA in detail. First, the input feature map is linearly projected into three new tensors, query, key and value, for subsequent computation. The channels of the feature map are then split into multiple parallel heads. Each head uses a different dilation rate (r = 1, 2, 3) and performs dilated attention within a sliding window to capture local and sparse patch interactions. Finally, the head outputs are merged by a Concat operation and passed through a linear layer that aggregates the feature information from the different heads into a unified representation for apple defect detection.
In this paper, MSDA is used to construct a new C2f, named C2f-MSDA; Figure 6 shows the structure in detail. Specifically, the Bottleneck of the Neck's C2f is replaced with MSDA: a convolution first extracts features and raises the dimension, a split operation then divides the feature map into two parts processed in parallel, one through the MSDA module and the other left unchanged. Finally, the parts are concatenated to obtain richer multi-scale, attention-weighted features, which are further fused and enhanced by a convolution.
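The split / attend / concat / fuse wiring of C2f-MSDA can be sketched as below. The real MSDA (sliding-window multi-scale dilated attention) is involved, so here `msda` is any drop-in channel-preserving module, with a 1x1 convolution standing in for it; the sketch shows only the control flow described above, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class C2fMSDA(nn.Module):
    """Sketch of the C2f-MSDA flow: expand with a convolution, split
    into two branches, send one branch through an MSDA-style attention
    module, concatenate, then fuse with a final convolution."""
    def __init__(self, in_ch, out_ch, msda=None):
        super().__init__()
        hidden = out_ch // 2
        self.cv1 = nn.Conv2d(in_ch, 2 * hidden, 1)        # expand
        self.msda = msda or nn.Conv2d(hidden, hidden, 1)  # stand-in for MSDA
        self.cv2 = nn.Conv2d(2 * hidden, out_ch, 1)       # fuse

    def forward(self, x):
        a, b = self.cv1(x).chunk(2, dim=1)  # split into two branches
        b = self.msda(b)                    # one branch through attention
        return self.cv2(torch.cat((a, b), dim=1))
```

A real MSDA module with the same channel-preserving interface could be passed in as `msda` without changing the surrounding wiring.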

3.3. Context Guided Feature Pyramid Network

In YOLOv8n, Concat is mainly used to connect feature maps from different levels, improving detection accuracy and processing efficiency. The Concat operation splices multiple input feature maps along the channel axis, so the channel count of the output equals the sum of the channel counts of all inputs. Concat can fuse feature information across levels, giving the model stronger context awareness and the ability to capture more detail, improving detection accuracy and robustness. As a common feature fusion method in deep learning models, Concat has clear advantages, but also shortcomings. Merging multiple feature maps along the depth dimension sharply increases the output channel count, raising the complexity and cost of subsequent computation. Direct splicing may also introduce parameter redundancy, hurting the model's generalization and training efficiency. Moreover, Concat is only simple feature splicing and lacks a flexible fusion strategy. Inspired by Linfeng Tang et al. [26], who introduced PSFusion, an innovative infrared and visible light fusion network that integrates progressive semantic enhancement with a scene fidelity maintenance mechanism to better serve machine vision and deep learning and has contributed greatly to target detection, and whose channel and spatial attention design provides an innovative module for finely fusing shallow features and enhancing feature expressiveness, this paper introduces a context-guided feature pyramid fusion mechanism that dynamically optimizes the fusion coefficients according to the richness of the feature maps and contextual cues.
Specifically, SE (Squeeze-and-Excitation) attention assigns a different weight to each feature channel, enhancing useful features and suppressing useless ones, to construct a more efficient, flexible and powerful context-guided feature pyramid network that addresses the limitations of Concat and optimizes ADDCE performance.
The traditional CNN architecture, built on convolution and pooling, is powerful and widely used but limited in its ability to mine relationships between feature channels, so some channels contribute little to a given task. To improve the network's ability to capture deep information in the data, Junhui Zhao et al. [27] focused on the channel dimension and proposed a new architectural unit that explicitly models the interdependence between convolutional feature channels to enhance the network's representation ability. The architecture overview is shown in Figure 7, with detailed implementation steps in Figure 8. The process comprises three main steps: Squeeze, Excitation, and Scale. Squeeze captures global feature information through spatial compression, usually implemented by Global Average Pooling: the feature map of each channel is compressed along the spatial dimensions into a single value representing that channel's global information, as shown in Formula (1). After Squeeze, the Excitation module generates a weight (Scale) for each channel by learning the dependencies between channels. The usual strategy is two fully connected layers with nonlinear activation functions (ReLU and Sigmoid) to strengthen the network's nonlinear capacity: the first fully connected layer reduces the dimension to cut computation, and the second restores the channel count.
ReLU introduces the nonlinearity, and at the end of the network a Sigmoid compresses the output into the range 0 to 1, yielding the weight coefficient for each channel, as shown in Formula (2); the activation functions are ReLU (inner) and Sigmoid (outer). Finally, the weights produced by Excitation are applied to the feature map X to obtain the required reweighted features, as shown in Formula (3). As a technique that effectively improves a model's understanding of channel importance, SE attention has performed well in computer vision tasks and has been widely adopted and praised.
$$Z = \mathrm{Squeeze}(X) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X(i,j) \tag{1}$$
$$S = \mathrm{Excitation}(Z) = \sigma\big(W_2\,\delta(W_1 Z)\big) \tag{2}$$
$$X_1 = \mathrm{Scale}(X, S) = X \otimes S \tag{3}$$
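The three SE steps of Formulas (1)-(3) map directly onto a few lines of PyTorch. Below is a minimal sketch; the reduction ratio r = 16 is the common default from the SE literature, not a value stated in this paper.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze-and-Excitation following Formulas (1)-(3):
    global average pooling (Squeeze), two FC layers with ReLU and
    Sigmoid (Excitation), and channel-wise reweighting (Scale)."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # dimensionality reduction
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),  # restore channel count
            nn.Sigmoid(),                        # weights in (0, 1)
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))            # (1) Squeeze: global average pool
        s = self.fc(z).view(n, c, 1, 1)   # (2) Excitation: channel weights
        return x * s                      # (3) Scale: reweight the features
```

Because the Sigmoid bounds each weight in (0, 1), every output channel is an attenuated copy of the input channel, which is exactly the "enhance useful, suppress useless" behavior described above.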
Based on SE attention, this paper draws on the idea of image fusion to fuse channels and designs CGFPN; the specific structure is shown in Figure 9. First, the input feature maps X0 and X1 are concatenated along the channel axis to create a combined feature map, which passes through SE attention and is then split to obtain the channel weights x0_weight and x1_weight. The weighted additions of x0 with x1_weight and of x1 with x0_weight then fuse the multi-feature information. Through this fusion mechanism, the model fully considers the mutual influence and complementarity of the feature maps during combination. These deeply fused feature maps are then spliced along the channel direction to form the final CGFPN output, enhancing the comprehensiveness and expressiveness of the output feature map. CGFPN is designed to use context information flexibly and effectively when fusing features of different scales, allowing precise tailoring and optimization of the feature representation. SE helps the fusion capture and integrate critical context, enhancing the effectiveness of the feature representation and guiding the model to understand and absorb the key features of the detection target, thereby achieving higher detection accuracy. Through the weighted feature recombination, the module enhances important features, suppresses unimportant ones, and improves the feature map's sensitivity to subtle differences. Its structure is relatively simple and introduces little computational overhead, making it suitable for ADDCE.
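One plausible reading of the CGFPN fusion step can be sketched as follows. The paper's exact weighting arithmetic is not fully specified in the text, so this is a hedged sketch under the assumption that the SE weights of each input cross-modulate the other input before addition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CGFPNFuse(nn.Module):
    """Sketch of CGFPN fusion: concatenate X0 and X1, derive SE-style
    channel weights from the combined map, split the weights, fuse by
    cross-weighted addition, and concatenate the fused results."""
    def __init__(self, channels, r=16):
        super().__init__()
        c2 = 2 * channels
        self.fc = nn.Sequential(              # SE on the concatenated map
            nn.Linear(c2, c2 // r), nn.ReLU(inplace=True),
            nn.Linear(c2 // r, c2), nn.Sigmoid())

    def forward(self, x0, x1):
        cat = torch.cat((x0, x1), dim=1)          # series connection
        n, c, _, _ = cat.shape
        w = self.fc(cat.mean(dim=(2, 3))).view(n, c, 1, 1)
        w0, w1 = w.chunk(2, dim=1)                # x0_weight, x1_weight
        f0 = x0 + x1 * w1                         # cross-weighted addition
        f1 = x1 + x0 * w0
        return torch.cat((f0, f1), dim=1)         # final CGFPN output
```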

3.4. Overall Network Architecture

In summary, Figure 10 is a schematic diagram of the proposed SMC-YOLOv8n algorithm. SPD-Conv is introduced before each C2f of the backbone; a new C2f-MSDA is designed at the Neck by discarding the original C2f's Bottleneck and introducing MSDA; and the channel splicing is replaced with CGFPN.

4. Experimental Results and Analysis

4.1. Experimental Environment and Parameter Configuration

The ADDCE experiments use Python 3.10 and the PyTorch 2.2.2 deep learning framework, running on the Windows 11 operating system with an NVIDIA GeForce RTX 4090 D GPU for acceleration and an Intel(R) Core(TM) i9-14900KF CPU. The specific experimental configuration parameters are summarized in Table 1.

4.2. Data Sets and Preprocessing

The dataset was captured in the agricultural standardization demonstration area of Songtuan Town and Hecun Village, Lieshan District, Huaibei City, Anhui Province, and at No. 100 Haoyi Shopping Supermarket, Xiangshan District, Huaibei City, Anhui Province. The research object was the Red Fuji apple. The acquisition equipment was a DJI Tello drone and five different models of mobile phone; Table 2 lists the names and resolutions of all acquisition devices. After removing failed and blurred captures, the dataset was expanded by cropping and rotation. In YOLOv8, photographing with devices of multiple resolutions enhances data diversity and helps the model learn varied features, improving its generalization; the model can then adapt to images taken by different devices and perform better in real scenes. Multi-resolution image augmentation also strengthens the model's robustness to different shooting conditions, reduces recognition errors, and optimizes overall detection performance. Finally, this approach better simulates the image quality differences of the real world, so the trained model performs better in practical applications. The expanded apple defect dataset contains 2400 sample images; examples are shown in Figure 11. (a) Frontlight: the light source is in front of the apple and illuminates it directly, clearly showing its details and color. (b) Backlight: the light source is behind the apple, which may cause details to be lost. (c) Uneven: uneven distribution of light, color or texture on the apple surface. (d) Long shot: the apple is photographed from a greater distance, providing wider scene context but possibly less detail. (e) Occlusion: the apple is partially occluded by other apples or branches.
(f) Market: apple sales and distribution environments, including supermarkets, fruit and vegetable markets, and online platforms. Manual annotation was completed with LabelImg, and the data were divided into training, validation and test sets at a ratio of 7:2:1 for model training, validation and testing. In this paper, the test set is the sole standard for evaluating the effectiveness of the proposed algorithm.
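A 7:2:1 split of the 2400-image set can be produced with a few lines of standard-library Python; the file names and seed below are illustrative, not the authors' actual pipeline.

```python
import random

def split_dataset(paths, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle image paths and split them into train/val/test
    subsets at the given ratios (7:2:1 by default)."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    paths = list(paths)
    rng.shuffle(paths)
    n = len(paths)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (paths[:n_train],                      # training set
            paths[n_train:n_train + n_val],       # validation set
            paths[n_train + n_val:])              # test set

train, val, test = split_dataset([f"img_{i}.jpg" for i in range(2400)])
# 2400 images -> 1680 train / 480 val / 240 test
```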

4.3. Evaluating Indicator

The experiments use several key indicators to comprehensively evaluate the detection system. Precision, computed by Formula (4), measures the system's ability to identify positive samples accurately: the higher the value, the larger the proportion of true positives among samples predicted positive, i.e., the fewer false positives, reflecting good precision performance. Recall, from Formula (5), evaluates the system's ability to retrieve positive samples: the larger the value, the more true positives the network finds and the fewer it misses, showing strong recall performance. In these formulas, TP, FP and FN denote the numbers of true positive, false positive and false negative samples, respectively. Average precision (AP) is the area under the P-R curve; it combines precision at different recall levels and is obtained by Formula (6) or the corresponding numerical integration. The higher the AP, the better the model balances precision and recall, and the better its overall performance. mAP, the mean of the per-class AP values over all categories, is calculated by Formula (7).
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$
$$AP = \int_{0}^{1} \mathrm{Precision}(t)\,\mathrm{d}t \tag{6}$$
$$mAP = \frac{1}{N}\sum_{n=1}^{N} AP_n \tag{7}$$
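Formulas (4)-(7) translate directly into code. The sketch below approximates the integral in Formula (6) by trapezoidal numerical integration over sampled P-R points, which is one of the standard numerical schemes rather than the exact interpolation a given benchmark may use.

```python
def precision_recall(tp, fp, fn):
    """Formulas (4) and (5): precision and recall from counts of
    true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

def average_precision(recalls, precisions):
    """Formula (6) via trapezoidal area under the P-R curve;
    `recalls` must be sorted in ascending order."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * \
              (precisions[i] + precisions[i - 1]) / 2
    return ap

def mean_average_precision(aps):
    """Formula (7): mean of the per-class AP values."""
    return sum(aps) / len(aps)
```

For example, a class with TP = 8, FP = 2, FN = 2 yields precision = recall = 0.8, and a model whose precision stays at 1.0 over the full recall range has AP = 1.0.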

4.4. Experimental Result

4.4.1. Ablation Experiment

To verify the effectiveness of the proposed algorithm for ADDCE, a set of ablation experiments was designed. Based on the YOLOv8n model, the effects of SPD-Conv, C2f-MSDA and CGFPN were verified in turn. The experimental results are shown in Table 3.
From the comparison in Table 3, SMC-YOLOv8n shows significant optimization and improvement in multiple core performance indicators compared with the baseline model. Its mAP50 and mAP50-95 increased by 2.7% and 4.1%, respectively. Figure 12a–c gives the loss curves, mAP50 and mAP50-95 training curves. Figure 12d is the P-R curve of the test set, and Figure 13d is the confusion matrix of the test set.

4.4.2. Contrast Test

To further verify the effectiveness and advancement of the proposed algorithm, we selected strong detection algorithms as baselines, including Faster R-CNN, RT-DETR [28], and, from the YOLO series, YOLOv3 [29], YOLOv5 [30], YOLOv6 [31] and YOLOv8 [20]. These algorithms are widely recognized and applied in target detection. In addition, to evaluate our algorithm's handling of non-salient features more comprehensively, we also include a detector that performs well in this respect, YOLOv8-C2f-Faster-EMAv3 [24], recently proposed by Jin Zhu et al. (2024), ensuring a comparison that is both broad and deep and fully verifies the innovation and superiority of our algorithm. All models were trained and evaluated under the same experimental settings. The results in Table 4 clearly indicate that the proposed algorithm outperforms other mainstream object detection algorithms on three key indicators: precision, recall and mAP. In summary, compared with a range of advanced detection techniques, the proposed algorithm shows clear competitive advantages and outstanding performance.

4.4.3. Visualization of Test Results

To further illustrate the detection effect of the proposed algorithm, six sets of detection results in complex scenes ((a) frontlight, (b) backlight, (c) uneven illumination, (d) long shot, (e) occlusion, (f) market) were randomly selected for visual presentation and subjective in-depth analysis, as shown in Figure 14 (the first row shows the original images, the second row the YOLOv8 detection results, and the third row the detection results of the proposed algorithm). Note: red boxes mark the detection results, purple circles mark false detections, and purple inverted triangles mark missed detections. The comparison further demonstrates the effectiveness and accuracy of the proposed algorithm in ADDCE.

4.4.4. Verifying the Generalization of the Proposed Algorithm

To verify the generalization performance of SMC-YOLOv8n on ADDCE, experiments were conducted on the public apple defect data set hosted on the PaddlePaddle AI Studio platform, available at https://aistudio.baidu.com/datasetdet//ail/53376 (accessed on 30 June 2024; training set: 499, validation set: 142, test set: 72). The training and evaluation of the model were performed under the same experimental settings. The comparison results in Table 5 show that, on this public data set with diverse target sizes, SMC-YOLOv8n exceeds YOLOv8n by 1.1% and 2.7% on mAP50 and mAP50-95, respectively, while the parameter count remains within an acceptable range. Compared with YOLOv8-C2f-Faster-EMAv3, mAP is also improved, which effectively verifies the efficiency and generalization performance of SMC-YOLOv8n.
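For reference, the 499/142/72 partition above corresponds to an approximate 70/20/10 split of the data set's 713 images. A hedged sketch of producing such a split (the function name, ratios, and seed are illustrative, not the platform's actual procedure):

```python
import random

def split_dataset(paths, ratios=(0.7, 0.2, 0.1), seed=0):
    # shuffle the image paths, then cut into train/val/test by ratio;
    # the test set takes whatever remains after the first two cuts
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

train, val, test = split_dataset(range(713))
# 713 images at 70/20/10 yields 499 / 142 / 72 images,
# matching the public-dataset split reported above
```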

5. Conclusions

In this study, we proposed SMC-YOLOv8n, an ADDCE network built on the basic framework of YOLOv8n. A brief summary of the results: an apple defect data set covering orchard and supermarket scenes under different occlusion and lighting conditions was created. Experiments show that placing the SPD-Conv module before each C2f layer of the backbone significantly improves performance on images with low pixel density and on the detection of apples with small defects. To address the shortcomings of YOLOv8 in small target detection, computational cost, generalization ability, and its redundant global attention mechanism, the introduction of MSDA brings a significant improvement. Through CGFPN, the sensitivity and accuracy of the feature layers in detecting small defects on apples are improved: the model locates apple defect information more accurately, and the flow of defect features through CGFPN is enhanced. SMC-YOLOv8n was comprehensively and meticulously tested and verified on the carefully constructed apple defect data set, achieving an mAP50 of 91.4% and an mAP50-95 of 70.9% on apples and apple defects. Compared with other strong detection algorithms, the proposed model offers excellent detection accuracy. Its performance on the shared apple data set shows that SMC-YOLOv8n has stronger detection ability for apples and defects in complex environments, with a clear mAP improvement and an effective reduction in both the false detection rate and the missed detection rate.
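The space-to-depth rearrangement underlying SPD-Conv, credited above with preserving fine detail for small-defect detection, can be sketched as follows. This NumPy illustration covers only the rearrangement itself (the full SPD-Conv block, per the cited design, follows it with a non-strided convolution):

```python
import numpy as np

def space_to_depth(x, scale=2):
    # x: (H, W, C) feature map; returns (H/scale, W/scale, C*scale*scale).
    # Each scale x scale spatial patch is moved into the channel axis,
    # so downsampling loses no information (unlike strided conv/pooling).
    h, w, c = x.shape
    x = x.reshape(h // scale, scale, w // scale, scale, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group (row-block, col-block) first
    return x.reshape(h // scale, w // scale, c * scale * scale)

x = np.arange(4 * 4 * 3).reshape(4, 4, 3)
y = space_to_depth(x)  # shape (2, 2, 12); every input value survives
```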
However, the existing data sets may not fully cover all types of defects, resulting in poor detection results for some defect types. The apple data set established in this study groups all defects into a single category, without considering the diversity of actual apple defect types. Subsequent research will focus on the diversity of defect types that may appear on the apple surface, including the subdivision of cracks, scars, and insect eyes.

Author Contributions

W.S.: writing—review and editing, conceptualization, resources; Y.Y.: writing—original draft, validation, methodology. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by the Anhui University Scientific Research Project (2022AH050392) and the Quality Engineering Project of Huaibei Normal University (2023jxyj020, 2023kcszkc016).

Data Availability Statement

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, W.; Zhang, J.; Guo, B.; Wei, Q.; Zhu, Z. An Apple Detection Method Based on Des-YOLO v4 Algorithm for Harvesting Robots in Complex Environment. Math. Probl. Eng. 2021, 2021, 7351470. [Google Scholar] [CrossRef]
  2. Gao, X.; Bian, X. Autonomous driving of vehicles based on artificial intelligence. J. Intell. Fuzzy Syst. 2021, 41, 4955–4964. [Google Scholar] [CrossRef]
  3. Su, W.; Xu, G.; He, Z.; Machica, I.K.; Quimno, V.; Du, Y.; Kong, Y. Cloud-Edge Computing-Based ICICOS Framework for Industrial Automation and Artificial Intelligence: A Survey. J. Circuits Syst. Comput. 2023, 32, 2350168. [Google Scholar] [CrossRef]
  4. Guo, X.; Shen, Z.; Zhang, Y.; Wu, T. Review on the Application of Artificial Intelligence in Smart Homes. Smart Cities 2019, 2, 402–420. [Google Scholar] [CrossRef]
  5. Zhong, G. Application and Development of Artificial Intelligence in Medical Field. Digit. Technol. Appl. 2019, 37, 195–196. [Google Scholar]
  6. Siddiqi, R. Automated apple defect detection using state-of-the-art object detection techniques. SN Appl. Sci. 2019, 1, 1345. [Google Scholar] [CrossRef]
  7. Herremans, E.; Melado-Herreros, A.; Defraeye, T.; Verlinden, B.; Hertog, M.; Verboven, P.; Val, J.; Fernández-Valle, M.E.; Bongaers, E.; Nicolaï, B.M.; et al. Comparison of X-ray CT and MRI of watercore disorder of different apple cultivars. Postharvest Biol. Technol. 2014, 87, 42–50. [Google Scholar] [CrossRef]
  8. Doosti-Irani, O.; Golzarian, M.R.; Aghkhani, M.H.; Sadrnia, H.; Doosti-Irani, M. Development of multiple regression model to estimate the apple’s bruise depth using thermal maps. Postharvest Biol. Technol. 2016, 116, 75–79. [Google Scholar] [CrossRef]
  9. Lu, Y.; Li, R.; Lu, R. Structured-illumination reflectance imaging (SIRI) for enhanced detection of fresh bruises in apples. Postharvest Biol. Technol. 2016, 117, 89–93. [Google Scholar] [CrossRef]
  10. Jarolmasjed, S.; Zúñiga Espinoza, C.; Sankaran, S. Near infrared spectroscopy to predict bitter pit development in different varieties of apples. J. Food Meas. Charact. 2017, 11, 987–993. [Google Scholar] [CrossRef]
  11. Zhang, B.; Liu, L.; Gu, B.; Zhou, J.; Huang, J.; Tian, G. From hyperspectral imaging to multispectral imaging: Portability and stability of HIS-MIS algorithms for common defect detection. Postharvest Biol. Technol. 2018, 137, 95–105. [Google Scholar] [CrossRef]
  12. Mizushima, A.; Lu, R. An image segmentation method for apple sorting and grading using support vector machine and Otsu’s method. Comput. Electron. Agric. 2013, 94, 29–37. [Google Scholar] [CrossRef]
  13. Ji, Y.; Zhao, Q.; Bi, S.; Shen, T. Apple Grading Method Based on Features of Color and Defect. In Proceedings of the 2018 37th Chinese Control Conference (CCC), Wuhan, China, 25–27 July 2018; pp. 5364–5368. [Google Scholar] [CrossRef]
  14. Nosseir, A.; Ahmed, S.E.A. Automatic Classification for Fruits’ Types and Identification of Rotten Ones Using k-NN and SVM. Int. J. Online Biomed. Eng. 2019, 15, 47. [Google Scholar] [CrossRef]
  15. Liang, X.; Jia, X.; Huang, W.; He, X.; Li, L.; Fan, S.; Li, J.; Zhao, C.; Zhang, C. Real-time grading of defect apples using semantic segmentation combination with a pruned YOLO V4 network. Foods 2022, 11, 3150. [Google Scholar] [CrossRef]
  16. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  17. Fan, S.; Liang, X.; Huang, W.; Zhang, V.J.; Pang, Q.; He, X.; Li, L.; Zhang, C. Real-time defects detection for apple sorting using NIR cameras with pruning-based YOLOV4 network. Comput. Electron. Agric. 2022, 193, 106715. [Google Scholar] [CrossRef]
  18. Yu, J.; Fu, R. Lightweight YOLOV5S-Super Algorithm for Multi-Defect Detection in Apples. Eng. Agríc. 2024, 44, e20230175. [Google Scholar] [CrossRef]
  19. Han, B.; Lu, Z.; Dong, L.; Zhang, J. Lightweight Non-Destructive Detection of Diseased Apples Based on Structural Re-Parameterization Technique. Appl. Sci. 2024, 14, 1907. [Google Scholar] [CrossRef]
  20. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024. [Google Scholar] [CrossRef]
  21. Zhang, J.; Huang, W.; Zhuang, J.; Zhang, R.; Du, X. Detection Technique Tailored for Small Targets on Water Surfaces in Unmanned Vessel Scenarios. J. Mar. Sci. Eng. 2024, 12, 379. [Google Scholar] [CrossRef]
  22. Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. arXiv 2022, arXiv:2208.03641. [Google Scholar]
  23. Ha, Y.S.; Oh, M.; Pham, M.V.; Lee, J.S.; Kim, Y.T. Enhancements in image quality and block detection performance for Reinforced Soil-Retaining Walls under various illuminance conditions. Adv. Eng. Softw. 2024, 195, 103713. [Google Scholar] [CrossRef]
  24. Zhu, J.; Hu, T.; Zheng, L.; Zhou, N.; Ge, H.; Hong, Z. YOLOv8-C2f-Faster-EMA: An Improved Underwater Trash Detection Model Based on YOLOv8. Sensors 2024, 24, 2483. [Google Scholar] [CrossRef] [PubMed]
  25. Jiao, J.; Tang, Y.M.; Lin, K.Y.; Gao, Y.; Ma, A.J.; Wang, Y.; Zheng, W.S. Dilateformer: Multi-scale dilated transformer for visual recognition. IEEE Trans. Multimed. 2023, 25, 8906–8919. [Google Scholar] [CrossRef]
  26. Tang, L.; Zhang, H.; Xu, H.; Ma, J. Rethinking the necessity of image fusion in high-level vision tasks: A practical infrared and visible image fusion network based on progressive semantic injection and scene fidelity. Inf. Fusion 2023, 99, 101870. [Google Scholar] [CrossRef]
  27. Zhao, J.; Ren, R.; Wu, Y.; Zhang, Q.; Xu, W.; Wang, D.; Fan, L. SEAttention-residual based channel estimation for mmWave massive MIMO systems in IoV scenarios. Digit. Commun. Netw. 2024; in press. [Google Scholar] [CrossRef]
  28. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  29. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  30. Khanam, R.; Hussain, M. What is YOLOv5: A deep look into the internal features of the popular object detector. arXiv 2024, arXiv:2407.20892. [Google Scholar]
  31. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Wei, X.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
Figure 1. ADDCE research work classification.
Figure 2. The overall architecture of YOLOv8.
Figure 3. SPD-Conv structure diagram.
Figure 4. C2f structure diagram.
Figure 5. MSDA structure diagram.
Figure 6. C2f-MSDA structure diagram.
Figure 7. SE Attention module.
Figure 8. SE Attention structure diagram.
Figure 9. Context guide feature pyramid network architecture diagram.
Figure 10. SMC-YOLOV8n.
Figure 11. Examples of some data sets.
Figure 12. Training curve and test curve.
Figure 13. Test set confusion matrix.
Figure 14. Part of the apple detection map.
Table 1. Experimental parameter configuration.

Parameter Name | Parameter Value
Image Size | 640 × 640 × 3
Learning Rate | 0.01
Batch Size | 32
Epochs | 300
Momentum | 0.937
Weight Decay | 0.0005
Optimizer | SGD
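The configuration in Table 1 maps directly onto the training arguments of a YOLOv8-style training run. A hedged sketch follows; the Ultralytics invocation is left commented out because it requires the `ultralytics` package and a dataset YAML, and the file names shown are placeholders, not the authors' actual files:

```python
# Hyperparameters taken from Table 1
train_args = dict(
    imgsz=640,            # input image size (640 x 640 x 3)
    lr0=0.01,             # initial learning rate
    batch=32,             # batch size
    epochs=300,
    momentum=0.937,
    weight_decay=0.0005,
    optimizer="SGD",
)

# Illustrative Ultralytics call (assumed setup, not the paper's code):
# from ultralytics import YOLO
# YOLO("yolov8n.yaml").train(data="apple_defects.yaml", **train_args)
```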
Table 2. Acquisition equipment parameters.

Acquisition Equipment | Resolution
Tello drone | 2592 × 1936
iPhone 13 Pro Max | 3024 × 4032
OPPO A32 | 720 × 1600
Redmi Note 11 | 2400 × 1080
Huawei P30 Pro | 3648 × 2736
Xiaomi 12S | 2400 × 1080
Table 3. Ablation experiment.

YOLOv8 | SPD-Conv | C2f-MSDA | CGFPN | mAP50 | mAP50-95 | Parameters/M
✓ | – | – | – | 0.887 | 0.668 | 3.01
✓ | ✓ | – | – | 0.909 | 0.703 | 3.33
✓ | – | ✓ | – | 0.898 | 0.684 | 2.65
✓ | – | – | ✓ | 0.894 | 0.685 | 3.31
✓ | ✓ | ✓ | – | 0.911 | 0.707 | 2.97
✓ | ✓ | ✓ | ✓ | 0.914 | 0.709 | 3.46
Table 4. Comparison of mainstream target detection algorithms.

Algorithm | Precision | Recall | mAP0.5 | mAP0.5:0.95 | Params/M
Faster R-CNN | 0.744 | 0.654 | 0.694 | – | 136.80
RT-DETR [28] | 0.841 | 0.776 | 0.841 | 0.604 | 3.28
YOLOv3 [29] | 0.854 | 0.869 | 0.893 | 0.693 | 103.69
YOLOv5 [30] | 0.867 | 0.863 | 0.898 | 0.672 | 2.50
YOLOv6 [31] | 0.88 | 0.836 | 0.883 | 0.663 | 4.23
YOLOv8 [20] | 0.857 | 0.858 | 0.887 | 0.668 | 3.01
YOLOv8-C2f-Faster-EMAv3 [24] | 0.876 | 0.844 | 0.893 | 0.669 | 2.65
Ours | 0.883 | 0.871 | 0.914 | 0.709 | 3.46
Table 5. Comparison of public dataset algorithms.

Algorithm | mAP0.5 | mAP0.5:0.95 | Params/M | FPS
Faster R-CNN | 0.528 | – | 136.80 | 14.58
RT-DETR [28] | 0.624 | 0.367 | 3.28 | 17.00
YOLOv3 [29] | 0.796 | 0.544 | 103.69 | 13.09
YOLOv5 [30] | 0.828 | 0.591 | 2.50 | 43.48
YOLOv6 [31] | 0.831 | 0.559 | 4.23 | 42.37
YOLOv8 [20] | 0.829 | 0.57 | 3.01 | 43.86
YOLOv8-C2f-Faster-EMAv3 [24] | 0.833 | 0.575 | 2.65 | 31.95
Ours | 0.84 | 0.597 | 3.61 | 32.70