YOLOv8n-CSE: A Model for Detecting Litchi in Nighttime Environments
Abstract
1. Introduction
- A lightweight litchi-cluster recognition network, YOLOv8n-CSE, is proposed for picking environments with weak light, enabling accurate identification of litchi clusters at night.
- The CPA-Enhancer image-enhancement module, which introduces chain-of-thought prompting, performs strongly on the task of recognizing litchi clusters in low-light nighttime scenes.
- The lightweight Slim-Neck structure and the EfficientViT multi-scale linear attention module further improve the model's performance in low-light environments (a minimal sketch of the linear-attention idea follows this list).
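For readers unfamiliar with linear attention, the sketch below illustrates the core idea behind EfficientViT-style multi-scale linear attention: replacing softmax with a ReLU feature map so that attention can be computed as φ(Q)(φ(K)ᵀV) and its cost scales linearly with token count. This is a minimal single-scale PyTorch illustration, not the authors' implementation; EfficientViT additionally aggregates Q/K/V at multiple scales with small depthwise convolutions.

```python
import torch

def relu_linear_attention(q, k, v, eps=1e-6):
    # ReLU linear attention: the softmax kernel is replaced by the ReLU
    # feature map, so attention factorizes as phi(Q) @ (phi(K)^T V),
    # which is linear rather than quadratic in the number of tokens.
    # Single-scale sketch only, under stated assumptions.
    q, k = torch.relu(q), torch.relu(k)              # feature maps phi(Q), phi(K)
    kv = torch.einsum("bnd,bne->bde", k, v)          # phi(K)^T V, summed over tokens
    z = torch.einsum("bnd,bd->bn", q, k.sum(dim=1))  # normaliser phi(Q) phi(K)^T 1
    out = torch.einsum("bnd,bde->bne", q, kv)        # phi(Q) (phi(K)^T V)
    return out / (z.unsqueeze(-1) + eps)             # (batch, tokens, dim_v)

# toy check: 1024 tokens, 64-dim heads -> O(N) rather than O(N^2) cost
q = k = v = torch.randn(2, 1024, 64)
print(relu_linear_attention(q, k, v).shape)          # torch.Size([2, 1024, 64])
```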
2. Materials and Methods
2.1. Dataset
2.1.1. Data Acquisition
2.1.2. Data Augmentation
2.1.3. Dataset Details
2.2. Methodology
2.2.1. YOLOv8n-CSE Model Architecture
2.2.2. Chain-of-Thought Prompting Mechanism
2.2.3. Lightweight Slim-Neck
2.2.4. EfficientViT Multi-Scale Linear Attention
2.3. Evaluation Metrics
2.4. Training Parameters and Experimental Environment
3. Results
3.1. Performance Evaluation of YOLOv8n-CSE
3.2. Ablation Experiments with YOLOv8n-CSE
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Dataset Split | Number of Images |
---|---|
Train | 1383 |
Validation | 395 |
Test | 198 |

The three subsets total 1976 images, corresponding to an approximate 70/20/10 train/validation/test split.
Parameter | Configuration |
---|---|
Operating system | Windows 10 |
Deep learning framework | PyTorch 1.12.1 (CUDA 11.6) |
Programming language | Python 3.8 |
GPU | NVIDIA GeForce RTX 3090Ti |
CPU | Intel(R) Core(TM) i7-10700 @ 2.90 GHz |
Parameter | Configuration |
---|---|
Epochs | 300 |
Initial learning rate | 0.01 |
Batch size | 16 |
Momentum | 0.937 |
Weight decay | 0.0005 |
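The configuration above maps onto the standard Ultralytics training API. The sketch below shows how such a run could be launched; it is a hedged illustration, not the authors' code, since the YOLOv8n-CSE architecture is not distributed with Ultralytics and the dataset YAML name is a placeholder.

```python
from ultralytics import YOLO

# Hypothetical reproduction of the training setup in the table above.
# "litchi_night.yaml" is a placeholder dataset config; the stock
# YOLOv8n model is used here purely for illustration.
model = YOLO("yolov8n.pt")
model.train(
    data="litchi_night.yaml",   # placeholder: train/val/test image paths
    epochs=300,                 # Epochs
    lr0=0.01,                   # Initial learning rate
    batch=16,                   # Batch size
    momentum=0.937,             # Momentum
    weight_decay=0.0005,        # Weight decay
)
```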
Model | P (%) | R (%) | mAP@0.5 (%) | F1-Score (%) | Processing Time per Photo (ms) | Param (M) |
---|---|---|---|---|---|---|
YOLOv8n | 93.28 | 87.07 | 94.83 | 90.07 | 25.00 | 3.01 |
RT-DETR-l | 93.87 | 91.33 | 95.40 | 92.58 | 41.52 | 31.98 |
YOLOv10n | 91.50 | 87.20 | 94.90 | 89.30 | 19.60 | 2.69 |
YOLOv8n-CSE | 96.37 | 94.73 | 98.86 | 95.54 | 36.50 | 4.93 |
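As a consistency check (not stated in the source), the F1-score column is the harmonic mean of the precision and recall columns; for example, the YOLOv8n row reproduces exactly:

$$F_1 = \frac{2PR}{P + R} = \frac{2 \times 93.28 \times 87.07}{93.28 + 87.07} \approx 90.07\%$$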
Model Abbreviation | P (%) | R (%) | mAP@0.5 (%) | F1-Score (%) | Param (M) |
---|---|---|---|---|---|
YOLOv8n | 93.28 | 87.07 | 94.83 | 90.07 | 3.01 |
YOLOv8n-C | 93.68 | 89.00 | 96.05 | 92.28 | 6.26 |
YOLOv8n-CS | 94.68 | 92.51 | 97.07 | 93.58 | 3.42 |
YOLOv8n-CSE | 96.37 | 94.73 | 98.86 | 95.54 | 4.93 |

(In the model names, the suffixes C, S, and E denote the successive addition of the CPA-Enhancer, Slim-Neck, and EfficientViT modules described in Section 2.2.)
Cite as: Cao, H.; Zhang, G.; Zhao, A.; Wang, Q.; Zou, X.; Wang, H. YOLOv8n-CSE: A Model for Detecting Litchi in Nighttime Environments. Agronomy 2024, 14, 1924. https://doi.org/10.3390/agronomy14091924