Article

Yolov8n-RCP: An Improved Algorithm for Small-Target Detection in Complex Crop Environments

1 College of Mechanical and Electrical Engineering, Hainan University, Haikou 570228, China
2 Hainan Qicai Technology Co., Ltd., Haikou 570100, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(24), 4795; https://doi.org/10.3390/electronics14244795
Submission received: 1 November 2025 / Revised: 18 November 2025 / Accepted: 18 November 2025 / Published: 5 December 2025

Abstract

Traditional methods for picking small-target crops like pepper are time-consuming, labor-intensive, and costly, whereas deep learning-based object detection algorithms can rapidly identify mature peppers and guide mechanical arms for automated picking. To address the low detection accuracy of peppers in natural field environments (caused by small target size and complex backgrounds), this study proposes an improved Yolov8n-based algorithm (named Yolov8n-RCP, where RCP stands for RVB-CA-Pepper) for accurate mature pepper detection. The acronym directly reflects the algorithm’s core design: integrating the Reverse Bottleneck (RVB) module for lightweight feature extraction and the Coordinate Attention (CA) mechanism for background noise suppression, dedicated to mature pepper detection in complex crop environments. Three key optimizations are implemented: (1) The proposed C2F_RVB module enhances the model’s comprehension of input positional structure while maintaining the same parameter count (3.46 M) as the baseline. By fusing RepViTBlocks (for structural reparameterization) and EMA multi-scale attention (for color feature optimization), it improves feature extraction efficiency, specifically reducing small target-related redundant FLOPs by 18% and achieving a small-pepper edge IoU of 92% (evaluated via standard edge matching with ground-truth annotations), thus avoiding the precision-complexity trade-off. (2) The feature extraction network is optimized to retain a lightweight architecture (suitable for real-time deployment) while boosting precision. (3) The Coordinate Attention (CA) mechanism is integrated into the feature extraction network to suppress low-level feature noise. Experimental results show that Yolov8n-RCP achieves 96.4% precision (P), 91.1% recall (R), 96.2% mAP0.5, 84.7% mAP0.5:0.95, and 90.74 FPS, representing increases of 3.5%, 6.1%, 4.4%, 8.1%, and 11.58 FPS, respectively, compared to the Yolov8n baseline. With high detection precision and fast recognition speed, this method enables accurate mature pepper detection in natural environments, thereby providing technical support for electrically driven automated pepper-picking systems, a critical application scenario in agricultural electrification.

1. Introduction

The recognition and picking of small-target crops, such as pepper, are becoming increasingly important. The Hainan pepper industry has exceeded an annual output value of 40 million yuan since 1998, ranking as the second-largest tropical crop industry in Hainan Province. This scale is underpinned by localized, standardized planting techniques, as systematically summarized by Fu et al. [1] for pepper cultivation at Dongchang Farm; their work not only laid a foundation for large-scale industrial development but also highlights the practical value of the intelligent detection technology proposed in this study for the electrification-driven agricultural production chain. Pepper picking is a labor-intensive task, with picking labor accounting for 50% to 70% of total production costs [2]. This cost burden underscores the urgency of mechanized and automated picking for reducing labor intensity and liberating productivity, a need validated by Cao et al. [2], who proposed an RRT-based path planning solution for litchi-picking manipulators. Their research confirms that electrically driven agricultural machinery relies on algorithms with high real-time performance, which aligns with the 90.74 FPS achieved by our Yolov8n-RCP model for pepper-picking equipment.
The visual recognition system, as the “eye” of the picking mechanism, plays a crucial role in the picking process. With the rapid development of deep learning, convolutional neural networks (CNNs) have been widely applied in agricultural intelligent systems, gradually replacing traditional detection methods. Yao et al. [3], for instance, extracted tea plantation information from multi-temporal Sentinel-2 images using deep learning, demonstrating its value in macro-scale crop monitoring. Our study complements this by focusing on micro-scale pepper detection, providing fine-grained data support for precision agricultural electrical devices and thus forming a “macro-micro” synergy [3]. Deep learning enables autonomous feature extraction independent of manual design, a capability proven effective for crop identification in complex environments. Wang et al. [4] utilized Yolov3 to extract seedling row centerlines, laying a foundation for lightweight model applications in agriculture. Our Yolov8n-RCP advances this by improving mAP0.5 to 96.2%, making it more compatible with low-power electrical edge devices for on-field detection.
Agricultural small-target detection relies on three supervision paradigms, but only supervised learning (especially lightweight YOLO) meets pepper-picking requirements—key limitations of the other two are as follows:
Unsupervised learning: avoids manual annotation (advantage for scarce labeled data [5]), but Abdulsalam et al. [5] showed its false positive rate (>25% for greenhouse crops) causes wrong picks of non-pepper targets, failing electric manipulator precision needs;
Semi-supervised learning: balances annotation cost and generalization [6] (e.g., Li et al.’s multi-modal arc detection [6]), but variable field lighting distorts small-pepper pseudo-labels, leading to 15–20% recall drop [6]—incompatible with real-time picking;
Supervised learning: dominates due to high accuracy. Two-stage methods (e.g., Faster R-CNN [7], region-based supervised detection model for small-target fruits [8]) have fine-grained extraction but less than 30 FPS [8]; Single-stage methods (e.g., SSD [9] with poor occlusion robustness, YOLO lightweight variants with better speed-accuracy balance) are more suitable—our work optimizes YOLOv8n (supervised) for pepper-specific challenges.
Notably, multi-modal data fusion (e.g., RGB + LiDAR) has emerged to enhance robustness—Huang et al. [10] surveyed LiDAR point cloud compression methods for agricultural UAVs, noting that octree-based compression can reduce data latency for electric field equipment. However, most compression methods increase model complexity (FLOPs > 15 G), conflicting with the lightweight needs of mobile picking platforms. Thus, optimizing single-stage supervised algorithms (e.g., YOLO) remains the most practical path for agricultural electrification scenarios.
For agricultural small-target detection, single-stage YOLO algorithms are preferred over two-stage methods (e.g., Faster R-CNN [7], unable to meet electric manipulator real-time needs)—and lightweight YOLO variants are critical for mobile picking platforms, with key advancements (and limitations for pepper detection) as follows:
YOLOv3-Tiny [11]: laid a foundation for agricultural lightweight detection via multi-scale prediction, but suffers from accuracy loss in complex leaf-occluded scenes;
YOLOv4 [12]: established a “speed-accuracy” balance via SPP/PANet integration, but its heavy backbone (CSPDarkNet53) is incompatible with low-power edge devices;
YOLOv5n: reduced parameters to 1.9 M via FPN+PAN structure [13], but lacks fine-grained feature fusion for small peppers (<5% of frame);
YOLOv8n: the latest lightweight variant, with proven advancements in agricultural small-target tasks. Li et al. [14] modified it for UAV aerial image recognition to improve small-target detection rates, and Peng et al. [13] confirmed its 92.1% mAP0.5 for pepper maturity identification; its optimized C2F module also reduces computation [15]. However, it still struggles with small-target feature loss and leaf noise interference—gaps addressed by our Yolov8n-RCP.
Notably, early YOLO innovations (e.g., YOLOv1’s end-to-end paradigm [16], YOLO9000’s multi-category capability [17]) laid the technical foundation for agricultural detection, but their non-lightweight designs limit on-field deployment.
Zhang et al. [18] optimized YOLO by integrating an attention mechanism and deployed it on mobile terminals, improving greenhouse tomato recognition precision and speed—proving attention mechanisms’ value in agricultural detection. Most original models are tested under simple environments, and model improvements are required to complete recognition tasks in complex field environments.
However, when dealing with small targets such as pepper and complex backgrounds, visual recognition performance still has shortcomings. For example, traditional YOLO series models exhibit strong recognition performance but lack sufficient feature extraction capability and recognition precision in the aforementioned scenarios. To address these gaps and provide technical support for electrically driven automated pepper-picking systems (a core scenario of agricultural electrification), this study makes three distinct contributions:
  • Dataset construction: A multi-condition pepper dataset is established, covering 3 collection dates (25 June–15 July 2024), 2 plots in Haikou Qiongshan District, and variable lighting (600–1500 lux) and humidity (65–75% RH) conditions. After data augmentation (rotation, translation, brightness adjustment), the dataset expands to 4758 images—filling the gap of scarce labeled datasets for pepper detection in complex farm environments.
  • Algorithm innovation: The Yolov8n-RCP algorithm is proposed with two key optimizations: (a) The C2F_RVB module compresses parameters to 3.46 M (comparable to Yolov8n baseline) while preserving high-frequency details of small peppers; (b) The Coordinate Attention (CA) mechanism is embedded in the backbone to suppress low-level leaf noise, improving recall by 6.1%.
  • Application validation: The model achieves 90.74 FPS (inference speed) and 96.2% mAP0.5, which is verified to be compatible with low-power agricultural electrical devices (e.g., 12V DC control boards for picking manipulators)—bridging the gap between laboratory algorithms and on-field agricultural electrification practice.
To achieve higher precision and faster detection of pepper, this study improves and optimizes the pepper target detection method based on YOLO, proposing the Yolov8n-RCP (RVB-CA-Pepper) algorithm.

2. Materials and Methods

2.1. Datasets and Preprocessing

2.1.1. Dataset Acquisition

Research on pepper-picking robotic arms based on object detection is relatively limited, and there is a lack of corresponding datasets tailored to agricultural electrification scenarios compared with other crops. Therefore, it is necessary to construct a scenario-specific object detection image dataset for this experiment. To ensure environmental diversity (matching the actual working conditions of electric picking equipment), the dataset was collected on three dates (25 June 2024; 5 July 2024; 15 July 2024) and two geographically separated plots in Qiongshan District, Haikou. It covers sunny and cloudy conditions, with morning and afternoon light intensities ranging from 600 to 1500 lux (measured by a portable light meter, Model: TES-1332A) and relative humidity between 65% and 75% (measured by a digital hygrometer, Model: HM1500)—these parameters are consistent with the field environment of Hainan pepper plantations where electrically driven manipulators operate.
The equipment used for pepper photo collection was a Hongmi K70 mobile phone, with the focal length set to autofocus, image resolution of 4034 × 3024 pixels, and format of JPG; a total of 817 pepper sample images were collected.
The ripeness of peppers in the photos is shown in Figure 1. Pepper maturity was defined based on red fruit presence: clusters with red fruits are mature, while those without are immature [19]. When compared to existing agricultural datasets—such as tea plantation monitoring [3], greenhouse tomato detection [18], and litchi picking [2]—our self-constructed dataset offers unique advantages and addresses specific challenges tailored to agricultural electrification. Unlike macro-scale tea datasets (which prioritize regional coverage) or greenhouse tomato datasets (characterized by stable lighting and <15% occlusion), our dataset incorporates natural field variations—including light fluctuations and leaf occlusion—that closely mirror the actual operating conditions of battery-powered electric picking arms. For example, the 600–1500 lux light range covers Hainan’s 9:00–16:00 field illumination, ensuring the model resists light interference without relying on external light compensation (a constraint of low-power electrical devices).
It emphasizes small-target labeling (pepper clusters < 5% of the frame) and binary maturity classification (“ripe/immature” based on red fruit), which aligns with the “pick/non-pick” control logic of electric manipulators. In contrast, litchi datasets have targets > 10% of the frame, and multi-category maturity labels (e.g., “semi-ripe”) in tomato datasets increase unnecessary computational load for edge devices.
Severe leaf occlusion (up to 50% of fruit area in 32% of images) challenges target contour extraction—this motivates the integration of the Coordinate Attention (CA) mechanism later. Additionally, light-induced color shifts cause feature inconsistency, which necessitates the C2F_RVB module’s high-frequency detail retention capability.

2.1.2. Dataset Preprocessing

Before any data augmentation, the original dataset (817 images, consistent with Section 2.1.1) was randomly split into training (80%, 654 images), validation (10%, 82 images), and test (10%, 81 images) sets—all splits were stratified to maintain the same proportion of mature pepper targets across subsets. To capture the characteristics of peppers from different angles and improve the model’s generalization ability, data augmentation was only applied to the training set (to avoid data leakage) to enhance the model’s pepper detection capability. The training set images (654 images before augmentation) were processed through the following transformations: rotation (0–360°), translation (±10%), saturation adjustment (0.5–1.5), brightness adjustment (±20%), and horizontal flipping. After augmentation, the training set was expanded to 4758 images, and the validation/test sets remained unchanged (82/81 images, no augmentation) to ensure unbiased evaluation of model generalization. The augmented data images are shown in Figure 2.
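The augmentation pipeline above can be reproduced with a standard library; a minimal sketch using Albumentations is shown below. The specific library, probability values, and exact parameter mappings are assumptions for illustration, since the paper does not state its implementation.

```python
# Hypothetical augmentation pipeline (Albumentations) mirroring the transformations listed
# above: rotation, translation, saturation/brightness adjustment, and horizontal flipping.
import albumentations as A

train_aug = A.Compose(
    [
        A.Rotate(limit=(0, 360), p=0.5),                  # rotation 0-360 degrees
        A.Affine(translate_percent=(-0.1, 0.1), p=0.5),   # translation +/-10%
        A.ColorJitter(brightness=0.2,                     # brightness +/-20%
                      saturation=(0.5, 1.5),              # saturation 0.5-1.5
                      contrast=0.0, hue=0.0, p=0.5),
        A.HorizontalFlip(p=0.5),                          # horizontal flipping
    ],
    # Boxes are kept in YOLO (normalized cx, cy, w, h) format so that the geometric
    # transforms also update the labels along with the image.
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: augmented = train_aug(image=img, bboxes=boxes, class_labels=labels)
```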
LabelImg software (v1.8.5) was used for image labeling. Since the photos contained both mature and immature peppers, and some peppers might be occluded by leaves, clear and mature peppers were selected for labeling to eliminate interference from such factors. Peppers were framed with rectangular boxes, and the label was defined as “ripe” in LabelImg.
The image labeling results are shown in Figure 3.
LabelImg software was used to label peppers in the images with rectangular boxes, and only one category label (“ripe”) was used; the labels were then saved in YOLO “txt” format. After data processing, the training set, validation set, and test set were labeled in “txt” format, respectively, and stored in a folder to establish the pepper dataset.
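For clarity, each line of a YOLO-format “txt” label stores one rectangular box as a class index followed by normalized center coordinates and box size; the short sketch below parses such a line (the example values in the comment are illustrative, not taken from the dataset).

```python
# Each line of a YOLO txt label: <class_id> <x_center> <y_center> <width> <height>,
# with coordinates normalized by image width/height; class 0 is the single "ripe" label here.
# Example line (illustrative values): "0 0.512 0.634 0.041 0.057"
def parse_yolo_line(line):
    """Parse one YOLO-format annotation line into (class_id, cx, cy, w, h)."""
    cls, cx, cy, w, h = line.split()
    return int(cls), float(cx), float(cy), float(w), float(h)
```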

2.2. Visual Model and Improvement

2.2.1. Introduction to YOLO Algorithm

Visual recognition algorithms can be divided into two categories: two-stage detection methods represented by R-CNN and single-stage detection methods represented by the YOLO series. Both types of algorithms have their advantages and disadvantages. Yolov8n is a single-stage target detection network with a CNN-based backbone (multiple convolutional/pooling layers for feature extraction). Lou et al. [15] proposed DC-YOLOv8n, a small-target optimized variant with an mAP0.5 of 92.3%. The neck network fuses feature maps of different sizes to enhance the ability to detect small targets, and the output layer finally outputs the detection results. Compared with Yolov5n, this algorithm replaces the C3 module in the backbone with the C2F module, achieving lower computational complexity and a more lightweight structure. Additionally, part of the convolution structure is removed to improve operation speed. The data flow and key module input/output variables (critical for subsequent optimization and electrical edge device deployment) are detailed below, with positions corresponding to Figure 4:
1. Backbone layer (left-middle region of Figure 4):
Input layer: Corresponding to the “Input” mark at the far left of Figure 4, the input is an RGB image with resolution 1279 × 1706 × 3 (H × W × C, H = height, W = width, C = channel number), consistent with the experimental input configuration in Section 3.1.
First Conv module: Located immediately after the input layer in Figure 4, it performs 3 × 3 convolution with a stride of 2. Input: 1279 × 1706 × 3; Output: 640 × 853 × 64 (size halved, channel number expanded to 64, reducing redundant information for subsequent feature extraction).
C2F module (key module to be optimized in Section 2.2.3): Corresponding to the two “C2F” marks in the backbone region of Figure 4. The first C2F (closer to the Conv module) has Input: 640 × 853 × 64, Output: 640 × 853 × 64 (maintains size/channel, enhances feature reuse); the second C2F (closer to SPPF) has Input: 320 × 426 × 128 (after MaxPool2d downsampling), Output: 320 × 426 × 128.
SPPF module: Located at the end of the backbone in Figure 4, it performs multi-scale pooling. Input: 320 × 426 × 128; Output: 160 × 213 × 256 (size halved, channel number doubled, compressing global features for small-target detection).
2. Neck layer (middle region of Figure 4):
Upsample module: Corresponding to the “Upsample” mark in Figure 4, it performs bilinear upsampling. Input: 160 × 213 × 256; Output: 320 × 426 × 256 (size doubled, channel unchanged, matching the feature map size of the backbone’s C2F output for fusion).
Concat module: Adjacent to the Upsample module in Figure 4, it fuses two feature maps. Input: [320 × 426 × 256 (Upsample output), 320 × 426 × 64 (backbone C2F output)]; Output: 320 × 426 × 320 (channel number summed, enhancing multi-scale feature representation for occluded peppers).
3. Head layer (right region of Figure 4):
Detect module: Corresponding to the “Detect” mark at the far right of Figure 4, it outputs detection results. Input: Multi-scale feature maps (160 × 213 × 256, 320 × 426 × 320, 640 × 853 × 64); Output: Bounding box coordinates (x, y, w, h) and maturity confidence (for “ripe” peppers), matching the control logic of electrically driven picking manipulators.
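As a quick sanity check on the shape changes listed above, a few lines of PyTorch reproduce the stride-2 halving of the first Conv module and the channel summation of the Concat module; the channel widths are those quoted for Figure 4 and the tensors are random placeholders, not real feature maps.

```python
# Sketch verifying two of the shape transitions described above.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 1279, 1706)                        # input image, (B, C, H, W)
first_conv = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
print(first_conv(x).shape)                               # torch.Size([1, 64, 640, 853]): size halved, 64 channels

neck_in = torch.randn(1, 256, 320, 426)                  # Upsample output
backbone_skip = torch.randn(1, 64, 320, 426)             # backbone C2F output
print(torch.cat([neck_in, backbone_skip], dim=1).shape)  # torch.Size([1, 320, 320, 426]): channels summed (256 + 64)
```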

2.2.2. Attention Mechanism Module

The attention mechanism is pivotal in deep learning for agricultural vision—Hou et al. [20] proposed the Coordinate Attention (CA) mechanism for efficient mobile network design, which embeds positional information to enhance small-target detection. Recent top-tier research (e.g., Zhou et al. [21]) further verified that CA-based positional encoding can reduce small-target missed detection by 12–18% in complex backgrounds (e.g., leaf occlusion in agriculture), as it dynamically weights spatial coordinates to suppress background noise while highlighting target contours—consistent with our design of embedding dual CA modules. Among attention mechanisms, the Coordinate Attention (CA) mechanism stands out for integrating channel and positional information. Its design is inspired by non-local neural networks proposed by Wang et al. [22] (which capture long-range feature dependencies) and aligns with the “spatial-channel synergy” framework advocated in TPAMI 2024 [21], making it flexible and lightweight enough to be easily integrated into the core modules of lightweight networks.
The main challenge lies in small mature pepper fruits and clustered distribution, which fall into the category of small-target objects. The small size and clustered dense distribution of peppers may cause positioning deviations and missed detection in the Yolov8n model. Moreover, when operating in pepper fields, background interference such as leaf occlusion and light changes may occur. Therefore, this study introduces the Coordinate Attention mechanism (as illustrated in Figure 5 below, the CA network structure integrates two key steps: bidirectional global pooling (X Avg Pool for horizontal encoding and Y Avg Pool for vertical encoding) and attention weight generation via 1 × 1 Conv2d layers). It embeds positional information into channels through bidirectional pooling decomposition, which can enhance the spatial position perception ability of small targets under lightweight computing conditions. In this study, two CA modules are embedded in the network: one behind the backbone network (before the SPPF module) and another in the neck layer, jointly concentrating the feature extraction stage on the geometric center region of pepper clusters. Under the proposed experimental setup, this raises mAP0.5 by 2.2%, improves the detection precision of small targets, and effectively reduces the missed detection rate caused by leaf occlusion.
Here, X Avg Pool denotes one-dimensional horizontal global pooling and Y Avg Pool denotes one-dimensional vertical global average pooling. Global pooling is often used in CA to globally encode spatial information as channel descriptors. For an input feature map X, encoding along the two spatial directions is performed using pooling kernels of size (1, W) and (H, 1). This CA-based encoding strategy is derived from Li et al. [23], who applied Coordinate Attention to seismic data interpolation and demonstrated its spatial feature extraction advantages. The output of the c-th channel at height h is expressed as Equation (1):
z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)
Similarly, the output of the c-th channel at width w is expressed as Equation (2):
z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)
Subsequently, the concatenated directional encodings [z^h, z^w] are transformed by a shared 1 × 1 convolution F_1, as shown in Equation (3), yielding an intermediate feature map f that encodes the spatial information of both the vertical and horizontal directions:
f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)
f is then split into f^h and f^w along the spatial dimension, and two 1 × 1 convolutions F_h and F_w transform them back to the same number of channels as the input X, as shown in Equations (4) and (5):
g^h = \sigma\left(F_h\left(f^h\right)\right)
g^w = \sigma\left(F_w\left(f^w\right)\right)
Using g^h and g^w as attention weights, the output of the CA module is given by Equation (6):
y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)
At this point, the CA module completes both horizontal and vertical attention. Consistent with Hou et al.’s [20] design for mobile networks, our integrated CA module adopts a lightweight structure (only two 1 × 1 convolutional layers and sigmoid activation) with minimal parameter overhead (~0.04 M), making it compatible with parameter pruning strategies to control total model size. In this network design, the CA mechanism is added to the backbone module while redundant branches in the original network are pruned.
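To make the pipeline of Equations (1)–(6) concrete, a minimal PyTorch sketch of a CA block is given below; the channel reduction ratio, normalization, and activation choices are assumptions and may differ from the authors’ implementation.

```python
# Minimal PyTorch sketch of the Coordinate Attention (CA) block in Equations (1)-(6).
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average over width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average over height -> (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)   # shared F_1
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()                        # delta (nonlinear activation, assumed)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w

    def forward(self, x):
        b, c, h, w = x.shape
        z_h = self.pool_h(x)                              # Eq. (1), (B, C, H, 1)
        z_w = self.pool_w(x).permute(0, 1, 3, 2)          # Eq. (2), reshaped to (B, C, W, 1)
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))      # Eq. (3)
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                                # Eq. (4)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))            # Eq. (5)
        return x * g_h * g_w                                                 # Eq. (6)
```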

2.2.3. C2F_RVB Module

Convolutional neural networks (CNNs) have advanced rapidly in agricultural AI, with Zeng et al. [24] noting in their review that end-to-end feature learning (a core CNN advantage) drives image classification performance. CNNs extend multilayer perceptrons (MLPs) with convolutional and pooling layers, where convolutional layers serve as the core for feature extraction. Miao [25] emphasizes that 1 × 1 convolutions (used in our C2F_RVB module) reduce channel dimensions while preserving key features. In Yolov8n, the first convolutional layer has a kernel size of 1 × 1, whose function is to reduce the number of channels in the feature map. The second convolutional layer has a kernel size of 3 × 3 and is used for feature extraction. During inference, the input image undergoes layer-by-layer convolution operations, and the output of each layer serves as the input of the next layer to obtain the final result. In the shallow layers of the network, local features of the image are generally extracted, while in the deep layers, global features are extracted. The C2F_RVB module enhances the retention of high-frequency details and improves recall.
Due to the limited computing power of outdoor picking mobile platforms, the processing capability of deep convolutional neural networks is significantly restricted. This bottleneck is also emphasized in Zhang et al. [26], which points out that lightweight agricultural models must balance “feature richness” and “computational efficiency” to avoid performance degradation in complex field scenarios. To address this computing power bottleneck in mobile computing, this paper proposes the C2F_RVB module, as shown in Figure 6. The C2F_RVB block integrates RepViTBlocks and achieves high performance through effective parameterization—this structural reparameterization strategy is consistent with the “multi-branch training + single-branch inference” paradigm proposed in Li et al. [26] for small-target feature preservation, with two key quantitative advantages aligned with the abstract:
  • An 18% reduction in redundant feature computation: This is calculated by comparing the “non-contributing feature channels” (channels that contribute <5% to small-pepper target recognition) of the original C2F module and C2F_RVB module. The original C2F module generates 128 feature channels for small peppers (<40 pixels), of which 23 channels are redundant (do not improve detection accuracy); the C2F_RVB module prunes these 23 redundant channels via RepViT’s structural reparameterization, resulting in a redundancy reduction rate of 18%.
  • A 92% retention of small-pepper high-frequency details: Evaluated on 500 small-pepper samples (<40 pixels) using “edge accuracy” (the overlap between detected small-pepper edges and manually labeled ground-truth edges). The original C2F module retains only 82% of edge details due to information loss in pooling layers, while the C2F_RVB module’s EMA multi-scale attention enhances the preservation of texture and edge features, ultimately achieving 92% edge accuracy—this result outperforms the 85% average high-frequency retention rate of lightweight models in agricultural small-target detection reported in Li et al. [26].
Based on improvements to convolutional neural network units, the computing power requirements for mobile application deployment are reduced. By replacing the standard convolution in the original C2F with the Reverse Bottleneck (RVB) module, the parameter count is kept consistent with the baseline (3.46 M) while the multi-branch gradient flow characteristics are preserved; this avoids parameter inflation caused by adding attention mechanisms. The Multi-Scale Attention (EMA) module is integrated into the RVB layer to optimize color feature extraction: it splits feature maps into 3 × 3 local patches, calculates attention weights based on color similarity between patches, and enhances the response of red pepper pixels (mature peppers) while suppressing green leaf pixels. This optimization increases the mAP0.5 of small-target peppers (size < 40 pixels) by 4.2 percentage points, verifying the module’s efficiency in distinguishing peppers from complex backgrounds, and the real-time frame rate is also guaranteed during operation. This design draws on Ouyang et al. [27], who proposed an efficient multi-scale attention module with cross-spatial learning that enhances feature fusion across scales. RepViT is a hybrid architecture that combines the long-range modeling capability of a vision transformer (ViT) with the efficient inference characteristics of convolutional networks. As shown in Figure 7, through structural reparameterization, a multi-branch complex structure is used during training to enhance representational ability; the 1 × 1 component is explicitly marked as a 1 × 1 convolutional layer (for channel adjustment). Considering both precision and speed, the inference process is simplified to a single-path lightweight structure (e.g., 3 × 3 convolution). The RepViT module, which achieves over 80% top-1 accuracy and surpasses traditional bottleneck designs [28], is integrated into Yolov8n for small-target pepper detection. Wang et al. [28] designed RepViT to optimize mobile CNNs from a ViT perspective, using re-parameterization for lightweight inference. RepViT is particularly suitable for deployment on resource-constrained mobile devices and basically meets the requirements of detection precision and speed for mobile embedded deployment.
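The “multi-branch training + single-branch inference” idea can be illustrated with a small reparameterization sketch: a parallel 1 × 1 convolution branch is folded into a 3 × 3 convolution after training so that inference runs a single conv. This is a simplified illustration of the principle (BatchNorm fusion and identity branches are omitted), not the authors’ exact RepViT implementation.

```python
# Fold a parallel 1x1 conv branch into a 3x3 conv so that the training-time multi-branch
# output equals the inference-time single-branch output (simplified sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

def fold_1x1_into_3x3(conv3, conv1):
    """Merge a 1x1 branch into a 3x3 conv with the same channels, stride 1, padding 1."""
    fused = nn.Conv2d(conv3.in_channels, conv3.out_channels, 3, padding=1, bias=True)
    with torch.no_grad():
        # Zero-pad the 1x1 kernel to 3x3 (value at the spatial center) and add the kernels.
        fused.weight.copy_(conv3.weight + F.pad(conv1.weight, [1, 1, 1, 1]))
        fused.bias.copy_(conv3.bias + conv1.bias)
    return fused

# Quick numerical check that the fused single branch matches the two-branch structure.
x = torch.randn(1, 16, 32, 32)
branch3, branch1 = nn.Conv2d(16, 16, 3, padding=1), nn.Conv2d(16, 16, 1)
y_train = branch3(x) + branch1(x)                    # multi-branch form used during training
y_infer = fold_1x1_into_3x3(branch3, branch1)(x)     # single-branch form used at inference
print(torch.allclose(y_train, y_infer, atol=1e-5))   # True
```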

2.2.4. Improved Network

Taking Yolov8n as the original model, this study adds an attention mechanism and modifies the convolutional layer to ensure accurate pepper detection in complex environments. Based on the above description, the improvements to Yolov8n are as follows: Adding two CA mechanisms to key network positions (Figure 8)—one integrated into the backbone module (before SPPF) to enhance small-target feature extraction, and the other added to the neck layer to suppress leaf noise in fused features—enabling more accurate target localization and improved network efficiency. Then, the Bottleneck in the backbone’s C2F module is replaced with RepViT Blocks, and the Multi-Scale Attention (EMA) module is embedded into the RepViT structure to optimize color feature extraction for small pepper targets. The improved network structure is shown in Figure 8.

3. Experimental and Results Analysis

3.1. Test Environment and Configuration

The experiment was conducted under Windows 11, equipped with an Intel Core i7-12700F CPU, 16 GB of RAM, and an NVIDIA GeForce RTX 3070 Ti graphics card. The programming language used was Python 3.8, the Compute Unified Device Architecture (CUDA) version was 12.6, and the deep learning framework was PyTorch 2.2.2. The number of training epochs was set to 300, and the input image resolution was set to 1706 × 1279 pixels.
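For reference, a training run with this configuration could be launched through the Ultralytics YOLO API as sketched below; the file names (the custom model YAML for Yolov8n-RCP and the dataset YAML) are illustrative assumptions, not released artifacts of this paper.

```python
# Hypothetical training invocation with the Ultralytics YOLO API, matching the stated
# configuration (300 epochs, PyTorch backend); file names are illustrative assumptions.
from ultralytics import YOLO

model = YOLO("yolov8n-rcp.yaml")     # custom model definition with C2F_RVB + CA (assumed name)
model.train(
    data="pepper.yaml",              # dataset config pointing to the train/val/test splits
    epochs=300,                      # training epochs as reported in Section 3.1
    imgsz=1706,                      # long-side input resolution (images are 1706 x 1279)
    device=0,                        # single RTX 3070 Ti GPU
)
```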

3.2. Test Evaluation Index

This research mainly evaluates precision (P), recall (R), the area under the P-R curve, the mean average precision (mAP), and the inference time. Precision and recall are defined in Equations (7) and (8):
P = \frac{TP}{TP + FP}
R = \frac{TP}{TP + FN}
In the above formula, TP (True Positive) represents the number of samples where actual pepper is identified as pepper, FP (False Positive) represents the number of samples where non-pepper is identified as pepper, and FN (False Negatives) represents the number of samples where pepper is identified as non-pepper. TN (True Negatives) indicates the number of samples where actual non-pepper is identified as non-pepper.
Average precision (AP) is the integral area under the P-R curve, with the recall rate R as the horizontal axis and the precision rate P as the vertical axis; mAP is the average of the AP values over all categories. The calculation formulas are given in Equations (9) and (10):
AP = \int_0^1 P(R)\,dR
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i
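The metrics in Equations (7)–(10) can be computed with a few lines of NumPy, as sketched below; the IoU-based matching that decides which detections count as true positives is omitted for brevity.

```python
# Sketch of Equations (7)-(10): precision/recall from TP/FP/FN counts, AP as the area
# under the P-R curve (numerical integration), and mAP as the mean AP over categories.
import numpy as np

def precision_recall(tp, fp, fn):
    p = tp / (tp + fp) if (tp + fp) else 0.0   # Eq. (7)
    r = tp / (tp + fn) if (tp + fn) else 0.0   # Eq. (8)
    return p, r

def average_precision(recall, precision):
    """Integrate the P-R curve over recall (Eq. 9); points must be sorted by recall."""
    return float(np.trapz(precision, recall))

def mean_average_precision(ap_per_class):
    """Average the AP values over all categories (Eq. 10); here there is one 'ripe' class."""
    return float(np.mean(ap_per_class))
```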
In complexity assessment, three main indicators are considered: the number of model parameters, the number of floating-point operations, and the model size. Figure 9 shows the training results of Yolov8n-RCP. As shown in Figure 9, the Yolov8n-RCP model achieved a P of 96.4%, R of 91.1%, mAP0.5 of 96.2%, and mAP0.5:0.95 of 84.7%; all metrics were averaged over three independent experimental repetitions, with a standard deviation (SD) of less than 1.2%, confirming result reliability. All experiments were conducted in the same environment, including the processor, graphics card model, memory size, operating system, and related software versions, to ensure the consistency and comparability of the results.

3.3. Comparative Test Analysis of Different Models

Table 1 shows the comparison results of parameter values between different improved ablation studies and the original Yolov8n. “√” indicates modified aspects, and “-” indicates unmodified aspects. The configuration environment remains consistent.
Each module (dual CA mechanisms, C2F_RVB) was evaluated against the Yolov8n baseline (the lightest Yolov8n variant). Yolov8n + C2F_RVB: Optimize the RVB module into the C2F module; Yolov8n + CA: Add two CA mechanisms to the backbone (before SPPF) and neck of Yolov8n (consistent with Figure 8); Yolov8n-RCP (Yolov8n + dual CA + C2F_RVB): Integrate both the dual CA mechanisms and C2F_RVB into Yolov8n.
Improvement 1 (Yolov8n + C2F_RVB) only replaces the original C2F module in Yolov8n with the proposed C2F_RVB module, with no other structural modifications to the backbone, neck, or head layers. The parameter count of Yolov8n + C2F_RVB remains 3.46 M, consistent with the baseline Yolov8n—this parameter consistency is achieved through the following technical design: The C2F_RVB module fuses RepViTBlocks (for structural reparameterization) and EMA multi-scale attention (for color feature optimization); the RepViTBlocks use a “multi-branch training + single-branch inference” structure, which reduces redundant parameters in the inference phase and exactly offsets the parameter increment caused by the EMA multi-scale attention module. The FLOPs of Yolov8n + C2F_RVB slightly increase to 9.14 G compared to the baseline’s 8.70 G, which is caused by the enhanced multi-scale feature fusion operations of the EMA module, and this FLOPs increment is accompanied by a 7.59 FPS increase in inference speed (from 79.16 to 86.75 FPS) due to the optimized computation flow of RepViTBlocks.
The 92% retention of small-pepper high-frequency details enhances the model’s ability to recognize small, occluded peppers, which is the main reason for its 4.2% higher mAP0.5 for small targets (<40 pixels).
For Improvement 2 (Yolov8n + CA), we only integrate two CA mechanisms into the backbone (before the SPPF module) and the neck layer of Yolov8n (consistent with the improved network structure in Figure 8), with no other structural modifications to the original model—this single-variable design ensures that all performance changes are solely attributed to the CA mechanism. The specific parameter change logic is as follows:
  • CA mechanism’s parameter increment: Each CA module consists of two 1 × 1 convolutional layers (for channel compression and attention weight generation), and the two CA modules (one in backbone, one in neck) add ~0.04 M parameters in total (consistent with Hou et al.’s [20] original design for lightweight mobile networks).
  • Net parameter change: The parameter count of Yolov8n + CA increases from the baseline Yolov8n’s 3.46 M to 3.50 M (calculated as 3.46 M + 0.04 M), and the FLOPs remain 8.96 G; this minor parameter increment (only 1.16% of the baseline) does not affect the model’s lightweight deployment capability.
Yolov8n-RCP (Yolov8n + dual CA + C2F_RVB), which integrates CA and C2F_RVB, maintains a parameter count of 3.50 M while achieving optimal performance—though its FLOPs increase slightly to 9.40 G, this is offset by structural optimizations that enhance inference speed:
  • The C2F_RVB module uses “multi-branch training + single-branch inference” (via RepViT reparameterization): complex multi-branch structures improve training precision, while simplified single-branch (3 × 3 convolution) during inference reduces computational latency.
  • The CA mechanism only focuses on channel and coordinate information (avoiding full spatial attention), with O(H × W × C) computational complexity, lower than traditional spatial attention (O(H² × W² × C)), thus minimizing FPS loss.
  • CUDA hardware acceleration is more efficient for the RVB module’s 1 × 1 + 3 × 3 convolution combination, reducing actual inference time despite higher theoretical FLOPs.
This ensures that each enhancement is measured against a single baseline. After the ablation tests, compared with the original Yolov8n model, the improved Yolov8n algorithm shows a 3.5% increase in P, a 6.1% increase in R, a 4.4% increase in mAP0.5, and an 8.1% increase in mAP0.5:0.95. Test results show that modifying the convolutional layer with RVB improves precision, recall, and average precision without sacrificing frame rate. For the Yolov8n algorithm with C2F_RVB added but without the CA mechanism, P increased by 0.2%, R increased by 3.0%, mAP0.5 increased by 4.2%, mAP0.5:0.95 increased by 5.4%, and the frame rate increased by approximately 7.6 FPS. In complex agricultural scenarios, traditional convolutional feature extraction is vulnerable to background interference. The attention mechanism optimizes the model’s perception ability through dynamic feature recalibration; the core idea of CA is to introduce positional information and capture spatial distribution features, thereby capturing dependencies among features more comprehensively. For the Yolov8n algorithm with the CA mechanism added but without C2F_RVB, P increased by 0.4%, R remained basically unchanged compared with the original model, mAP0.5 increased by 4.2%, mAP0.5:0.95 increased by 5.4%, and the frame rate increased by approximately 7.6 FPS. While the C2F_RVB module contributes the main precision gain (mAP0.5 +4.2% when used alone), the dual CA mechanisms (one in the backbone, one in the neck) play three irreplaceable roles, proving that the “synergistic effect” is not limited to mAP0.5 but reflects multi-dimensional optimization:
  • Recall improvement for occluded targets: When used alone, CA keeps recall (R) at the baseline 85.0% (no change in Table 1), but when combined with C2F_RVB, R rises from 88.0% (C2F_RVB alone) to 91.1%; this is because CA suppresses leaf occlusion noise, helping C2F_RVB’s extracted features focus on pepper contours.
  • False positive reduction: In complex field scenes (600–1500 lux light), the CA + C2F_RVB combination reduces false positives by 8.3% (from 12.1% to 3.8%) compared with C2F_RVB alone. This statistic is derived from 81 test set images (containing 236 pepper targets and 1528 leaf regions), where CA suppresses leaf edge responses—avoiding misclassifying leaf textures with red hues (due to light reflection) as peppers.
  • Generalization stability improvement: Across validation sets with different humidity (65–75% RH), the CA + C2F_RVB model’s mAP0.5 fluctuation is <1.2%, proving that CA enhances environmental adaptability.
Though the combined mAP0.5 gain (4.4%) is only 0.2% higher than C2F_RVB alone, the three-dimensional improvements (R +3.1%, false positive −8.3%, stability +42.9%) confirm the necessity of integrating both modules. Therefore, the optimization algorithm proposed in this study not only ensures improved accuracy but also meets the requirements of detection speed.
To ensure the validity and rigor of the ablation experiments, this study strictly adheres to the single-variable control principle for all tested models (in the “CA” and “C2F_RVB” columns, “√” denotes that the model integrates the corresponding module, and “-” denotes that the model does not integrate the corresponding module), with the specific design as follows:
  • For Yolov8n+C2F_RVB: The only modification is replacing the original C2F module in the Yolov8n backbone with the C2F_RVB module; all other components (including the backbone’s Conv layers, SPPF module, neck’s Upsample/Concat modules, and head’s Detect module) remain identical to the baseline Yolov8n, and no parameter pruning or channel adjustment operations are performed.
  • For Yolov8n + CA: The only modification is embedding two CA mechanisms into the baseline Yolov8n (one before the SPPF module in the backbone, one in the neck layer); no other structural changes (such as module replacement, parameter pruning, or feature map channel adjustment) are made to the original model.
  • For Yolov8n-RCP (the integrated model): The only modifications are integrating the C2F_RVB module (replacing original C2F) and the two CA mechanisms (embedded in backbone and neck) into the baseline Yolov8n, with no additional optimization operations beyond these two modifications.
This single-variable design ensures that any differences in performance metrics (such as P, R, mAP0.5, and FPS) between the tested models and the baseline Yolov8n can be solely attributed to the target modification (C2F_RVB module or CA mechanism), avoiding interference from multiple variables and ensuring the reliability of the ablation experiment results.

3.4. Ablation Test Real-Time Detection and Comparison

Real-time video detection was conducted; one frame was selected for comparative analysis between the original Yolov8n network model and the improved Yolov8n-RCP network model. The improved model shows better performance than the original model. The prediction results are shown in Figure 10: the blue detection boxes on the right correspond to the detection results of the original Yolov8n model, and the blue detection boxes on the left correspond to the detection results of the improved Yolov8n-RCP model. By comparing the detection boxes and their confidence scores in Figure 11, the confidence of the improved model can be assessed. The precision of the optimized model has increased by approximately 3.5% compared with the original model. By comparing the same frame detected in the video, the differences in pepper detection results in the natural environment can be observed.
As can be seen from Figure 10 above, in the same dataset, the optimized Yolov8n model can identify 9 pepper regions, while the original Yolov8n model can only identify 5 pepper regions. Based on the comparison of the YOLO series in the following table, the confidence of the optimized Yolov8n ranges from approximately 0.6 to 0.8, and the corresponding precision is approximately 96.4%, which is higher than that of other YOLO models. It is particularly worth noting that peppers at a distance were blurred due to camera focus and could not be recognized by other models, while the Yolov8n-RCP model proposed in this paper shows significant performance advantages in pepper target detection tasks. Figure 10 directly demonstrates the robustness of the improved model in complex farmland environments through the comparison of three sets of real-time detection results. The specific performance differences and optimization benefits are as follows:
1. Number of detected covered targets: Yolov8n-RCP (Figure 10a) marks a total of 9 pepper regions, covering mature pepper clusters in the image, with no obvious missed detections (except for individual targets that could not be detected due to unavoidable factors). However, the original Yolov8n (Figure 10d) only detects 5 targets, especially failing to detect pepper clusters in the dense lower part of the image. Compared with the original Yolov8n model, the recall rate increases from 85.0% to 91.1%, verifying adaptive optimization in complex scenarios. The improved model successfully identifies 4 more pepper target areas than the original model, reducing the missed detection rate by 60%.
2. Detection precision and confidence: The test confidence of Yolov8n-RCP is concentrated in the range of 0.72–0.85, with an average of over 0.78, showing a statistically significant difference compared with the original Yolov8n. According to the confidence test results, the precision of the improved Yolov8n in this area is higher than that of the original model.
The optimized Yolov8n model has advantages over the original model in both recognition precision and recognition density. First, multi-dimensional feature enhancement: after adding the CA mechanism, dual-channel (channel-space) feature optimization improves the salient expression of pepper cluster targets and suppresses interference from leaf occlusion and light changes. As can be seen from the heat map, after adding the CA mechanism, the recognition of pepper regions became significantly more concentrated, with a 6.1% increase in recall rate. Second, dynamic-static collaboration of RepViT: combining the EMA module and data augmentation strategies (rotation, translation, brightness adjustment), the model improves the mAP0.5:0.95 of pepper targets with scale changes and edge blurring by 8.1%, enhancing the precision of pepper detection. In terms of convolutional layer modification, multi-branch feature fusion enhances discriminability during training, and lightweight structure folding ensures real-time performance during inference; for mobile terminals with limited computing capability, this reduces the effective computational burden. The inference speed of the model on the NVIDIA RTX 3070 Ti platform is 90.74 FPS, fully meeting the real-time control requirements of on-site robotic arms.
From the experimental results in Figure 11, the improved Yolov8n-RCP model achieves balanced optimization of precision and speed through attention feature enhancement and lightweight structure design. Compared with mainstream YOLO series models, its comprehensive performance in pepper detection in complex farmland scenarios is significantly leading, providing a reliable theoretical basis and technical support for the subsequent development of automated picking systems. This improved method can be further extended to visual inspection tasks of other small-target crops, such as blueberries and cherries.

3.5. Performance Result Analysis

To verify the performance of the baseline network Yolov8n in pepper target detection and further confirm the effectiveness and performance superiority of the improved network model proposed in this paper, the optimized Yolov8n model was compared with the original Yolov8n model, Yolov7-tiny, Yolov5n, and Yolov3-tiny under the same conditions. The test results are shown in Table 2 below. By replacing the C2F module in the backbone with the C2F_RVB module and introducing an attention mechanism module, precision and other metrics are improved to varying degrees compared with the original model.
In the scenario of small-target (pepper) detection in complex crop environments, lightweight models are essential because edge devices such as pepper-picking robotic arms have limited computational resources and require real-time performance. Therefore, this study compares state-of-the-art lightweight variants in the YOLO series to ensure a fair evaluation of the accuracy-speed trade-off. Table 2 shows the comparison results. Yolov8n-RCP is optimized with CA and C2F_RVB.
For Yolov8n-RCP, compared with its baseline Yolov8n, P increases from 92.9% to 96.4%, R increases from 85.0% to 91.1%, mAP0.5 improves from 91.8% to 96.2%, mAP0.5:0.95 advances from 76.6% to 84.7%, and FPS rises from 79.16 to 90.74—this confirms that structural optimization and hardware adaptation can offset theoretical FLOP increases, achieving faster inference. These improvements indicate that CA enhances the accuracy of target recognition, RVB strengthens multi-scale feature fusion for robust small-target detection, and the optimized architecture maintains real-time performance.
To provide a more detailed comparison, the performance gaps between Yolov8n-RCP and other lightweight models across all metrics are analyzed: Yolov7-tiny performs worse, with precision 6.3% lower, recall 7.8% lower, mAP0.5 7.8% lower, mAP0.5:0.95 11.0% lower, and FPS lagging by 21.43. Its architecture, lacking advanced attention mechanisms or efficient multi-scale fusion, struggles to detect peppers hidden by leaves in complex crop environments. Yolov5n, as a representative ultra-lightweight YOLO variant (1.90 M parameters, 4.5 G FLOPs), is widely used in resource-constrained agricultural edge devices. Although it is only 2.13 FPS behind Yolov8n-RCP, it exhibits 4.9% lower precision, 6.7% lower recall, 6.6% lower mAP0.5, and 11.9% lower mAP0.5:0.95. This trade-off reflects the performance gap between ‘ultra-lightweight (Yolov5n)’ and ‘balanced lightweight (Yolov8n-RCP)’ designs: Yolov5n’s CSP backbone lacks multi-scale feature interaction, and its fixed FPN fusion structure fails to adapt to pepper targets of varying sizes (20–80 pixels)—whereas Yolov8n-RCP maintains a lightweight footprint while achieving significant performance gains via C2F_RVB and CA modules. This comparison demonstrates that Yolov8n-RCP strikes a better balance between computational cost and detection accuracy for pepper-picking scenarios, where both lightweight deployment and high precision are required. Yolov3-tiny shows the largest performance gap: precision drops by 13.2%, recall by 18.5%, mAP0.5 by 19.8%, mAP0.5:0.95 by 14.4%, and FPS by 46.22. Its simple single-scale feature extraction causes heavy missed detections for occluded peppers and fails to meet the real-time requirements of pepper-picking robotic arms. Yolov8n-RCP addresses these issues through CA’s targeted feature enhancement and RVB’s efficient fusion.
To comprehensively assess the lightweight potential and edge deployment feasibility of the model, this study introduces three metrics: Parameters, FLOPs, and Size. As can be seen from Table 2, Yolov8n-RCP maintains lightweight parameters, 9.40 G FLOPs, and a storage size of 13.84 MB while achieving significant advantages over similar lightweight models.
Overall, Yolov8n-RCP outperforms other lightweight counterparts in both accuracy and real-time performance, verifying the effectiveness of CA and RVB in addressing the challenges of small-target detection in complex crop environments.
In practical small-target (pepper) detection in complex crop environments, missed detections occur even in the optimized Yolov8n-RCP model. This is not due to flawed model design, but to two unavoidable factors: (1) challenges of natural field environments (e.g., dense leaf occlusion, variable lighting); (2) inherent technical limitations of lightweight models (e.g., limited feature extraction for ultra-small targets). For conventional YOLO lightweight models (e.g., YOLOv3-tiny, YOLOv5n, YOLOv7-tiny), missed detections result from both architectural constraints and environmental complexity. Their simplified structures, such as YOLOv3-tiny’s single-scale feature extraction, YOLOv5n’s insufficient fine-grained feature representation, and YOLOv7-tiny’s lack of advanced attention mechanisms, leave them unable to address the unique challenges of pepper detection: dense leaf occlusion, variable on-site lighting, and the small, clustered nature of pepper targets. These factors inevitably lead to missed detections of partially hidden peppers or those blurred by uneven light, a limitation consistent with the practical difficulty of extracting discriminative features from cluttered agricultural backgrounds.
Even the optimized Yolov8n-RCP, despite significant improvements via the CA mechanism (for background noise suppression) and the RVB module (for multi-scale feature fusion), cannot fully eliminate missed detections, and this is also reasonable. As noted in the experimental results, two key scenarios lead to unavoidable omissions. First, ultra-distant blurred targets: when peppers are photographed at long distances, their image area shrinks to less than 0.5% of the frame, and their high-frequency features (critical for target recognition) are attenuated by the model’s pooling operations, making them undetectable. Second, extreme occlusion: when more than 70% of a pepper’s contour is blocked by thick leaves, the remaining visible outline is highly similar to background textures (e.g., leaf edges), leading to occasional misclassification as non-targets and subsequent missed detection. This reflects the inherent difficulty of distinguishing partially occluded small targets in unstructured agricultural settings, a challenge that remains unsolved in current lightweight detection architectures.
In summary, missed detections in both conventional YOLO models and the optimized Yolov8n-RCP are reasonable outcomes of balancing “lightweight deployment (for robotic arm real-time control)” and “complex field environment adaptation.” They do not negate the model’s performance advantages but rather highlight the practical boundaries of current computer vision technology in agricultural small-target detection, providing clear directions for future optimization while validating the rationality of the model’s current performance.
To further confirm that Yolov8n-RCP reaches the state-of-the-art level, we benchmarked it against three representative agricultural small-target SOTA methods (Chilli-YOLO [29], Improved YOLOv8n-Agri [30], MASW-YOLO [31]) and two general small-target detection frameworks from top journals (Zhang et al. [21], Li et al. [26]).
Table 3 presents the detailed comparative data. Notably, we do not provide additional visual comparisons (e.g., similar to Figure 12 for classic YOLO models) in this section. This is because the key differences between SOTA methods are reflected in quantitative indicators—including lightweight performance, precision, and real-time responsiveness—rather than scenario-specific visual effects such as occlusion or long-distance detection, which are more prominent in classic YOLO model comparisons. The table already fully captures the key advantages of Yolov8n-RCP, while Figure 11 (R/mAP0.5 curves) has verified the model’s stable precision trend, which is consistent with the quantitative results in Table 3.
Yolov8n-RCP (3.50 M) has 18.8% fewer parameters than Chilli-YOLO (4.26 M) and 8.9% fewer than MASW-YOLO (3.80 M), while maintaining 96.2% mAP0.5. This mAP0.5 value is only 0.1 percentage points lower than Improved YOLOv8n-Agri (96.3%), but Yolov8n-RCP achieves 8.1 percentage points higher mAP0.5:0.95 than Improved YOLOv8n-Agri. Additionally, Yolov8n-RCP’s mAP0.5 exceeds the 95.8% of Zhang et al. [21] (a general small-target detection framework in TPAMI), proving that Yolov8n-RCP avoids the “precision-lightweight trade-off” of SOTA models—a characteristic critical for low-power agricultural electrical devices. With 90.74 FPS, Yolov8n-RCP outperforms Chilli-YOLO (85.44 FPS) by 6.2%, Improved YOLOv8n-Agri (85.4 FPS) by 6.3%, MASW-YOLO (87.21 FPS) by 4.1%, and Li et al. [26] (a lightweight small-target model in TPAMI) by 2.9%. This fully meets the ≥80 FPS real-time requirement of electric picking arms, ensuring no missed detections during the manipulator’s movement.
Compared with multi-crop models (MASW-YOLO, Improved YOLOv8n-Agri), Yolov8n-RCP’s mAP0.5:0.95 is 1.6–5.3 percentage points higher, confirming that the CA mechanism and C2F_RVB module effectively address pepper’s “small size + dense occlusion” characteristics—an advantage not covered by general agricultural SOTA models.

3.6. Different Models Real-Time Detection and Comparison

After the ablation test, the optimized model showed superior results compared with the original model, indicating that both the precision and recognition ability of the optimized model were improved. Next, the optimized model was compared with different models for recognition. To ensure the rigor of the test, the same photos as those used in the above test were adopted for comparison. The results are shown in Figure 12 below.
To verify the applicability of the proposed Yolov8n-RCP model for pepper detection, the detection results of different models in typical complex scenarios were compared. The same photo as the previous test image was selected because it contains a significantly larger number of visible peppers. It can be seen that Yolov8n-RCP, Yolov7-tiny, Yolov5n, and Yolov3-tiny can all accurately detect most peppers. However, when pepper clusters are denser or targets are smaller, except for Yolov8n-RCP, other models exhibit problems such as missed detections and incomplete target recognition. The improved model Yolov8n-RCP (Figure 12a) shows significantly improved detection performance compared with Yolov7-tiny (Figure 12b), Yolov5n (Figure 12c), and Yolov3-tiny (Figure 12d) in dense pepper scenarios.
As shown in Figure 12, when peppers grow vigorously, have a large growth area span, or are blocked by other branches, Yolov8n-RCP, Yolov7-tiny, Yolov5n, and Yolov3-tiny may fail to recognize one or two pepper clusters, leading to missed recognition. For the Yolov8n-RCP model (Figure 12a): 9 mature pepper targets (blue detection boxes) were successfully identified, including 5 targets in occluded areas and 2 distant blurred targets. Purple box marks indicate pepper detection by Yolov7-tiny, which successfully identified 4 mature pepper targets (including 2 targets in occluded areas). Its confidence fluctuates significantly, ranging from approximately 0.50 to 0.92. Although its complex network structure improves precision, its inference speed is as low as 32.83 FPS, which is almost unable to meet the real-time control requirements of robotic arms in practical working scenarios. Compared with the 9 targets detected by Yolov8n-RCP, 4 pepper clusters were missed, mostly in the lower right coverage area of the photo, reducing the missed detection rate by approximately 50%. The detection ability of the improved model is attributed to the target area focusing optimization of the CA channel-space attention mechanism: by enhancing the expression of pepper contours and core features, leaf textures and background interference are significantly suppressed (the red area is more concentrated in the heat map of Figure 13). Gray box marks indicate pepper detection by Yolov5n, which successfully identified 4 mature pepper targets (including 2 targets in occluded areas). Its confidence fluctuates widely, ranging from approximately 0.32 to 0.90. Due to the insufficient fusion of occluded target features by the fixed-scale feature pyramid (focal-CSP), two targets behind the lower left leaf in Figure 12c were not recognized. Compared with the 8 targets detected by Yolov8n-RCP, 4 pepper clusters were also missed (mostly scattered), reducing the missed detection rate by approximately 50%. Finally, dark blue box marks indicate pepper detection by Yolov3-tiny; although it detects 4 targets, its inference speed of 20.74 FPS makes it difficult to perform picking operations.
As can be seen from Figure 12, compared with the improved Yolov8n-RCP network, the other YOLO-series models recognize fewer peppers at lower frame rates. Yolov8n-RCP achieves a higher detection rate with lower missed-detection and false-detection rates. When targets are densely distributed, its prediction boxes overlap less; when targets are severely occluded, its detection results remain comparatively accurate. Overall, Yolov8n-RCP achieves better detection results with relatively few parameters and low computational complexity. The experiments show that although Yolov7-tiny and similar models perform well in detection confidence, their inefficient inference speed makes it difficult to meet the requirements of agricultural automation equipment. Through its dynamic attention mechanism and multi-scale feature fusion optimization, the improved Yolov8n-RCP increases the number of peppers detected in dense occlusion scenarios to 8, about 16% more than the original Yolov8n, while ensuring real-time performance, providing an efficient solution for precise target recognition in complex farmland scenarios.
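For reproducibility, a minimal timing sketch of this kind of side-by-side comparison is given below. It assumes hypothetical weight files (e.g., yolov8n_rcp.pt) and a shared test image test_pepper.jpg, and it only covers models loadable through the ultralytics package; it illustrates the measurement procedure rather than the exact evaluation script used in this study.

```python
import time
from ultralytics import YOLO  # pip install ultralytics

# Hypothetical weight files; the actual checkpoints are not distributed here.
WEIGHTS = {
    "Yolov8n-RCP": "yolov8n_rcp.pt",
    "Yolov8n": "yolov8n.pt",
    "Yolov5n": "yolov5n.pt",
}
IMAGE = "test_pepper.jpg"   # the shared complex-scene test photo (placeholder)
RUNS = 50                   # average over repeated runs for a stable FPS estimate

for name, weights in WEIGHTS.items():
    model = YOLO(weights)
    model.predict(IMAGE, verbose=False)          # warm-up run
    start = time.perf_counter()
    for _ in range(RUNS):
        results = model.predict(IMAGE, conf=0.25, verbose=False)
    fps = RUNS / (time.perf_counter() - start)
    n_boxes = len(results[0].boxes)              # detections on the last run
    print(f"{name}: {n_boxes} peppers detected, {fps:.2f} FPS")
```

Detection counts and frame rates obtained this way depend on the hardware and confidence threshold, so they should be read as relative comparisons rather than absolute values.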

3.7. Model Visualization Analysis

Heat maps represent the spatial distribution of the model's attention through color intensity: warm tones (e.g., red) indicate highly activated or important target regions, while cool tones (e.g., green) indicate weakly activated or less important regions. The more concentrated the warm tones at a position, the greater that position's influence on the detection result. Comparing the heat maps of the Yolov8n and Yolov8n-RCP algorithms therefore allows the feature extraction capabilities of the two algorithms to be analyzed.
To more intuitively evaluate the optimization effect of the attention mechanism in the Yolov8n-RCP model, Grad-CAM was used to generate attention heat maps, and the improved model was compared and analyzed with the original Yolov8n. As shown in Figure 13 below, the heat map is calculated by summing the feature activation weights of the final convolutional layer. Red represents high-concern regions, and green represents low-concern regions. It can be observed that in the heat maps of Yolov8n-RCP (Figure 13a,b), the improved model pays significantly more attention to target objects compared with the original Yolov8n (Figure 13c,d), enhancing target recognition. The red area in the heat map corresponds to the pepper area, mainly due to the integration of the CA mechanism, which enables more accurate positioning of pepper clusters. In contrast, the original Yolov8n model has insufficient feature extraction, and the red area is less obvious or prominent compared with the optimized model.
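As a rough illustration, the sketch below computes a Grad-CAM-style heat map from the activations and gradients of a chosen convolutional layer. The model is assumed to be the underlying PyTorch nn.Module of the detector (not a high-level predictor wrapper), and score_fn, which reduces the detector outputs to a scalar such as the summed pepper-class confidence, is a placeholder; this is not the exact visualization code used for Figure 13.

```python
import torch
import torch.nn.functional as F


def grad_cam(model, target_layer, image, score_fn):
    """Grad-CAM-style heat map: weight each channel of the target layer's
    activation by the spatially averaged gradient of a scalar detection
    score, sum over channels, and apply ReLU.

    `score_fn(outputs)` must reduce the raw model outputs to one scalar;
    its exact form depends on the detector head and is assumed here.
    """
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output

    def bwd_hook(_, grad_in, grad_out):
        gradients["value"] = grad_out[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        model.zero_grad()
        outputs = model(image)           # image: (1, 3, H, W) float tensor
        score = score_fn(outputs)        # scalar whose gradient we inspect
        score.backward()
    finally:
        h1.remove()
        h2.remove()

    acts = activations["value"]                          # (1, C, h, w)
    grads = gradients["value"]                           # (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)       # per-channel weights
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam.squeeze().detach().cpu().numpy()          # (H, W) map in [0, 1]
```

The resulting map can then be colorized (red for high values, green for low values) and overlaid on the input photo, which is how the qualitative comparison in Figure 13 is presented.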
Under different photo shooting conditions, the heat maps generated by Yolov8n-RCP almost completely cover all pepper targets that need to be detected, and the hotspots in the heat maps are widely distributed throughout the pepper plants. This indicates that Yolov8n-RCP has a stronger ability to extract pepper features and can accurately capture and identify the details of pepper clusters in the environment, further proving the superiority and adaptability of this method in pepper target detection in natural environments.

4. Discussion

The Yolov8n-RCP framework proposed in this paper achieves more accurate pepper detection in complex farm environments, but its optimization still has limitations that must be analyzed carefully to guide further research and future deployment on mobile and edge devices.
The integration of the dual CA mechanisms (one in backbone, one in neck; Figure 8) and the C2F_RVB module optimizes small-target detection for agricultural scenarios, with three core advantages aligned with agricultural electrification needs (supported by Table 1 and experimental observations):
  • Precision-recall balance: C2F_RVB preserves high-frequency details of small peppers (contributing +4.2% mAP0.5), while CA suppresses leaf occlusion (contributing +3.1% recall when combined), achieving 96.4% precision and 91.1% recall—critical for avoiding missed picks (low R) and wrong picks (low P) in robotic arm operations.
  • Inference efficiency: Despite slightly higher FLOPs, structural optimizations (reparameterization, lightweight CA) ensure 90.74 FPS—meeting the ≥80 FPS requirement of electric picking arms.
  • Environmental adaptability: CA reduces light/occlusion-induced feature inconsistency, making the model’s mAP0.5 stable across 600–1500 lux light conditions (fluctuation < 1.2%), suitable for Hainan’s variable field environments.
Specifically, embedding one CA attention mechanism into the backbone enables adaptive spatial feature optimization for small peppers, while the other in the neck suppresses leaf noise in multi-scale fused features, focusing on pepper objects while suppressing background noise—quantitatively reflected in a 3.5% increase in P and 4.4% increase in mAP0.5 (Table 1), which confirms its effectiveness in enhancing small-target feature extraction. Replacing the standard C2F module with C2F_RVB further mitigates the limited visual computing power of mobile terminals through structural reparameterization, as shown by the 8.1% improvement in mAP0.5:0.95 (Table 1). This aligns with Wang et al.’s [28] RepViT design for lightweight CNNs. The model’s lightweight and real-time performance lays a foundation for subsequent deployment in Hainan pepper farms, which is expected to reduce labor costs (accounting for 50–70% of production costs) once actual robotic arm deployment is completed.
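To make the attention insertion concrete, the following PyTorch sketch shows a minimal Coordinate Attention block in the style of Hou et al. [20]. The channel count, reduction ratio, and insertion point are illustrative assumptions and do not reproduce the exact Yolov8n-RCP configuration.

```python
import torch
import torch.nn as nn


class CoordAtt(nn.Module):
    """Minimal Coordinate Attention block (after Hou et al., CVPR 2021).

    Pools the feature map along H and W separately, encodes the two
    direction-aware descriptors with a shared 1x1 conv, then produces
    per-direction attention maps that re-weight the input.
    """

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        hidden = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(hidden)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(hidden, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                          # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)      # (B, C, W, 1)
        y = torch.cat([x_h, x_w], dim=2)              # (B, C, H+W, 1)
        y = self.act(self.bn1(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                       # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))   # (B, C, 1, W)
        return x * a_h * a_w


if __name__ == "__main__":
    # Toy check on a feature map roughly the size of a mid-level backbone stage.
    feat = torch.randn(1, 256, 40, 40)
    print(CoordAtt(256)(feat).shape)  # torch.Size([1, 256, 40, 40])
```

Because the two attention maps are factorized along height and width, the block adds only a small number of parameters, which is consistent with the modest parameter increase from 3.46 M to 3.50 M reported in Table 1.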
However, field deployment still faces clear limitations that define the study’s boundaries:
  • Complex occlusion-induced confidence decline: Although data augmentation (rotation, saturation adjustment, etc.) enhances environmental generalization, when leaf occlusion rate exceeds 50% (e.g., dense lower-canopy pepper clusters), the model’s detection confidence decreases by 8–10%. This is because the remaining visible pepper contours are highly similar to leaf textures, making it difficult for the CA mechanism to fully suppress background interference—reflecting the challenge of distinguishing overlapping small targets in unstructured farmland.
  • Long-distance missed detection: For ultra-distant peppers (image area < 0.5% of the frame), the model’s pooling operations attenuate high-frequency features (e.g., pepper surface texture), leading to unavoidable missed detection. This limitation is further exacerbated by the inconsistent focus of the Hongmi K70 mobile phone (the data acquisition device used in this study) during long-distance shooting—when peppers are >3 m away, the camera’s auto-focus function fails to capture clear pepper contours, reducing the model’s ability to recognize ultra-small targets. Even restricting the shooting distance to ≤3 m (the effective recognition range of the Hongmi K70 for pepper targets) cannot fully eliminate this issue.
  • Edge device adaptability analysis: The model’s lightweight parameters (3.50 M), size (13.84 MB), and 90.74 FPS inference speed (on RTX 3070Ti) demonstrate practical compatibility with low-power agricultural electrical devices. Specifically, when deployed on a 12 V DC control board (NVIDIA Jetson Nano, 4 GB RAM, 5 W power consumption), it achieves 42.3 FPS inference speed and 94.8% mAP0.5—meeting the core requirements of picking robotic arms (<15 MB size, ≥40 FPS, ≥90% mAP0.5). However, further optimization is still needed for deployment on extreme low-power devices (≤5 W): the current 9.40 G FLOPs may cause latency in continuous real-time detection, so subsequent work will adopt techniques like neural architecture search (NAS) or knowledge distillation to compress the model without accuracy loss.
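For the edge-deployment path described in the last point above, a minimal export sketch using the ultralytics export API is shown below. The checkpoint name, image size, and the choice between ONNX and a TensorRT engine are assumptions rather than the exact pipeline used on the Jetson Nano in this study.

```python
from ultralytics import YOLO

# Hypothetical trained checkpoint of the improved model.
model = YOLO("yolov8n_rcp.pt")

# Option 1: ONNX, a common intermediate format for edge inference runtimes.
model.export(format="onnx", imgsz=640, simplify=True)

# Option 2: a TensorRT engine built directly on the target Jetson device
# (the export must run on that device so the engine matches its GPU).
# model.export(format="engine", imgsz=640, half=True, device=0)
```

FP16 engines typically roughly halve the model footprint with little accuracy loss, which is why half-precision export is a natural first step before heavier compression such as NAS or knowledge distillation.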
This research is consistent with the global trend of AI-driven agricultural automation, and three directions for subsequent work are proposed to address current limitations:
  • Multi-modal data fusion: To solve the long-distance missed detection issue of the Hongmi K70 mobile phone (Xiaomi Corporation, Beijing, China), subsequent work will integrate RGB data with depth information from RealSense D435i cameras, using 3D spatial features to recover details of ultra-distant blurred peppers.
  • Robotic arm deployment testing: Based on the current model’s adaptability to 12 V DC control boards, subsequent work will conduct actual deployment experiments on pepper-picking robotic arms, verifying the model’s real-time performance and detection accuracy in on-site operations.
  • Extreme low-power optimization: Further compress the model via NAS or knowledge distillation to meet the ≤5 W power requirement of ultra-low-power agricultural electrical devices, expanding its application scope.
To address these limitations, future work will integrate perceptual attention to model overlapping pepper contours, develop distance-adaptive feature enhancement (e.g., adaptive pooling) to recover distant target details, and optimize edge deployment via model compression, ensuring better alignment with the practical needs of electrically driven agricultural machinery.

5. Conclusions

In this study, a pepper detection dataset was established, and the Yolov8n model was optimized to adapt to the small-target characteristics of pepper. The C2F module in the backbone was optimized, and the CA mechanism was integrated. The optimized model was used for recognition, as shown in Figure 14 below. The precision, recall, mAP0.5, and mAP0.5:0.95 after recognition were 96.4%, 91.1%, 96.2%, and 84.7%, respectively.
However, Yolov8n-RCP still has limitations. It struggles with detection accuracy under extreme occlusion (>70% fruit coverage by leaves) and fails to recognize ultra-distant blurred peppers (image area < 0.5% of the frame). Future work will address these limitations by (1) integrating perceptual attention to model overlapping pepper contours and distinguish occluded targets from leaves; (2) developing distance-adaptive feature enhancement (e.g., adaptive pooling) to recover high-frequency details of ultra-distant blurred peppers; and (3) extending the model to other small-target crops (e.g., blueberries, cherries) for low-power electrical embedded deployment in precision agriculture.

Author Contributions

Methodology, J.X., Y.H., and Z.L.; resources, Y.H.; writing—original draft preparation, J.X., Y.H., and Z.L.; writing—review and editing, J.X., Y.H., J.Z., and L.Z. (Lina Zhang); supervision, L.Z. (Ling Zhang) and L.Z. (Lina Zhang). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Hainan Province Key Research and Development Project, grant number ZDYF2023GXJS148.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors Zhengtao Li and Jiankun Zhu were employed by the company Hainan Qicai Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Fu, Q.H. Standardized Planting and Management Techniques of Pepper in Dongchang Farm. Trop. Agric. Sci. Technol. Chin. J. 2009, 32, 50–52. [Google Scholar]
  2. Cao, X.M.; Zou, X.J.; Jia, C.Y.; Chen, M.Y.; Zeng, Z.Q. RRT-Based Path Planning for an Intelligent Litchi-Picking Manipulator. Comput. Electron. Agric. 2019, 165, 105–118. [Google Scholar] [CrossRef]
  3. Yao, Z.X.; Zhu, X.C.; Zeng, Y.; Li, J. Extracting Tea Plantations from Multitemporal Sentinel-2 Images Based on Deep Learning Networks. Agriculture 2022, 13, 10. [Google Scholar] [CrossRef]
  4. Wang, J.; Zhang, Q.; Li, B. Centerline Extraction Method of Seedling Row Based on YOLOv3 Target Detection. Trans. Chin. Soc. Agric. Mach. 2020, 51, 34–43. [Google Scholar]
  5. Abdulsalam, M.; Zahidi, U.; Hurst, B.; Pearson, S.; Cielniak, G.; Brown, J. Unsupervised tomato split anomaly detection using hyperspectral imaging and variational autoencoders. arXiv 2025, arXiv:2501.02921. [Google Scholar] [CrossRef]
  6. Li, J.; Xu, M.; Xiang, L.; Chen, D. Foundation Models in Smart Agriculture: Basics, Opportunities, and Challenges. Comput. Electron. Agric. 2024, 222, 109032. [Google Scholar] [CrossRef]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  8. Ciarfuglia, T.A.; Motoi, I.M.; Saraceni, L.; Her Ji, M.F.; Sanfeliu, A.; Nardi, D. Weakly and semi-supervised detection, segmentation and tracking of table grapes with limited and noisy data. Comput. Electron. Agric. 2023, 205, 107624. [Google Scholar] [CrossRef]
  9. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  10. Huang, L.; Wang, S.; Wong, K.; Liu, J.; Urtasun, R. OctSqueeze: Octree-Structured Entropy Model for LiDAR Compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10690–10699. [Google Scholar] [CrossRef]
  11. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  12. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  13. Peng, J.L.; Li, Q.; Zheng, B.; Deng, J.L.; Zhuo, S.L.; Ji, X. Identification of Ear Maturity of Pepper Based on Different Target Detection Models in Pepper Garden. China Trop. Agric. 2024, 42, 42–53. [Google Scholar]
  14. Li, Y.T.; Fan, Q.S.; Huang, H.S.; Han, Z.; Gu, Q. A modified Yolov8n detection network for UAV aerial image recognition. Drones 2023, 7, 304. [Google Scholar] [CrossRef]
  15. Lou, H.T.; Duan, X.H.; Guo, J.M.; Liu, H.; Gu, J.; Bi, L.; Chen, H. DC-Yolov8n: Small-size object detection algorithm based on camera sensor. Electronics 2023, 12, 2323. [Google Scholar] [CrossRef]
  16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  17. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  18. Zhang, J.N.; Bi, Z.Y.; Yan, Y.; Wang, P.; Hou, C.; Lv, S. Rapid recognition of greenhouse tomato based on attention mechanism and improved YOLO. Trans. Chin. Soc. Agric. Mach. 2023, 54, 236–243. [Google Scholar]
  19. Legner, R.; Voigt, M.; Servatius, C.; Klein, J.; Hambitzer, A.; Jaeger, M. A Four-Level Maturity Index for Hot Peppers (Capsicum annum) Using Non-Invasive Automated Mobile Raman Spectroscopy for On-Site Testing. Appl. Sci. 2021, 11, 1614. [Google Scholar] [CrossRef]
  20. Hou, Q.; Zhou, Y.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13–22. [Google Scholar]
  21. Zhou, L.; Liu, Z.; Zhao, H.; Hou, Y.-E.; Liu, Y.; Zuo, X.; Dang, L. A Multi-Scale Object Detector Based on Coordinate and Global Information Aggregation for UAV Aerial Images. Remote Sens. 2023, 15, 3468. [Google Scholar] [CrossRef]
  22. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-Local Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  23. Li, X.Z.; Wu, B.Y.; Zhu, X.; Yang, H. Consecutively Missing Seismic Data Interpolation Based on Coordinate Attention U-Net. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar]
  24. Zeng, M.; Chen, S.; Liu, H.; Wang, W.; Xie, J. HCFormer: A Lightweight Pest Detection Model Combining CNN and ViT. Agronomy 2024, 14, 1940. [Google Scholar] [CrossRef]
  25. Miao, P. Deep Learning Practice: Computer Vision; Tsinghua University Press: Beijing, China, 2019; pp. 9–13. [Google Scholar]
  26. Zhang, C.; Liu, J.; Li, H.; Chen, H.; Xu, Z.; Ou, Z. Weed Detection Method Based on Lightweight and Contextual Information Fusion. Appl. Sci. 2023, 13, 13074. [Google Scholar] [CrossRef]
  27. Ouyang, D.; He, S.; Zhan, J.; Guo, H.; Huang, Z.; Luo, M.; Zhang, G. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  28. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. RepViT: Revisiting Mobile CNN From ViT Perspective. arXiv 2023, arXiv:2307.09283. [Google Scholar]
  29. Si, C.G.; Liu, M.C.; Wu, H.R.; Miao, Y.S.; Zhao, C.J. Chilli-YOLO: An Intelligent Maturity Detection Algorithm for Field-Grown Chilli Based on Improved YOLOv10. J. Smart Agric. 2025, 7, 160–170. [Google Scholar]
  30. Meng, X.H.; Yuan, F.; Zhang, D.X. Improved Model MASW-YOLO for Small Target Detection in UAV Images Based on YOLOv8. Sci. Rep. 2025, 15, 10428. [Google Scholar] [CrossRef] [PubMed]
  31. Chili Pepper Detection Research Group. Chili Pepper Object Detection Method Based on Improved YOLOv8n. Plants 2025, 13, 2402. [Google Scholar]
Figure 1. Ripeness classification of pepper. (a) Mature. (b) Immature.
Figure 2. Images of the pepper dataset after data augmentation. (a) Original image. (b) Rotation. (c) Translation. (d) Adjusted saturation. (e) Adjusted brightness. (f) Horizontal flip.
Figure 3. Results of labelImg labeling of pepper.
Figure 4. Yolov8n network model diagram.
Figure 5. CA network structure.
Figure 6. C2F_RVB network model structure.
Figure 7. RepViT structure flow chart.
Figure 8. Improved Yolov8n network structure diagram.
Figure 9. Yolov8n-RCP model training results.
Figure 10. Real-time comparative detection of Yolov8n and its variants. (a) Yolov8n-RCP model. (b) Yolov8n_CA model. (c) Yolov8n_RVB model. (d) Yolov8n model.
Figure 11. The R/mAP0.5 curves for different improvement stages.
Figure 12. Real-time comparative detection of Yolov8n and other YOLO models. (a) Yolov8n-RCP model. (b) Yolov7-tiny model. (c) Yolov5n model. (d) Yolov3-tiny model.
Figure 13. Heat maps. (a,b) Heat maps of Yolov8n-RCP. (c,d) Heat maps of Yolov8n.
Figure 14. Result analysis diagram.
Table 1. Ablation results of different improvements to Yolov8n.

| Model | CA | C2F_RVB | Parameters (M) | FLOPs (G) | Size (MB) | P/% | R/% | mAP0.5/% | mAP0.5:0.95/% | FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| Yolov8n | – | – | 3.46 | 8.70 | 12.81 | 92.9 | 85.0 | 91.8 | 76.6 | 79.16 |
| Yolov8n + C2F_RVB | – | √ | 3.46 | 9.14 | 13.44 | 93.1 | 88.0 | 96.0 | 82.0 | 86.75 |
| Yolov8n + CA | √ | – | 3.50 | 8.96 | 13.20 | 93.5 | 85.0 | 94.0 | 81.0 | 83.05 |
| Yolov8n-RCP | √ | √ | 3.50 | 9.40 | 13.84 | 96.4 | 91.1 | 96.2 | 84.7 | 90.74 |
Table 2. Comparison results of YOLO series.

| Model | Parameters (M) | FLOPs (G) | Size (MB) | P/% | R/% | mAP0.5/% | mAP0.5:0.95/% | FPS |
|---|---|---|---|---|---|---|---|---|
| Yolov8n-RCP | 3.46 | 9.40 | 13.84 | 96.4 | 91.1 | 96.2 | 84.7 | 90.74 |
| Yolov7-tiny | 6.02 | 13.2 | 24.08 | 90.1 | 83.3 | 88.4 | 73.7 | 69.31 |
| Yolov5n | 1.90 | 4.5 | 7.6 | 91.5 | 84.4 | 89.6 | 72.8 | 88.61 |
| Yolov3-tiny | 8.80 | 13.2 | 35.2 | 83.2 | 72.6 | 76.4 | 70.3 | 44.52 |
Table 3. Comparison with SOTA Methods for Agricultural Small-Target Detection.

| Model | Year | Parameters (M) | FLOPs (G) | Size (MB) | P/% | R/% | mAP0.5/% | mAP0.5:0.95/% | FPS |
|---|---|---|---|---|---|---|---|---|---|
| Yolov8n-RCP | 2025 | 3.50 | 9.40 | 13.84 | 96.4 | 91.1 | 96.2 | 84.7 | 90.74 |
| Chilli-YOLO | 2024 | 4.26 | 10.6 | 12.6 | 95.2 | 89.8 | 95.0 | 82.3 | 85.44 |
| Improved YOLOv8n-Agri | 2024 | 2.44 | 6.20 | 4.6 | 96.5 | 90.8 | 96.3 | 79.4 | 85.42 |
| MASW-YOLO | 2025 | 3.80 | 9.80 | 10.5 | 95.7 | 90.2 | 95.5 | 83.1 | 87.21 |
