Article

SCEW-YOLOv8 Detection Model and Camera-LiDAR Fusion Positioning System for Whole-Growth-Cycle Management of Cabbage

1 School of Automotive and Traffic Engineering, Jiangsu University, Zhenjiang 212013, China
2 Faculty of Agricultural Engineering, Jiangsu University, Zhenjiang 212013, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2026, 16(7), 3510; https://doi.org/10.3390/app16073510
Submission received: 26 February 2026 / Revised: 31 March 2026 / Accepted: 1 April 2026 / Published: 3 April 2026
(This article belongs to the Section Agricultural Science and Technology)

Abstract

High-precision identification and three-dimensional (3D) positioning of cabbage plants across their entire growth cycle are fundamental prerequisites for automated agricultural management. To overcome field challenges like extreme morphological variations, severe leaf occlusion, and bounding box jitter, we introduce a camera-LiDAR fusion perception system. First, an advanced SCEW-YOLOv8 architecture is proposed, sequentially integrating SPD-Conv downsampling, a C2f-CX global feature enhancement module, an EMA cross-space attention mechanism, and the WIoU v3 loss function. Evaluated on a comprehensive whole-growth-cycle cabbage dataset, the model achieves 95.8% mAP@0.5 and 90.8% recall with a real-time inference speed of 64.2 FPS. Furthermore, a visual semantic-driven camera-LiDAR fusion ranging algorithm is developed. Through rigorous spatiotemporal synchronization and cascaded outlier filtering, the integrated system achieves millimeter-level 3D localization within the typical 1.0–2.0 m operating range of agricultural robots. It maintains a Mean Absolute Error (MAE) of only 1.45 mm in the longitudinal direction at a stable processing throughput of 20 FPS. Compared to traditional pure vision depth estimation, this heterogeneous fusion approach achieves a remarkable 96.3% reduction in spatial positioning error at extended distances, fundamentally eliminating depth degradation caused by complex illumination. Ultimately, this system provides a highly robust, full-cycle geometric perception framework for the autonomous management of open-field green cabbage.

1. Introduction

Cabbage is a widely cultivated leafy economic crop worldwide. Its production mode is rapidly transforming from traditional labor-intensive management to automated operations. The global agricultural sector faces severe labor shortages and rising production costs. Consequently, automated management throughout the cabbage growth cycle has become essential. This automation includes precision pesticide application, variable rate fertilization, growth monitoring, and automatic harvesting. Implementing these technologies is a crucial strategy to enhance industry competitiveness, ensure stable supplies, and promote sustainable agriculture [1,2]. A robust perception system, capable of real-time, high-precision recognition and three-dimensional (3D) spatial positioning of field cabbage plants, serves as the fundamental decision-making basis for all automated equipment. Its accuracy and stability directly impact pesticide use, harvest damage, and overall efficiency [3,4].
Currently, machine vision-based target identification is the mainstream solution for field crop perception. Among these, single-stage detection algorithms represented by the YOLO (You Only Look Once) series are highly prominent. They offer an excellent balance between classification accuracy and inference speed. Consequently, they are widely utilized in agricultural robotic sensing tasks, proving effective in scenarios such as orchard fruit segmentation, weed-crop differentiation, and dynamic growth monitoring [5,6,7,8,9]. However, directly deploying general vision architectures for whole-cycle cabbage management in complex fields faces several technical bottlenecks. First, cabbage exhibits extreme morphological heterogeneity, transforming drastically from scattered seedling leaves to an overlapping canopy at the heading stage, while mutual leaf occlusion between plants frequently leads to omission errors and feature extraction failures. Second, natural field environments introduce drastic illumination fluctuations, cluttered backgrounds, and interference from soil and weeds. These factors introduce substantial noise during feature extraction and severely reduce perception stability. Third, the limited computing power of mobile field robots imposes stringent requirements on the algorithm’s lightweight design and real-time performance [10,11]. Most existing studies focus on the identification of cabbage at a single mature growth stage. There remains a significant lack of research on universal and robust perception algorithms covering the entire growth cycle [1,12,13,14,15,16]. More critically, standard baseline models are highly prone to bounding box jitter during continuous frame analysis. The traditional scheme maps 3D spatial coordinates based on the center point of the bounding box.
This directly transmits the instability of the 2D recognition to the 3D positioning results, making it difficult to meet the stringent millimeter-level accuracy required for automated precision operations [17].
In terms of crop 3D spatial information acquisition, current mainstream solutions are dominated by passive optical perception technologies, such as binocular vision and structured light depth cameras. However, such solutions possess inherent technical defects in outdoor farmland scenarios. Their depth calculation accuracy heavily depends on the stability of ambient illumination and the surface texture characteristics of the target. Field environments frequently present complex illumination, including strong direct sunlight, backlighting, and shadows. Furthermore, cabbage leaves possess reflective and low-texture surfaces. Under these challenging conditions, passive sensors frequently experience missing depth data and value jumps [18,19,20]. Meanwhile, passive ranging schemes based on the triangulation principle exhibit errors that increase significantly with the observation distance. Within the typical 1.0–2.0 m operating range of agricultural robots, this positioning error easily accumulates to the centimeter level. Consequently, it fails to meet the strict accuracy requirements for precision pesticide application and automatic harvesting [21]. In contrast, solid-state Light Detection and Ranging (LiDAR) realizes direct physical ranging based on the time-of-flight method. It possesses primary advantages such as strong resistance to ambient light interference, high ranging accuracy, and consistent precision regardless of the measurement distance [22]. Therefore, deeply fusing the semantic recognition ability of machine vision with the high-precision geometric perception of LiDAR has become one of the most effective paths to break through the bottlenecks of crop perception in complex farmlands [23,24,25,26].
To address these challenges, we developed a camera-LiDAR fusion system based on an improved YOLOv8 model for the real-time 3D positioning of cabbage in complex fields. The theoretical original contributions of this paper are summarized as follows:
  • Theoretical optimization of the recognition architecture: An improved SCEW-YOLOv8 object perception model is designed to overcome the strong morphological heterogeneity and severe leaf occlusion of cabbage. Specifically, it introduces the SPD-Conv space-to-depth mapping to prevent fine-grained feature loss. It also incorporates the C2f-CX module to enhance global contextual modeling via large-kernel convolution. Furthermore, it embeds the EMA cross-space attention mechanism to adaptively suppress background noise, and utilizes the WIoU v3 dynamic non-monotonic loss function to optimize bounding box regression.
  • Theoretical framework for cross-modal 3D perception: A loosely coupled visual semantic-driven 3D fusion ranging algorithm is constructed. It fundamentally resolves the asynchronous spatial-temporal mapping between 2D visual semantics and 3D LiDAR point clouds. Through a two-stage cascaded filtering strategy based on directional spatial constraints and statistical outlier elimination, it establishes a rigorous mathematical framework for millimeter-level positioning. This effectively circumvents the depth degradation problem of traditional stereo-matching vision schemes under extreme field illuminations.
  • Empirical validation and performance breakthrough: Extensive experiments systematically validate the presented theories. The SCEW-YOLOv8 model achieves a remarkable 95.8% mAP@0.5 and 90.8% recall. Furthermore, the fusion positioning system achieves a 1.45 mm mean absolute error (MAE) in the height direction. This represents a 96.3% reduction in spatial positioning error at extended operating distances compared to mainstream pure vision baselines, ultimately providing highly reliable technical support for the automated management of locally dominant green cabbage cultivars.

2. Materials and Methods

2.1. Construction and Preprocessing of Whole-Growth-Cycle Cabbage Image Dataset

In this study, a specialized image dataset covering multiple scenarios and growth stages was constructed for the whole-growth-cycle cabbage recognition task. This provides standardized data support for model training and performance verification. The dataset construction and preprocessing process were divided into three primary stages: field image acquisition, data augmentation, and dataset division.

2.1.1. Field Image Acquisition and Scene Coverage

Test images were collected from a standardized cabbage planting base in Jiangyan District, Taizhou City, Jiangsu Province, China.
The geographical overview of the experimental site and a representative field scenario of heading-stage cabbage under challenging strong direct sunlight are illustrated in Figure 1. The primary cultivar of head cabbage used in the experiment was the dominant local variety. The acquisition period encompassed the entire growth process following transplanting—from the seedling and rosette stages to the heading stage—spanning approximately 60 days. To simulate the real observation conditions of agricultural robots in field operations, images were captured using a mobile terminal device from a typical operating height of 0.8–2.0 m and pitch angles of 0–45°, faithfully reproducing conventional field-operation perspectives. Furthermore, the acquisition process included diverse lighting environments, such as strong sunlight on clear days, diffuse light on cloudy days, and low light in the evening. All original images were captured at a uniform resolution of 3000 × 4000 pixels, ensuring the complete capture of plant morphology and environmental details.
Image acquisition commenced after seedling transplanting. Systematic fixed-point field shooting was conducted every 15 to 20 days, aligning with the cabbage growth rate to document the complete developmental cycle. Ultimately, 2500 original field images were collected and categorized by growth stage: 750 at the seedling stage, 850 at the rosette stage, and 900 at the heading stage. The dataset features various scenarios, including single-plant close-ups, multi-plant groupings, and panoramic views of different planting densities.

2.1.2. Data Augmentation and Dataset Division

To improve the scene diversity of the dataset, enhance the generalization ability of the model, and prevent overfitting during training, this study conducted data preprocessing following a three-step procedure: data supplementation, augmentation optimization, and annotation and division. Field-collected data often lacks extreme scenarios and varied cabbage varieties. To address this deficiency, we selected 600 high-quality images from public agricultural datasets and published research. These images effectively supplemented extreme climate conditions and non-standard perspectives. Combined with the 2500 original field images, this formed a comprehensive basic dataset of 3100 images.
Building upon this foundation, various computer vision techniques were utilized to simulate complex field observation conditions. Specifically, 11 distinct data augmentation techniques were applied. These included photonic and sensor noise simulations (salt-and-pepper, Poisson, and Laplacian noise), motion blur, illumination variations (overexposure and high contrast), and adverse weather simulations (cloudy, rainy, foggy, and snowy conditions). Additionally, CoarseDropout was employed to simulate severe leaf-level occlusions. Visual examples of these applied augmentation techniques are illustrated in Figure 2.
After augmentation, the dataset was expanded to 9300 images. This includes 2800 at the seedling stage, 3100 at the rosette stage, and 3400 at the heading stage. These images comprehensively represent the morphological changes and scene characteristics of cabbage throughout its entire growth cycle.
All augmented images were manually and precisely annotated in YOLO format using the LabelImg tool. Aligned with the production demands of phased management, classification rules were formulated based on the degree of heading. Immature plants in the seedling and rosette stages were uniformly labeled as the unripe category. Conversely, mature plants in the heading stage were labeled as the ripe category, as defined in Figure 3B.
Furthermore, to address the high missed detection rate caused by plant overlapping, special attention was given to precise bounding box annotations. This standard was strictly maintained even under challenging conditions of severe leaf occlusion in natural fields, as illustrated in Figure 3A. This rigorous annotation strategy ensures the model’s robust perception and identification of cabbage across different developmental stages.
Finally, the dataset was randomly divided into training, validation, and test sets at a ratio of 8:1:1. Specifically, 7440 images were allocated to the training set for the iterative updating of model parameters. The validation set comprised 930 images, utilized for hyperparameter tuning and overfitting monitoring during training. The remaining 930 images constituted the test set. This independent, unseen data was reserved exclusively for the quantitative evaluation of final generalization performance and engineering practicability.
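The 8:1:1 split described above can be reproduced with a simple seeded shuffle. The helper below is only a sketch of the procedure (the file-name list is synthetic and the function name is ours), but it yields the same 7440/930/930 counts reported in the text.

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle items deterministically and split into train/val/test by ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

images = [f"cabbage_{i:05d}.jpg" for i in range(9300)]
train, val, test = split_dataset(images)
# 9300 images -> 7440 train / 930 val / 930 test
```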

2.2. Improved SCEW-YOLOv8 Model for Cabbage Perception

The YOLO (You Only Look Once) series represents the most widely used one-stage object detection algorithms in current agricultural real-time perception tasks. Specifically, the lightweight YOLOv8n version achieves an excellent balance between recognition accuracy and inference speed through its decoupled head design and anchor-free paradigm [5]. This architectural efficiency makes it highly suitable for the limited computing power of agricultural mobile robots. In this study, YOLOv8n was selected as the baseline model. Heavier variants (e.g., YOLOv8s/m/l) were excluded because they drastically increase parameter count with only marginal accuracy gains, which would severely compromise the real-time constraints of agricultural edge devices. Consequently, an improved model, SCEW-YOLOv8, was developed to address the technical bottlenecks of whole-growth-cycle cabbage identification in complex field environments.
The architecture was optimized across four key dimensions: downsampling structure, feature extraction, attention mechanism, and loss function. These targeted modifications systematically enhance the model’s perception robustness and bounding box positioning stability. Ultimately, they provide a highly stable two-dimensional semantic benchmark for subsequent three-dimensional positioning. The overall network topology of SCEW-YOLOv8 is illustrated in Figure 4.

2.2.1. Optimization of Space-to-Depth Downsampling Based on SPD-Conv

The downsampling process in convolutional neural networks is a critical step for feature map dimensionality reduction and high-level semantic extraction. Traditional YOLO series models rely heavily on strided convolutions (typically with a stride of 2). However, this operation inherently discards approximately 75% of the spatial pixel information at each downsampling stage. According to digital signal sampling theory [27], while conventional dimensionality reduction is efficient for general objects, it struggles to preserve high-frequency spatial details. It can also introduce aliasing effects when processing extremely small or occluded agricultural targets. For small-sized seedlings or plants heavily obscured by overlapping mature leaves, their fine-grained visual features are highly fragile. Traditional dimensionality reduction easily washes out these weak features in deeper layers, resulting in severe missed detections, especially for distant targets in the field.
To fundamentally resolve this bottleneck, the Space-to-Depth Convolution (SPD-Conv) module was chosen to replace the original strided convolutions across all downsampling stages. The primary rationale for this choice is that SPD-Conv achieves lossless downsampling. Specifically, it avoids pixel discarding entirely by reorganizing spatial pixel information into the channel dimension [28]. This space-to-depth operation completely retains the fine-grained spatial features of the target within the feature channels. Consequently, it enables the model to capture richer contextual details without losing crucial geometric information. The internal architecture of this module is illustrated in Figure 5.
Specifically, given an intermediate feature map $X \in \mathbb{R}^{H \times W \times C}$ and a scale factor $s$ (e.g., $s = 2$), the space-to-depth operation slices $X$ into $s^2$ sub-feature maps. The slicing process can be mathematically formulated as follows:

$$f_{i,j} = X[i::s,\ j::s,\ :], \quad \text{for } i, j \in \{0, 1, \ldots, s-1\}$$

where $i$ and $j$ denote the spatial offset indices. These extracted sub-feature maps are then concatenated along the channel dimension to yield the rearranged intermediate feature map $X_{SPD} \in \mathbb{R}^{\frac{H}{s} \times \frac{W}{s} \times s^2 C}$:

$$X_{SPD} = \mathrm{Concat}(f_{0,0}, f_{1,0}, \ldots, f_{s-1,s-1})$$
Following the space-to-depth mapping, a non-strided convolution layer (typically with a kernel size of 1 × 1 or 3 × 3 and a stride of 1) is applied. This layer learns the fused features and adjusts the channel dimension to the target output size, thereby preserving maximum discriminative information. The detailed forward pass of this module is summarized in Algorithm 1.
Algorithm 1: Forward pass of the SPD-Conv Module
Input: Intermediate feature map $X \in \mathbb{R}^{H \times W \times C}$, scale factor $s$ (default $s = 2$).
Output: Downsampled feature map $Y \in \mathbb{R}^{\frac{H}{s} \times \frac{W}{s} \times C_{out}}$.
1: Initialize an empty list: sub_features = [ ]
2: for i = 0 to s − 1 do
3:   for j = 0 to s − 1 do
4:     ▷ Extract a sub-feature map by strided slicing
5:     f_{i,j} = X[i::s, j::s, :]
6:     Append f_{i,j} to sub_features
7:   end for
8: end for
9: ▷ Concatenate all sub-feature maps along the channel dimension
10: X_SPD = Concatenate(sub_features, dim = channel)
11: ▷ Apply a non-strided convolution to fuse features and adjust channels
12: Y = Conv2D(X_SPD, stride = 1, out_channels = C_out)
13: return Y
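Algorithm 1's lossless rearrangement can be verified numerically. The NumPy sketch below implements only the slicing-and-concatenation step (the trailing non-strided convolution is omitted), so the name `spd_downsample` is ours rather than the model's module name.

```python
import numpy as np

def spd_downsample(x, s=2):
    """Space-to-depth: slice an (H, W, C) map into s^2 sub-maps and stack them
    along the channel axis, giving (H/s, W/s, s^2 * C) with no pixels discarded."""
    subs = [x[i::s, j::s, :] for i in range(s) for j in range(s)]
    return np.concatenate(subs, axis=-1)

x = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
y = spd_downsample(x, s=2)   # shape (2, 2, 12): every input value survives
```

Because the output is a pure rearrangement, the multiset of values in `y` equals that of `x`, which is exactly the "lossless downsampling" property claimed for SPD-Conv.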

2.2.2. Global Feature Extraction Enhancement Based on C2f-CX Module

Following the lossless retention of fine-grained features, the network requires enhanced modeling capabilities for the global contours of cabbage targets. This enhancement is essential for differentiating targets within complex field backgrounds. The original C2f module in YOLOv8n possesses a limited receptive field. Consequently, it struggles to capture the broad geometric contours of the plants. To overcome this limitation, we designed a global feature enhancement module, termed C2f-CX, within the backbone network. This architecture significantly optimizes the original C2f module by integrating the large-kernel convolution concept from ConvNeXt V2 [29]. The detailed structure of this module is illustrated in Figure 6.
Traditional YOLO series models primarily utilize standard 3 × 3 convolutional layers for feature stacking. While this approach maintains computational efficiency, the restricted receptive field hinders the model from distinguishing the global spherical morphology of cabbage targets from complex field backgrounds. Consequently, the network becomes highly vulnerable to interference from shadows and leaf occlusion. To address this limitation, the C2f-CX module introduces a large-kernel convolutional structure within the bottleneck layer. This strategically expands the effective receptive field, thereby strengthening the network’s global modeling capacity for the macroscopic contours of the cabbage heads.
Furthermore, the module integrates the Global Response Normalization (GRN) mechanism [29]. By calculating the global feature responses across channels, this mechanism effectively simulates the competition and inhibition relationships among different features. Specifically, for an input feature map $X \in \mathbb{R}^{H \times W \times C}$, GRN first aggregates the spatial features into a channel-wise response vector $g \in \mathbb{R}^{C}$ using the L2-norm. The response value $g_c$ for the $c$-th channel $X_c$ is calculated as:

$$g_c = \lVert X_c \rVert_2 = \sqrt{\sum_{i=1}^{H} \sum_{j=1}^{W} X_{i,j,c}^{2}}$$

Subsequently, response normalization is performed relative to the average response of all channels, and the original features are calibrated to enhance effective responses and suppress redundant background noise:

$$X_c' = \gamma_c \cdot X_c \cdot \frac{g_c}{\frac{1}{C}\sum_{k=1}^{C} g_k} + \beta_c + X_c$$

where $\gamma_c$ and $\beta_c$ are learnable affine parameters, and the $+X_c$ term forms a residual connection. The sequential data flow of the customized ConvNeXt V2 bottleneck unit inside the C2f-CX module is explicitly summarized in Algorithm 2.
Algorithm 2: Forward pass of the ConvNeXt V2 Bottleneck inside C2f-CX
Input: Feature map $X \in \mathbb{R}^{H \times W \times C}$.
Output: Enhanced feature map $X_{out} \in \mathbb{R}^{H \times W \times C}$.
1: ▷ Large-kernel depthwise convolution for global geometry modeling
2: Y = DepthwiseConv2D(X, kernel_size = 7 × 7)
3: Y = LayerNorm(Y)
4: ▷ Pointwise convolution to expand the channel dimension
5: Y = PointwiseConv2D(Y, expansion_ratio = 4)
6: Y = GELU(Y)
7: ▷ Apply Global Response Normalization (GRN)
8: Y = GRN(Y)
9: ▷ Pointwise convolution to project channels back to the input dimension
10: Y = PointwiseConv2D(Y)
11: ▷ Residual connection to maintain gradient stability
12: X_out = Y + X
13: return X_out
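The GRN step (line 8 of Algorithm 2) follows directly from the two equations above. The NumPy sketch below uses scalar γ and β for brevity (the real layer learns per-channel parameters) and adds a small ε of our own for numerical safety.

```python
import numpy as np

def grn(x, gamma=1.0, beta=0.0, eps=1e-6):
    """Global Response Normalization over an (H, W, C) feature map."""
    g = np.sqrt((x ** 2).sum(axis=(0, 1)))   # per-channel L2 response, shape (C,)
    n = g / (g.mean() + eps)                 # normalize by the mean channel response
    return gamma * x * n + beta + x          # calibrate + residual connection

x = np.ones((4, 4, 8))   # all channels respond equally, so n ≈ 1
y = grn(x)               # hence y ≈ x * 1 + x = 2x
```

The uniform-input case makes the competition mechanism visible: only channels whose response exceeds the cross-channel average get amplified relative to the residual path.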
Furthermore, from the perspective of linear systems and statistical pattern recognition [30], the large-kernel convolutions within the C2f-CX module function as expanded spatial filters. These filters actively extract the essential morphological features of cabbage targets from complex field backgrounds. Consequently, they establish a robust structural foundation for improving target recognition accuracy under varying lighting and occlusion conditions.
It is theoretically crucial to discuss the synergy between the global contour modeling of C2f-CX and the fine-grained feature preservation of the aforementioned SPD-Conv. While their objectives—macro receptive fields versus local details—might superficially appear contradictory, they actually operate collaboratively within a sequential pipeline. SPD-Conv acts as a spatial fidelity guarantor during downsampling. It ensures that the fragile sub-pixel features of small seedlings are not prematurely discarded, thereby providing high-quality inputs for subsequent layers.
To mitigate the potential blurring of small objects and the subsequent degradation of localization accuracy caused by large-kernel convolutions, the C2f-CX module leverages the residual connections (identity shortcuts) inherent in the ConvNeXt V2 bottleneck (as defined in Algorithm 2, line 12). These shortcuts establish a direct bypass for fine-grained local features, allowing them to evade the blurring effects of the large-kernel depthwise convolution. Consequently, the network dynamically routes the macro characteristics of mature cabbages through the large-kernel paths, while simultaneously preserving the precise localization details of seedlings via the residual shortcuts. This ultimately achieves a harmonious multi-scale perception.

2.2.3. Target Feature Focusing Optimization Based on EMA Cross-Space Attention

Following the backbone feature extraction, the network must actively suppress environmental noise, such as weeds and soil, to accurately focus on the effective features of the cabbage targets. Traditional attention mechanisms struggle to fully utilize cross-dimensional interaction information. They also lack sufficient target-locking capabilities in complex backgrounds. To resolve this, we embedded an Efficient Multi-Scale Attention (EMA) module within the feature fusion network. This module significantly improves the model’s focus on cabbage targets through cross-space feature interaction [31]. The topological structure of this module is illustrated in Figure 7.
Traditional attention mechanisms, such as SE and CBAM, primarily focus on channel dimension weight allocation or single-scale spatial feature extraction. Consequently, they struggle to differentiate the target from the background amidst interference from weeds and crop residues that share similar textures with cabbage leaves. In contrast, the EMA module captures spatial information across different scales through parallel sub-networks. It then seamlessly integrates channel features and spatial position information utilizing a cross-space learning mechanism.
Mathematically, the input feature map $X \in \mathbb{R}^{H \times W \times C}$ is first divided into $G$ sub-groups along the channel dimension. For each sub-feature group $X_g \in \mathbb{R}^{(C/G) \times H \times W}$, the 1D parallel branch performs global average pooling along the horizontal ($W$) and vertical ($H$) directions to encode precise positional information. The spatial descriptors for the $c$-th channel at height $h$ and width $w$ can be formulated as:

$$z_h^c(h) = \frac{1}{W}\sum_{i=1}^{W} X_c(h, i), \qquad z_w^c(w) = \frac{1}{H}\sum_{j=1}^{H} X_c(j, w)$$

Simultaneously, the 3D parallel branch utilizes a $3 \times 3$ convolution to capture local multi-scale spatial contexts. To achieve cross-spatial learning without dimensionality reduction, EMA aggregates the outputs of the two parallel branches (denoted as $Z_{1D}$ and $Z_{3D}$) via a matrix dot product interaction. The final attention-weighted output $Y_g$ for the group is calibrated using a Sigmoid activation function:

$$Y_g = X_g \odot \mathrm{Sigmoid}(Z_{1D} \cdot Z_{3D}^{T})$$

where $\odot$ denotes element-wise multiplication. By concatenating the outputs of all $G$ groups, the module retains contextual details to the maximum extent while effectively suppressing background noise. The layer-by-layer forward propagation of this attention mechanism is detailed in Algorithm 3.
Algorithm 3: Forward pass of Efficient Multi-Scale Attention (EMA)
Input: Feature map $X \in \mathbb{R}^{C \times H \times W}$, number of groups $G$.
Output: Attention-calibrated feature map $Y \in \mathbb{R}^{C \times H \times W}$.
1: ▷ Split input into G groups along the channel dimension
2: X_groups = Split(X, num_splits = G, dim = 0)
3: Initialize an empty list: out_groups = [ ]
4: for each X_g in X_groups do
5:   ▷ 1D branch: encode spatial features along H and W
6:   pool_h = AvgPool1D_H(X_g)
7:   pool_w = AvgPool1D_W(X_g)
8:   Z_1D = Concat(pool_h, pool_w)
9:   ▷ 3D branch: capture local context
10:  Z_3D = Conv2D(X_g, kernel_size = 3 × 3)
11:  ▷ Cross-spatial interaction via matrix dot product
12:  attention_weights = Sigmoid(Z_1D · Z_3D^T)
13:  ▷ Re-weight the original grouped features
14:  Y_g = X_g ⊙ attention_weights
15:  Append Y_g to out_groups
16: end for
17: ▷ Concatenate all groups back to the original shape
18: Y = Concatenate(out_groups, dim = 0)
19: return Y
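The grouping-and-re-weighting skeleton of Algorithm 3 can be sketched in a few lines. To keep the example self-contained, the 3×3-conv branch and the matrix dot-product interaction of the real EMA module are replaced here by a broadcastable sum of the two directional pooling maps (a coordinate-attention-style simplification of our own), so this illustrates only the structural pattern, not the published module.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grouped_attention_sketch(x, G=2):
    """Per channel group: build an (H, W) weight map from directional average
    pooling and use it to re-weight that group's features (simplified EMA)."""
    out = []
    for xg in np.split(x, G, axis=0):            # split channels into G groups
        pool_h = xg.mean(axis=2, keepdims=True)  # (C/G, H, 1): pool along W
        pool_w = xg.mean(axis=1, keepdims=True)  # (C/G, 1, W): pool along H
        weights = sigmoid(pool_h + pool_w)       # broadcasts to (C/G, H, W)
        out.append(xg * weights)                 # re-weight the group
    return np.concatenate(out, axis=0)           # restore original channel count

x = np.random.default_rng(0).normal(size=(8, 16, 16))
y = grouped_attention_sketch(x, G=2)
```

Because the sigmoid weights lie in (0, 1), the output preserves the input shape while attenuating every activation, which is the suppression behavior the attention map is meant to provide.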

2.2.4. Bounding Box Regression Loss Optimization Based on WIoU v3

The optimization of the network structure and feature extraction capabilities must be paired with an appropriate loss function. This combination effectively guides the model toward accurate convergence and significantly improves the positioning stability of the bounding boxes in cabbage perception tasks. Cabbage targets often present irregular morphology and severe field occlusion. This leads to insufficient bounding box regression accuracy and poor learning on hard samples. To overcome this, we introduce the Wise-IoU v3 (WIoU v3) loss function in the prediction layer [32,33]. Its dynamic non-monotonic focusing mechanism optimizes regression accuracy, providing a stable 2D geometric benchmark for subsequent 3D mapping.
Unlike traditional monotonic loss functions such as Complete Intersection over Union (CIoU), WIoU v3 incorporates a dispersion-based gradient gain adjustment factor into the standard distance loss and penalty terms. Its overall loss function is defined in Equation (7):
$$\mathcal{L}_{WIoU\,v3} = r \cdot \mathcal{R}_{WIoU} \cdot \mathcal{L}_{IoU}$$

where $\mathcal{R}_{WIoU}$ is the penalty term constrained by the center point distance, and $\mathcal{L}_{IoU}$ is the standard IoU loss. The central component of the formula lies in the gradient gain adjustment factor $r$, whose calculation formula is shown in Equation (8):

$$r = \frac{\beta}{\delta \alpha^{\beta - \delta}}$$
The superiority of this algorithm lies in its construction of a non-monotonic gain curve. This curve achieves dynamic perception of sample quality and adaptive allocation of gradient weights. During the actual training process, the dispersion  β  measures the regression difficulty of the prediction box in real time. For occluded targets and low-quality labeled samples in complex field environments, a large dispersion triggers an automatic reduction in the adjustment factor via the non-monotonic characteristic. This significantly weakens the interference of harmful gradients generated by noisy samples. Conversely, when the sample dispersion falls within a moderate range, the system assigns a larger gradient gain. This mechanism allows the model to precisely focus its learning resources on ordinary samples with high learning potential. This adaptive gradient allocation mechanism accelerates the model’s convergence to the central region of the cabbage. Furthermore, it optimizes the regression robustness of the bounding boxes when facing irregular plant morphology. Ultimately, this lays a solid geometric benchmark for subsequent millimeter-level spatial mapping.
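The non-monotonic behavior of the gain factor $r$ in Equation (8) is easy to check numerically. The α and δ values below are illustrative defaults commonly cited for WIoU v3, not necessarily the ones tuned in this study.

```python
import numpy as np

def wiou_gain(beta, alpha=1.9, delta=3.0):
    """Gradient gain r = beta / (delta * alpha**(beta - delta)) from Equation (8)."""
    return beta / (delta * alpha ** (beta - delta))

betas = np.linspace(0.1, 10.0, 200)   # dispersion (outlier degree) of a prediction
r = wiou_gain(betas)
# r rises for moderate dispersion (hard-but-learnable samples) and then decays
# for very large dispersion (likely noisy or badly occluded samples)
```

The interior peak of this curve is exactly the "focus on ordinary samples, down-weight extreme outliers" mechanism described above.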

2.3. Design of Camera-LiDAR Fusion Perception System

To achieve high-precision three-dimensional spatial positioning of cabbage plants in the field, a multi-sensor fusion perception system based on a monocular industrial camera and a solid-state LiDAR was constructed. The system design encompasses three primary components: hardware integration, spatiotemporal synchronization and calibration of multiple sensors, and the visual semantic-driven fusion ranging algorithm.

2.3.1. Sensor Hardware Selection and System Integration

The image acquisition unit utilizes a HIKROBOT MV-CS020-10UC monocular industrial camera (HIKROBOT, Hangzhou, China). This camera features a high dynamic range and strong anti-light interference characteristics, enabling it to output high-quality, low-noise original images even in complex field lighting. Additionally, its maximum sampling frame rate of 90 fps ensures data continuity during mobile operations, providing essential hardware support for real-time perception.
The depth perception unit employs a Livox Mid-40 solid-state Light Detection and Ranging (LiDAR) sensor (Livox, Shenzhen, China). This device utilizes non-repetitive scanning technology. At a 0.1 s integration time, its point cloud density is equivalent to a traditional 32-line mechanical LiDAR. Extending this integration time achieves nearly 100% field-of-view coverage, effectively avoiding missed detections in dense planting environments. The maximum effective measurement distance of the device is 260 m. It maintains high ranging accuracy, and the error does not increase significantly with the observation distance. This provides highly reliable geometric data support for the 3D coordinate calculation of the cabbage targets. The key parameters of these sensors are detailed in Table 1.
At the hardware integration level, a high-strength stainless steel bracket featuring a two-degree-of-freedom independent adjustment mechanism was designed, as illustrated in Figure 8. This bracket adopts a vertical stacking layout alongside a lateral curved slide rail design. This configuration supports the independent, decoupled adjustment of the pitch angles for both the camera and the LiDAR. This specific design not only ensures a high degree of overlap between the camera’s field of view and the LiDAR point cloud within the target operation area, but it also allows for the dynamic adjustment of the observation angle based on the height of the field plants. Consequently, it effectively mitigates interference from ground weeds and physical occlusion. Finally, the bracket is rigidly connected to the agricultural robot’s operational platform to ensure the strict stability of the sensors’ relative spatial poses throughout bumpy field operations.

2.3.2. Multi-Sensor Time Synchronization Scheme Based on ROS2

We implemented a high-precision time synchronization and spatial calibration scheme to align camera images with LiDAR point clouds. This approach prevents spatiotemporal misalignment and latency caused by the mobile operation of agricultural robots. Direct fusion without synchronization causes severe inter-frame motion errors due to hardware and transmission delays.
To this end, a lightweight dual-node time synchronization module was developed based on the Robot Operating System 2 (ROS2) Humble version. The node interaction logic is illustrated in Figure 9, and the primary implementation consists of two steps:
  • Global clock unified calibration: The original point cloud topic, /livox/lidar, is received through the lidar_time_converter node. The device clock timestamp carried by the LiDAR hardware is uniformly converted to the ROS2 system global clock timestamp. Subsequently, the calibrated point cloud topic, /livox/lidar_converted, is output, effectively eliminating the clock offset between the two sensor types at the source.
  • Inter-frame precise matching: Utilizing the sync_node, a sliding window cache queue is constructed based on the approximate time synchronization strategy of the message_filters library. This enables inter-frame matching between the calibrated point cloud data and the original camera image topic, /image_raw. Finally, the time-synchronized image topic, /sync/image, and the point cloud topic, /sync/point_cloud, are output to the downstream fusion nodes.
Measured in a controlled indoor environment, this synchronization scheme consistently maintains the single-frame synchronization time difference within 10 ms, with a maximum synchronization latency not exceeding 5 ms. This effectively mitigates the perception lag caused by field data transmission fluctuations and establishes a unified temporal benchmark for subsequent camera-LiDAR fusion [24,25].
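The inter-frame matching step performed by sync_node can be sketched in pure Python, independent of the actual message_filters implementation; the class name and buffer size below are illustrative, and the 10 ms tolerance mirrors the synchronization difference reported above:

```python
from collections import deque

class ApproxTimeSync:
    """Minimal sketch of approximate-time pairing between image and
    point-cloud streams (names are illustrative, not the ROS2 API)."""

    def __init__(self, slop_s=0.010, maxlen=30):
        self.slop = slop_s                    # max allowed stamp gap (s)
        self.images = deque(maxlen=maxlen)    # (stamp, payload) cache
        self.clouds = deque(maxlen=maxlen)

    def add_image(self, stamp, payload):
        self.images.append((stamp, payload))
        return self._try_match()

    def add_cloud(self, stamp, payload):
        self.clouds.append((stamp, payload))
        return self._try_match()

    def _try_match(self):
        if not self.images or not self.clouds:
            return None
        # choose the image/cloud pair with the minimum timestamp gap
        gap, (ti, img), (tc, pc) = min(
            ((abs(ti - tc), (ti, img), (tc, pc))
             for ti, img in self.images
             for tc, pc in self.clouds),
            key=lambda item: item[0],
        )
        if gap <= self.slop:
            self.images.remove((ti, img))     # consume the matched pair
            self.clouds.remove((tc, pc))
            return img, pc, gap
        return None
```

In the real system, matched pairs would be republished on /sync/image and /sync/point_cloud; here the matched payloads are simply returned.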

2.3.3. Camera-LiDAR Spatial Joint Calibration Method

To achieve accurate spatial mapping between image pixel coordinates and LiDAR 3D point clouds, a two-stage joint calibration scheme—comprising camera intrinsic calibration and LiDAR-camera extrinsic solution—was adopted. This scheme addresses the spatial misalignment arising from the inconsistent coordinate systems of the two sensors.
  • Camera internal parameter calibration: The camera intrinsic parameters were calibrated using the standard ROS2 camera_calibration functional package, which implements the classic Zhang’s calibration method [34]. This process solved for the intrinsic parameter matrix as well as the radial and tangential distortion coefficients. The calibration setup is illustrated in Figure 10. A checkerboard with an 8 × 6 internal corner specification and a grid size of 30 mm × 30 mm was employed as the calibration target. Twenty-five groups of images, captured at varying distances and pitch/yaw attitudes, were collected in a controlled environment. Through sub-pixel corner extraction and perspective projection geometric constraints, the parameters were solved iteratively. Verification confirmed an average reprojection error of 0.42 pixels, which strictly meets the pixel-level accuracy requirements for field cabbage recognition.
  • Targetless external parameter calibration of camera-LiDAR: Traditional checkerboard calibration is often hindered by crop occlusion and terrain constraints in complex field environments. This makes it difficult to rapidly perform extrinsic calibration before agricultural robots begin operation. To overcome this critical limitation, we adopted a targetless calibration method based on natural edge features. Specifically, the spatial pose between the monocular camera and the Livox Mid-40 LiDAR was solved using the open-source tool livox_camera_calib [35]. Since laser beam divergence inherently causes edge expansion errors, our method simultaneously extracts 3D LiDAR edge features with continuous depth and 2D image edge features using the Canny operator. It then establishes geometric constraints between the laser and image edges. By setting the minimum reprojection residual of the edge points as the optimization objective, the algorithm iteratively solves the rigid transformation extrinsic parameters using the BFGS nonlinear optimization method.
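Both calibration stages are validated through reprojection error. The sketch below illustrates how that error is computed under a pinhole model with radial distortion; the function names are illustrative, and the tangential distortion terms that Zhang's method also estimates are omitted for brevity:

```python
import math

def project_point(Xc, fx, fy, u0, v0, k1=0.0, k2=0.0):
    """Project a 3-D point in the camera frame to pixel coordinates
    using a pinhole + radial-distortion model (illustrative only)."""
    x, y = Xc[0] / Xc[2], Xc[1] / Xc[2]      # normalized image coords
    r2 = x * x + y * y
    d = 1.0 + k1 * r2 + k2 * r2 * r2         # radial distortion factor
    return (fx * x * d + u0, fy * y * d + v0)

def mean_reprojection_error(points_3d, pixels, params):
    """Average pixel distance between projected and observed corners."""
    errs = [math.dist(project_point(X, *params), uv)
            for X, uv in zip(points_3d, pixels)]
    return sum(errs) / len(errs)
```

A calibration is accepted when this average error is small; the paper reports 0.42 pixels for the intrinsic stage and 1.2 pixels RMSE for the targetless extrinsic stage.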
Figure 10. Intrinsic parameter calibration of the industrial camera using a standard 8 × 6 checkerboard. This visualizes the algorithmic process within the ROS framework, where the system continuously extracts sub-pixel corners from 25 groups of images with varying poses to iteratively solve for the camera matrix and lens distortion coefficients.
In the calibration process, the optimization algorithm demonstrated stable convergence after 23 iterations. Both the translation and rotation iteration errors approached zero, with no observable drift in the final results. The root mean square error (RMSE) of reprojection after calibration was 1.2 pixels. This achieves high-precision spatial alignment, which fully meets the spatial mapping requirements for cabbage recognition [24]. The visualization of these calibration results is illustrated in Figure 11.

2.3.4. Visual Semantic-Driven Camera-LiDAR Fusion 3D Ranging Algorithm

Based on the spatiotemporal joint calibration results from Section 2.3.2 and Section 2.3.3, this section introduces a visual semantic-driven camera-LiDAR fusion 3D ranging algorithm. This fusion approach overcomes the inherent limitations of single-sensor perception in complex field environments. While pure visual perception lacks direct depth information and is sensitive to illumination fluctuations, raw LiDAR point clouds lack semantic information, making accurate target classification difficult. By fusing these modalities, the algorithm achieves stable and accurate calculation of 3D coordinates and relative distances for cabbage targets.
Developed within the ROS2 distributed architecture, the algorithm adopts a loosely coupled design consisting of two primary nodes: “semantic detection” and “fusion ranging.” Data interaction between these nodes is maintained through standardized topics. The main process is divided into four distinct stages: visual semantic feature extraction, pixel-LiDAR spatial coordinate mapping, target point cloud fine screening, and 3D coordinate and ranging value calculation. The overall logic and node interaction process of the algorithm are shown in Figure 12.
  • Visual semantic feature extraction of cabbage targets: This stage corresponds to the primary camera_object_detection node. Its fundamental purpose is to provide target-level semantic anchors for the unsemantic LiDAR point clouds. Field environments present significant perception challenges, particularly overlapping leaf occlusion and background weed interference. To address these issues, the improved SCEW-YOLOv8 algorithm performs real-time identification. It receives the time-synchronized image topic, generates the bounding box, and extracts its center pixel coordinates $(u, v)$ as the positioning anchor. Simultaneously, the semantic category, center coordinates, and hardware timestamp are encapsulated into a structured topic. This ensures a strict spatiotemporal correspondence between the semantic data and LiDAR point clouds, establishing a unified benchmark for subsequent fusion.
  • Pixel space-LiDAR space coordinate mapping: This stage corresponds to the central coordinate conversion logic within the camera_lidar_fusion node. Based on the camera intrinsic and extrinsic parameters calibrated in Section 2.3.3, a geometric association between the 2D pixels and 3D LiDAR space is established. First, the target center pixel coordinates are back-projected into the camera coordinate system to obtain the normalized ray vector originating from the camera’s optical center. Subsequently, this ray vector is mapped to the LiDAR coordinate system using the rotation extrinsic matrix. After L2 norm normalization, the target unit direction vector is obtained. Since this step focuses on calculating the spatial ray direction, and the translation extrinsic parameters only shift the coordinate system origin without altering the direction, only the rotation matrix is involved in this specific operation. The fundamental conversion formulas are defined in Equations (9) and (10):
    $$\mathbf{r}_{\mathrm{cam}} = \left[\, \frac{u - u_0}{f_x},\ \frac{v - v_0}{f_y},\ 1 \,\right]^{T} \tag{9}$$
    $$\mathbf{r}_{\mathrm{lidar}} = \frac{R\,\mathbf{r}_{\mathrm{cam}}}{\left\lVert R\,\mathbf{r}_{\mathrm{cam}} \right\rVert_{2}} \tag{10}$$
    where $(f_x, f_y)$ are the equivalent focal lengths along the x and y axes of the camera calibrated in Section 2.3.3, $(u_0, v_0)$ are the principal point pixel coordinates of the camera imaging plane, and $\lVert \cdot \rVert_{2}$ denotes the $L_2$ norm.
  • Fine screening and denoising of target point clouds: To mitigate point cloud noise caused by leaf occlusion and soil clutter, a two-stage cascaded screening strategy was designed to ensure robust ranging. The first stage implements an angular spatial constraint based on the target unit direction vector; only the effective point set with an angular deviation of less than 1° from the vector is retained. The second stage applies statistical outlier filtering using the 1σ (mean ± 1 standard deviation) criterion for real-time processing. By calculating the mean and standard deviation of the ranging values frame-by-frame, outliers falling outside this interval are eliminated. This effectively suppresses ranging fluctuations caused by complex field environments. If the number of effective points after screening is fewer than five, the frame is skipped to prevent statistical bias, ensuring a high-quality resultant point cloud set.
  • Calculation of target 3D coordinates and ranging values: Based on the refined point cloud set, the 3D positioning and ranging of cabbage targets are finalized. First, the average ranging value $\bar{d}$ of the effective point set is calculated and utilized as the final ranging result to enhance robustness. Subsequently, combined with the target unit direction vector $\mathbf{r}_{\mathrm{lidar}}$, the 3D spatial coordinate $P_{\mathrm{lidar}} = [x, y, z]^{T}$ corresponding to the center of the detection box is back-calculated. This relationship is mathematically defined in Equation (11):
    $$P_{\mathrm{lidar}} = \bar{d}\,\mathbf{r}_{\mathrm{lidar}} \tag{11}$$
    where $\bar{d}$ is the average ranging value of the target point cloud set, that is, the linear distance of the cabbage target relative to the LiDAR.
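The core of the four-stage pipeline above can be condensed into a standard-library sketch. It assumes the rotation extrinsic matrix has already been applied (i.e., the direction vector is expressed in the LiDAR frame) and uses toy point clouds; thresholds follow the stated values (1° angular gate, 1σ statistical filter, five-point minimum):

```python
import math

def back_project(u, v, fx, fy, u0, v0):
    """Equation (9): ray through pixel (u, v) in the camera frame."""
    return [(u - u0) / fx, (v - v0) / fy, 1.0]

def normalize(vec):
    n = math.sqrt(sum(c * c for c in vec))
    return [c / n for c in vec]

def fuse_range(points, r_lidar, max_angle_deg=1.0, min_points=5):
    """Cascaded screening and ranging (sketch of Section 2.3.4).

    points  : candidate LiDAR points [x, y, z] in the LiDAR frame.
    r_lidar : unit target direction vector in the LiDAR frame.
    Returns (d_mean, P_lidar) per Equation (11), or None if too few
    inliers survive (frame skipped to avoid statistical bias).
    """
    cos_thr = math.cos(math.radians(max_angle_deg))
    dists = []
    # stage 1: angular constraint around the target ray (< 1 degree)
    for p in points:
        d = math.sqrt(sum(c * c for c in p))
        cos_a = sum(a * b for a, b in zip(p, r_lidar)) / d
        if cos_a >= cos_thr:
            dists.append(d)
    if len(dists) < min_points:
        return None
    # stage 2: 1-sigma statistical outlier filtering on range values
    mu = sum(dists) / len(dists)
    sd = math.sqrt(sum((d - mu) ** 2 for d in dists) / len(dists))
    kept = [d for d in dists if mu - sd <= d <= mu + sd]
    if len(kept) < min_points:
        return None
    d_mean = sum(kept) / len(kept)
    # Equation (11): P_lidar = d_mean * r_lidar
    return d_mean, [d_mean * c for c in r_lidar]
```

Feeding the bounding-box center through back_project and normalize, then passing the result to fuse_range with the synchronized point cloud, yields the target's range and 3D coordinates in one pass per frame.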

2.4. Experimental Setup and Evaluation Metrics

2.4.1. Training Environment and Hyperparameter Configuration

To ensure efficient model training and the real-time performance of the ranging algorithm, a high-performance deep learning computing platform was constructed. The hardware configuration includes an Intel Core i7-13700F CPU, an NVIDIA GeForce RTX 4060 Ti GPU, and 32 GB of RAM. This setup provides sufficient computing power for parallel processing of large-scale datasets and real-time multi-sensor fusion. The software environment is based on the Ubuntu 22.04 LTS operating system, primarily utilizing the PyTorch 2.1.0 framework and Ultralytics YOLOv8 library. Furthermore, ROS2 Humble is integrated to manage the spatiotemporal synchronization of sensor data.
A consistent set of hyperparameters was employed to ensure experimental fairness. The model input resolution was fixed at 640 × 640 pixels. Training utilized the SGD optimizer with a momentum of 0.9 and a weight decay of 0.0005. The initial learning rate was set to 0.01, incorporating a cosine annealing strategy to facilitate smooth convergence. The batch size was 16, with a maximum of 200 training epochs. Additionally, an early stopping mechanism was implemented: training automatically terminated if the validation mAP@0.5 showed no significant improvement for 20 consecutive epochs, effectively preventing overfitting.
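For reference, the hyperparameters above map onto the Ultralytics training API roughly as follows; the argument names follow that API, but exact keys can vary between library versions, and the dataset file name is hypothetical:

```python
# Hyperparameters from Section 2.4.1, expressed as keyword arguments for
# the Ultralytics training API (treat the exact key names as assumptions).
TRAIN_CFG = dict(
    imgsz=640,            # 640 x 640 input resolution
    optimizer="SGD",
    momentum=0.9,
    weight_decay=0.0005,
    lr0=0.01,             # initial learning rate
    cos_lr=True,          # cosine annealing schedule
    batch=16,
    epochs=200,
    patience=20,          # early stop after 20 epochs without mAP gain
)

# Usage (requires the ultralytics package; "cabbage.yaml" is hypothetical):
#   from ultralytics import YOLO
#   YOLO("yolov8n.yaml").train(data="cabbage.yaml", **TRAIN_CFG)
```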

2.4.2. Performance Evaluation Metrics

To systematically verify the performance of the proposed algorithm, a standardized evaluation system was established focusing on two dimensions: cabbage target recognition and 3D coordinate ranging. All indicators align with general evaluation specifications in the field, ensuring the objectivity and comparability of the experimental results.
For target perception performance, Precision (P), Recall (R), mean Average Precision at an IoU threshold of 0.5 (mAP@0.5), and end-to-end inference Frames Per Second (FPS) were selected as key indicators. These metrics respectively quantify classification accuracy, target coverage capability, holistic modeling effectiveness, and real-time processing throughput.
For 3D ranging performance, Mean Absolute Error (MAE) and Standard Deviation (SD) were selected to quantify system accuracy and stability. MAE represents the overall average deviation of the ranging results, serving to evaluate absolute positioning accuracy. SD quantifies the dispersion of the ranging data, verifying the system’s stability during continuous operation. The calculation formula for MAE is defined in Equation (12):
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{12}$$
where $y_i$ is the true distance calibrated by LiDAR, $\hat{y}_i$ is the predicted distance calculated by the algorithm, and $n$ is the total number of test samples.
The SD is calculated as follows:
$$\mathrm{SD} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^{2}}$$
where $x_i$ is the single-frame ranging value of the system, $\bar{x}$ is the average ranging value of the corresponding test group, and $n$ is the total number of test samples.
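Both metrics are direct transcriptions of the formulas above:

```python
import math

def mae(y_true, y_pred):
    """Equation (12): mean absolute ranging error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def sd(values):
    """Population standard deviation of per-frame ranging values."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
```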

3. Results and Analysis

To systematically verify the performance of the developed SCEW-YOLOv8 model and the camera-LiDAR fusion positioning system, this chapter presents an extensive comparative analysis. The evaluation focuses on two primary dimensions: cabbage target detection performance and the positioning accuracy of the fusion ranging system. All experiments were conducted within the unified hardware and software environment described in Section 2.4. Strict single-variable control was maintained throughout to ensure the fairness, repeatability, and objectivity of the results.

3.1. Performance Evaluation of Cabbage Recognition Model

To demonstrate the superiority of the SCEW-YOLOv8 model in whole-growth-cycle cabbage perception, this section evaluates its recognition accuracy, generalization ability, and real-time inference performance. The assessment comprises three components: a horizontal comparison with mainstream baseline models, incremental ablation studies of the enhanced modules, and a visual analysis of the bounding box results. These experiments utilized the self-built whole-growth-cycle cabbage dataset detailed in Section 2.1, partitioned into training, validation, and test sets at a ratio of 8:1:1. The input resolution for all models was unified at 640 × 640 pixels, with identical training strategies and hyperparameters to eliminate irrelevant variables.

3.1.1. Comparative Experiment of Mainstream Object Recognition Models

To quantitatively evaluate the holistic performance of the SCEW-YOLOv8 model in the whole-growth-cycle cabbage identification task, several mainstream frameworks widely used in agricultural vision were selected for horizontal comparison. These include the classic two-stage Faster R-CNN, the single-stage SSD, and various lightweight versions of the YOLO series within the same parameter scale (YOLOv5n, YOLOv8n baseline, YOLOv8s, YOLOv9t, YOLOv10n, and the latest YOLO11n). Additionally, advanced architectures such as RT-DETR-R18 and EfficientDetV2-Lite0 were introduced for an in-depth evaluation. All comparative models utilized standard localization architectures with NMS post-processing and underwent end-to-end training and testing under identical software, hardware, dataset, and training strategies.
The performance comparison results are detailed in Table 2. It is evident that different detection architectures exhibit significant performance variations in this specific agricultural task. Although the two-stage Faster R-CNN achieved a respectable mAP@0.5 of 90.8%, its inference speed of only 13.5 FPS falls short of the real-time requirements for field operations [3]. In contrast, while the single-stage SSD operates at a higher speed, its accuracy (72.2% recall and 80.5% mAP@0.5) is insufficient for robust field deployment, as it remains prone to missed detections in complex scenarios.
The lightweight models of the YOLO series demonstrate a superior balance between accuracy and inference speed, which is critical for the limited computing power of agricultural mobile robots. Among them, the newly introduced YOLO11n showed strong baseline capabilities, achieving 95.1% mAP@0.5 and 67.3 FPS with only 3.5 M parameters. However, the developed SCEW-YOLOv8 model exhibits the most significant advantages in overall performance. In terms of detection accuracy, its mAP@0.5 reaches 95.8%, outperforming the baseline YOLOv8n by 2.3 percentage points and the latest YOLO11n by 0.7 percentage points. Regarding target coverage, the model significantly boosts the recall rate to 90.8%—a 5.1 percentage point increase over the baseline—effectively mitigating omission errors for leaf-occluded and small-sized cabbage targets.
To provide a more intuitive perspective on model efficiency, Figure 13 visualizes the accuracy-efficiency trade-off. As illustrated by the comparative charts, our model achieves the highest accuracy (mAP@0.5) with only a marginal increase in parameters (3.8 M) and a stable frame rate of 64.2 FPS.
In addition to steady-state performance metrics, the dynamic training process was evaluated to verify learning efficiency. The training convergence curves over 200 epochs for the top-performing models are compared in Figure 14. Subfigures (a–d) illustrate the dynamic changes in Precision, Recall, mAP@0.5, and mAP@0.5:0.95, respectively. As depicted, the developed model (red line) demonstrates faster convergence and superior final metrics compared to models like YOLO11n, YOLOv8s, and EfficientDetV2-Lite0, proving its suitability for whole-growth-cycle cabbage perception in natural fields.

3.1.2. Ablation Experiment of Improved Modules

To quantitatively evaluate the individual contribution of each newly integrated module to the SCEW-YOLOv8 architecture, as well as the synergistic adaptability between multiple components, an extensive 16-group incremental ablation experiment was conducted using YOLOv8n as the baseline. By systematically incorporating the SPD-Conv space-to-depth downsampling module, the C2f-CX global feature enhancement module, the EMA cross-space attention mechanism, and the WIoU v3 loss function, the cumulative performance gains were systematically validated. All experiments were performed under identical hyperparameters, datasets, and hardware environments to ensure a rigorous comparative analysis. The detailed results are presented in Table 3.
As evidenced by the experimental results in Table 3, the sequential integration of the enhanced modules yields a steady upward trend across all detection metrics. Their individual functional contributions and synergistic effects are summarized as follows:
  • Feature retention gain of SPD-Conv module: Compared to the baseline (Group 1), Group 2—which exclusively introduces the SPD-Conv module—elevated the recall rate from 85.7% to 88.2% and increased the mAP@0.5 by 0.7 percentage points, while the inference speed decreased by a marginal 0.4 FPS. This demonstrates that replacing traditional strided convolutions with space-to-depth operations effectively mitigates the loss of fine-grained spatial information. This mechanism proves instrumental in detecting small-sized seedlings and leaf-occluded targets.
  • Synergistic effect of feature enhancement of C2f-CX and EMA modules: Building upon the SPD-Conv foundation, configurations that cumulatively integrated the C2f-CX and EMA modules (e.g., Group 12) further increased the mAP@0.5 to 95.3%, with precision and recall rising synchronously. This confirms the efficacy of global large-kernel convolution modeling and cross-space attention within complex backgrounds. These components significantly enhance feature discriminability between cabbage targets and soil/weed interference, strengthen global contour extraction, and reduce false detection rates.
  • Regression accuracy optimization of WIoU v3 loss function: The final configuration (Group 16), integrating all modules including the WIoU v3 loss function, achieved optimal detection performance, peaking at 95.8% mAP@0.5 and a 90.8% recall rate. This validates the crucial role of the dynamic non-monotonic weighting mechanism in optimizing bounding box positioning. It is particularly effective for targets with irregular seedling morphology, blurred edges, and severe field occlusion, as it accelerates the model’s ability to learn from hard samples.
From a computational efficiency perspective, the full integration of these modules (Group 16) resulted in only a marginal decrease in inference speed, dropping from 65.8 FPS to 64.2 FPS. This overall decline of merely 2.4% represents a negligible computational cost, well within the real-time processing capabilities of agricultural robots.
Furthermore, analyzing category-specific recognition throughout the growth cycle reveals that the baseline YOLOv8n struggled with mature cabbage (heading stage), achieving only 90.8% mAP@0.5 due to significant morphological variations and leaf overlap [13,15]. The fully optimized SCEW-YOLOv8 successfully addressed this performance heterogeneity. It not only boosted the mAP@0.5 for immature cabbage (seedlings/rosettes) to 97.5%, but more importantly, it elevated the mAP@0.5 for mature cabbage to 94.3%—a 3.5 percentage point increase over the baseline.
To ensure that the performance gains stem from specific architectural modifications rather than compensatory mechanisms or mere parameter expansion, we cross-verified the results using granular metrics and visual analyses. First, the isolated addition of SPD-Conv predominantly boosted the recall rate (Table 3), which precisely aligns with its theoretical objective of preserving fragile features to mitigate omission errors. Second, the accuracy-efficiency trade-off shown in Figure 13 and Table 2 demonstrates that our optimized model (3.8 M, 95.8% mAP@0.5) outperforms both similar-sized recent models (YOLOv10n, YOLO11n) and the significantly larger YOLOv8s (11.2 M), indicating the enhancements are not driven by generic capacity increases. Finally, the stable and rapid training convergence demonstrated in Figure 14 further precludes stochastic or compensatory artifacts.

3.1.3. Visual Analysis of Inference Results

To intuitively verify the practical efficacy of the SCEW-YOLOv8 model, a visual comparison between the baseline YOLOv8n and our improved architecture was conducted using the same validation set images. The comparative results are illustrated in Figure 15.
The visual evidence clearly demonstrates that the target-focusing performance of the SCEW-YOLOv8 model has been enhanced across multiple dimensions. In conventional occlusion-free scenarios, the classification confidence of the improved model is significantly higher than that of the baseline, with noticeably superior spatial alignment of the bounding boxes. In peripheral image areas and dense planting clusters, the improved model effectively mitigates the omission errors prevalent in the baseline model. Under high-illumination field conditions, redundant bounding boxes (false positives) are significantly reduced, yielding a lower overall false alarm rate. Most notably, in challenging scenarios involving insufficient lighting, severe occlusion, and overlapping leaves, the missed identification rate of the improved model is drastically reduced. These results collectively reflect the exceptional robustness and generalization ability of SCEW-YOLOv8 in complex field environments.

3.2. Performance Analysis of the Fusion Ranging System

To verify the ranging accuracy, positioning stability, and real-time performance of the developed camera-LiDAR fusion algorithm, a precision measurement experimental platform was constructed in a controlled indoor environment for systematic validation.

3.2.1. Ranging Experimental Platform and Test Scheme

The primary motivation for selecting a controlled indoor environment was to isolate and eliminate uncontrollable interference factors, such as ground fluctuations, drastic illumination variations, and plant movement. This approach enables the accurate characterization of the fusion algorithm’s theoretical positioning accuracy, providing a reliable benchmark for subsequent field operations. The experimental setup is illustrated in Figure 16. To obtain high-precision physical ground truth, the optical axes of both the camera and LiDAR were aligned vertically downward toward the ground. A plumb bob was utilized for ground coordinate projection calibration, while a steel tape with a 1 mm minimum scale served as the length benchmark. A portable high-performance computer was employed for real-time data acquisition and computation. The visual interface of the fusion ranging results is presented in Figure 17.
Within the typical operational height range of 1.0–2.0 m for agricultural robots, 20 groups of test distances were established at incremental intervals. For each distance, 20 frames of valid data were collected repeatedly. After filtering out invalid frames to ensure data integrity, the mean value across these observations was recorded as the final detection result to mitigate random single-measurement errors. The three-dimensional coordinates calibrated via the LiDAR and steel tape were utilized as the ground truth, while the coordinates calculated by the fusion algorithm served as the experimental values. The Mean Absolute Error (MAE) and Standard Deviation (SD), as defined in Section 2.4.2, were employed to quantify the positioning accuracy and stability of the system. Additionally, the end-to-end processing frame rate of the integrated system was monitored to verify its real-time operational capability.

3.2.2. Ranging Error Statistics and Performance Analysis

The detection coordinates presented in Table 4 represent the arithmetic mean of 20 frames of valid data for each test distance. Correspondingly, the overall MAE and SD indicators were calculated based on the complete set of 400 raw data frames (20 groups × 20 frames) to fully characterize the system’s performance during continuous operation. The original measurement data and statistical error results for the 20 variable-distance experiments are summarized in Table 4. Based on the aggregate analysis, the fusion perception system demonstrates exceptional positioning robustness and millimeter-level accuracy within the LiDAR coordinate system.
Experimental results indicate that, leveraging the superior physical ranging capabilities of the Livox Mid-40 LiDAR, the system achieves extremely high accuracy in the longitudinal direction (X-axis), with a Mean Absolute Error (MAE) of only 1.45 mm. The maximum absolute error across the full operating range is strictly maintained within 5 mm. In comparison, the MAEs for the lateral operation plane (Y and Z axes) are 2.85 mm and 2.95 mm, respectively. These metrics remain within the millimeter-level threshold, fully satisfying the precision requirements for targeted spraying and autonomous harvesting across the entire cabbage growth cycle. Regarding stability, the Standard Deviation (SD) for the X-axis remains stable at approximately 2.5 mm, while the Y and Z axes exhibit SD values of 3.0 mm and 2.8 mm, respectively. This indicates a lack of significant data fluctuation across the detection range, ensuring high consistency. Furthermore, the end-to-end processing rate of the integrated fusion system remains stable at 20 FPS, effectively meeting the real-time requirements for low-speed agricultural robotic operations.
The system is stable and accurate, with consistent linear performance in the 1.0–2.0 m range. Unlike commercial binocular cameras, whose errors increase rapidly with distance due to resolution limits, our system maintains a near-constant error profile. A typical accuracy index of 1.5% at the measured distance implies that a binocular system’s theoretical error would surge to approximately 30 mm at a 2.0 m distance. Moreover, traditional vision is highly susceptible to ambient light fluctuations, leaf reflections, and texture-less surfaces, which further degrade actual field MAE [18,19,21].
Conversely, the heterogeneous fusion scheme—combining monocular visual semantic guidance with direct LiDAR ranging—replaces traditional parallax-based depth estimation with direct Time-of-Flight (ToF) measurement. This fundamentally eliminates the impact of ambient light interference and texture loss on depth accuracy [22,23]. The system maintains millimeter-level average error and stable data dispersion throughout the full operating range, without significant deterioration as distance increases. This overcomes the inherent robustness defects of traditional vision-based depth schemes in complex lighting, providing highly reliable spatial data for precision operations within a large-span workspace.

4. Conclusions and Discussion

This paper introduces a camera-LiDAR fusion system, based on the SCEW-YOLOv8 model, for high-precision 3D positioning of cabbage plants throughout their growth cycle. Experimental results demonstrate that the SCEW-YOLOv8 model achieves a 95.8% mAP@0.5 and a 90.8% recall rate on the self-built whole-growth-cycle cabbage dataset, with an inference speed of 64.2 FPS. This architecture effectively resolves critical agricultural challenges, including strong morphological heterogeneity, high omission rates caused by leaf occlusion, and positioning deviations stemming from bounding box jitter. The integrated positioning system achieves millimeter-level 3D accuracy within the typical 1.0–2.0 m operating range of agricultural robots, maintaining a Mean Absolute Error (MAE) of only 1.45 mm in the longitudinal direction and a stable processing rate of 20 FPS. These metrics fully satisfy the precision and real-time requirements for autonomous field operations.
Compared to existing schemes that often focus on a single growth stage or rely on pure vision-based depth estimation, the proposed system demonstrates superior full-cycle adaptability and environmental robustness. The SCEW-YOLOv8 model achieves balanced, high-precision recognition from the seedling to the heading stages. To further elucidate the practical value of the system, Table 5 presents a quantitative comparison of performance and hardware costs between our camera-LiDAR fusion approach and traditional pure vision schemes.
It is worth acknowledging that pure stereo or RGB-D vision schemes (e.g., typical depth cameras such as the Intel RealSense D435i or D455) offer high cost-effectiveness and acceptable precision at near-field ranges. According to extensive field evaluations [36,37,38], these cameras can achieve a Root Mean Square Error (RMSE) of 10–30 mm within a 1.0 m range. However, to ensure a sufficient global field of view for look-ahead trajectory planning and to accommodate the physical dimensions of agricultural machinery, environmental perception sensors are typically required to cover an extended operating distance of 1.0 to 2.0 m. At this mid-range, the depth error of pure vision systems grows quadratically with distance, a direct consequence of the triangulation principle, and their performance depends heavily on ambient illumination and target texture. Existing studies show that at distances of 1.0–1.5 m, the RMSE of cameras such as the D435i rises rapidly to 30–50 mm [39], and the overall depth RMSE of vision systems can exceed 100 mm as the distance approaches 2.0–3.0 m [40]. Such centimeter-to-decimeter deviations fail to meet the stringent accuracy requirements of automated cabbage management.
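The quadratic growth follows directly from the stereo depth equation Z = fB/d: a fixed disparity error δd maps to a depth error δZ ≈ Z²·δd/(fB). The sketch below uses assumed round-number parameters (focal length, baseline, and disparity error are illustrative, not the specifications of any particular camera):

```python
# Why stereo depth error grows quadratically with range.
# From Z = f*B/d, an error delta_d in disparity gives dZ ~= Z^2 * delta_d / (f*B).
# f_px, baseline_m, and disparity_err_px are assumed illustrative values.
def stereo_depth_error(Z, f_px=600.0, baseline_m=0.05, disparity_err_px=0.1):
    """Approximate depth error (m) at range Z (m) for a stereo pair."""
    return Z**2 * disparity_err_px / (f_px * baseline_m)

for Z in (1.0, 1.5, 2.0):
    print(f"{Z:.1f} m -> {stereo_depth_error(Z) * 1000:.1f} mm")
# Doubling the range from 1.0 m to 2.0 m quadruples the depth error.
```

Whatever the exact sensor constants, the Z² factor is inescapable for triangulation, whereas a ToF measurement carries no such range dependence.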
In contrast, the heterogeneous fusion scheme of “visual semantic guidance + LiDAR direct ranging” presented in this study fundamentally overcomes these inherent defects. By leveraging the Time-of-Flight (ToF) principle, our system achieves highly stable millimeter-level positioning accuracy (MAE: 1.45 mm, SD: ~2.8 mm) regardless of distance or illumination. Quantitatively, although this heterogeneous system incurs a total hardware cost of $853.26—representing an approximately 113.3% increase in initial investment compared to mainstream pure vision baselines (typically costing $350–$540)—it achieves a remarkable 96.3% reduction in spatial positioning error at extended operating distances (calculated by reducing a representative baseline error of 40 mm down to 1.45 mm MAE).
Crucially, this cost increment should be viewed from a long-term developmental perspective. Decades ago, high-quality digital cameras were prohibitively expensive and restricted to specialized professional fields; today, driven by mass production and manufacturing advancements, they have become ubiquitous, low-cost components. Solid-state LiDAR is currently following a highly similar trajectory. Compared to a decade ago, when early LiDAR systems cost tens of thousands of dollars, rapid industrialization has already significantly driven down the price of devices like the Livox Mid-40. As the autonomous driving and agricultural robotics supply chains mature, the cost of LiDAR sensors will inevitably experience further dramatic declines. Therefore, the “camera + LiDAR” fusion paradigm is not a cost burden but an inevitable future trend. This synergy enables the collaborative optimization of 2D detection accuracy and 3D positioning stability, providing a highly reliable perception framework for the automated management of open-field vegetables.
To provide practical deployment guidelines for the SCEW-YOLOv8 model, its application scope and boundary conditions must be explicitly defined. Regarding the application scope, the architectural modifications—specifically SPD-Conv and C2f-CX—render the model highly effective for open-field leafy crops characterized by severe leaf occlusion, significant morphological variations across growth stages, and volatile lighting conditions. However, several boundary conditions and limitations persist.
First, while the C2f-CX module significantly enhances global contour modeling, it fundamentally relies on spatial texture integrity; consequently, its feature extraction capability may noticeably degrade under extreme motion blur caused by high-speed robotic operations. Second, the target-focusing ability provided by the EMA module may encounter bottlenecks in unmanaged fields with exceptionally dense weed infestations, particularly when the surrounding weeds share near-identical spectral and textural signatures with cabbage leaves. Finally, regarding baseline selection, although iterations such as YOLOv9, YOLOv10, and YOLO11 have recently emerged, YOLOv8n was selected as the foundational framework due to its mature edge-deployment ecosystem (e.g., TensorRT and ROS2 compatibility), which is essential for agricultural robotics. Our experiments show that targeted improvements on a stable architecture outperform the latest general baselines (e.g., YOLO11n), validating the need for domain-specific optimization.
The SCEW-YOLOv8 architecture is not just a collection of modules; rather, its components work synergistically to adapt to the cabbage’s morphological changes. During forward propagation, the SPD-Conv module first ensures the lossless transmission of sub-pixel features for small seedlings. Subsequently, the customized C2f-CX module utilizes large-kernel convolutions to capture the macroscopic spherical contours of mature plants. The EMA mechanism then performs cross-spatial filtering to eliminate complex field background noise, while the WIoU v3 loss function resolves regression challenges caused by severe leaf occlusion. This tightly coupled theoretical paradigm effectively bridges the gap between general-purpose detection algorithms and the rigorous demands of full-growth-cycle crop perception. Furthermore, this stable 2D semantic extraction fundamentally suppresses bounding box jitter, providing a reliable geometric prerequisite for subsequent 3D camera-LiDAR fusion and advancing the theoretical framework of multi-sensor agricultural perception.
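The lossless downsampling claimed for SPD-Conv rests on a space-to-depth rearrangement: pixels are moved into the channel dimension instead of being discarded by a stride. A minimal sketch of that rearrangement for scale P = 2 (in the full module, a non-strided convolution follows the concatenation):

```python
import numpy as np

# Space-to-depth rearrangement at the core of SPD-Conv (scale P = 2):
# every PxP pixel block is moved into the channel dimension, so spatial
# resolution halves without discarding any pixel, unlike strided convolution.
def space_to_depth(x: np.ndarray, p: int = 2) -> np.ndarray:
    """(C, H, W) -> (C*p*p, H//p, W//p) with no information loss."""
    c, h, w = x.shape
    assert h % p == 0 and w % p == 0
    # Slice the p*p pixel-offset sub-maps and concatenate along channels.
    subs = [x[:, i::p, j::p] for i in range(p) for j in range(p)]
    return np.concatenate(subs, axis=0)

x = np.arange(3 * 4 * 4).reshape(3, 4, 4).astype(np.float32)
y = space_to_depth(x)
print(y.shape)           # (12, 2, 2)
assert y.size == x.size  # every input value is preserved
```

Because every sub-pixel value survives the rearrangement, fine seedling features remain available to later layers, which is precisely the property the small-target branch depends on.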
While the principled verification of the system has been completed in a controlled indoor environment, further optimization is required for unstructured, real-world field scenarios. Future research will therefore focus on validating the robustness of the system across broader field environments, systematically analyzing the influence of dynamic mechanical vibrations and volatile illumination on 3D positioning stability. To reduce dataset bias and improve generalizability, the training dataset was augmented with public images and simulated extreme weather conditions.
Furthermore, we acknowledge a limitation regarding cultivar diversity. The current dataset is primarily composed of locally dominant green cabbage cultivars. Although the algorithm fundamentally relies on geometric contours and texture features—rather than color-specific attributes—its optimal performance on distinctively different varieties, such as red or purple cabbages, requires further empirical validation. Therefore, expanding the dataset to include multi-color cabbage cultivars and applying transfer learning will be a priority to enhance global generalization. Future work will focus on migrating these domain-specific optimizations to the latest architectures (e.g., YOLOv12 and YOLOv26) and further algorithmic lightweighting for stable edge deployment.

Author Contributions

Conceptualization, J.H. and C.X.; methodology, J.H., C.X. and D.L.; software, D.L.; validation, D.L.; formal analysis, D.L.; investigation, D.L.; resources, J.H. and C.X.; data curation, D.L.; writing—original draft preparation, D.L.; writing—review and editing, J.H. and C.X.; visualization, D.L.; supervision, J.H. and C.X.; project administration, J.H. and C.X.; funding acquisition, J.H. and C.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Foundation of the Taizhou Science and Technology Program (No: TG202301).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data underlying this paper cannot be shared publicly, as they are required for further research. The data are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Asano, M.; Onishi, K.; Fukao, T. Robust Cabbage Recognition and Automatic Harvesting under Environmental Changes. Adv. Rob. 2023, 37, 960–969. [Google Scholar] [CrossRef]
  2. Júnior, M.R.B.; Santos, R.G.d.; Sales, L.d.A.; Oliveira, L.P.d. Advancements in Agricultural Ground Robots for Specialty Crops: An Overview of Innovations, Challenges, and Prospects. Plants 2024, 13, 3372. [Google Scholar] [CrossRef] [PubMed]
  3. Thakur, A.; Venu, S.; Gurusamy, M. An Extensive Review on Agricultural Robots with a Focus on Their Perception Systems. Comput. Electron. Agric. 2023, 212, 108146. [Google Scholar] [CrossRef]
  4. Huang, Y.; Xu, S.; Chen, H.; Li, G.; Dong, H.; Yu, J.; Zhang, X.; Chen, R. A Review of Visual Perception Technology for Intelligent Fruit Harvesting Robots. Front. Plant Sci. 2025, 16, 1646871. [Google Scholar] [CrossRef]
  5. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS); IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
  6. Appe, S.N.; Arulselvi, G.; Balaji, G.N. CAM-YOLO: Tomato Detection and Classification Based on Improved YOLOv5 Using Combining Attention Mechanism. PeerJ Comput. Sci. 2023, 9, e1463. [Google Scholar] [CrossRef] [PubMed]
  7. Bai, Y.; Yu, J.; Yang, S.; Ning, J. An Improved YOLO Algorithm for Detecting Flowers and Fruits on Strawberry Seedlings. Biosyst. Eng. 2024, 237, 1–12. [Google Scholar] [CrossRef]
  8. Lawal, M.O. Tomato Detection Based on Modified YOLOv3 Framework. Sci. Rep. 2021, 11, 1447. [Google Scholar] [CrossRef]
  9. Nan, Y.; Zhang, H.; Zeng, Y.; Zheng, J.; Ge, Y. Intelligent Detection of Multi-Class Pitaya Fruits in Target Picking Row Based on WGB-YOLO Network. Comput. Electron. Agric. 2023, 208, 107780. [Google Scholar] [CrossRef]
  10. Hai, T.; Shao, Y.; Zhang, X.; Yuan, G.; Jia, R.; Fu, Z.; Wu, X.; Ge, X.; Song, Y.; Dong, M.; et al. An Efficient Model for Leafy Vegetable Disease Detection and Segmentation Based on Few-Shot Learning Framework and Prototype Attention Mechanism. Plants 2025, 14, 760. [Google Scholar] [CrossRef]
  11. Fu, Y.; Shi, C. ProtoLeafNet: A Prototype Attention-Based Leafy Vegetable Disease Detection and Segmentation Network for Sustainable Agriculture. Sustainability 2025, 17, 7443. [Google Scholar] [CrossRef]
  12. Yuan, K.; Wang, Q.; Mi, Y.; Luo, Y.; Zhao, Z. Improved Feature Fusion in YOLOv5 for Accurate Detection and Counting of Chinese Flowering Cabbage (Brassica campestris L. ssp. chinensis var. utilis Tsen et Lee) Buds. Agronomy 2023, 14, 42. [Google Scholar] [CrossRef]
  13. Zheng, J.; Wang, X.; Shi, Y.; Zhang, X.; Wu, Y.; Wang, D.; Huang, X.; Wang, Y.; Wang, J.; Zhang, J. Keypoint Detection and Diameter Estimation of Cabbage (Brassica oleracea L.) Heads under Varying Occlusion Degrees via YOLOv8n-CK Network. Comput. Electron. Agric. 2024, 226, 109428. [Google Scholar] [CrossRef]
  14. Tian, Y.; Zhao, C.; Zhang, T.; Wu, H.; Zhao, Y. Recognition Method of Cabbage Heads at Harvest Stage under Complex Background Based on Improved YOLOv8n. Agriculture 2024, 14, 1125. [Google Scholar] [CrossRef]
  15. Jiang, P. Field Cabbage Detection and Positioning System Based on Improved YOLOv8n. Plant Methods 2024, 20, 96. [Google Scholar] [CrossRef] [PubMed]
  16. Tian, Y.; Cao, X.; Zhang, T.; Wu, H.; Zhao, C.; Zhao, Y. CabbageNet: Deep Learning for High-Precision Cabbage Segmentation in Complex Settings for Autonomous Harvesting Robotics. Sensors 2024, 24, 8115. [Google Scholar] [CrossRef] [PubMed]
  17. Yang, Z.; Wang, X.; Wang, Z.; Xu, Q.; Xu, X.; Liu, H. Improving Stability of Gaze Target Detection in Videos. In Proceedings of the IECON 2023—49th Annual Conference of the IEEE Industrial Electronics Society, Singapore, 16–19 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
  18. Xu, G.; Cao, X.; Liu, J.; Fan, J.; Li, E.; Long, X. Robust and Accurate Depth Estimation by Fusing LiDAR and Stereo. Meas. Sci. Technol. 2023, 34, 125107. [Google Scholar] [CrossRef]
  19. Song, H.; Choi, W.; Kim, H. Robust Vision-Based Relative-Localization Approach Using an RGB-Depth Camera and LiDAR Sensor Fusion. IEEE Trans. Ind. Electron. 2016, 63, 3725–3736. [Google Scholar] [CrossRef]
  20. Liu, H.; Wu, C.; Wang, H. Real Time Object Detection Using LiDAR and Camera Fusion for Autonomous Driving. Sci. Rep. 2023, 13, 8056. [Google Scholar] [CrossRef]
  21. Gao, S.; Chen, X.; Wu, X.; Zeng, T.; Xie, X. Analysis of Ranging Error of Parallel Binocular Vision System. In Proceedings of the 2020 IEEE International Conference on Mechatronics and Automation (ICMA), Beijing, China, 13–16 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 621–625. [Google Scholar]
  22. Li, N.; Ho, C.P.; Xue, J.; Lim, L.W.; Chen, G.; Fu, Y.H.; Lee, L.Y.T. A Progress Review on Solid-State LiDAR and Nanophotonics-Based LiDAR Sensors. Laser Photonics Rev. 2022, 16, 2100511. [Google Scholar] [CrossRef]
  23. Karim, M.R.; Reza, M.N.; Jin, H.; Haque, M.A.; Lee, K.-H.; Sung, J.; Chung, S.-O. Application of LiDAR Sensors for Crop and Working Environment Recognition in Agriculture: A Review. Remote Sens. 2024, 16, 4623. [Google Scholar] [CrossRef]
  24. Kang, H.; Wang, X.; Chen, C. Accurate Fruit Localisation Using High Resolution LiDAR-Camera Fusion and Instance Segmentation. Comput. Electron. Agric. 2022, 203, 107450. [Google Scholar] [CrossRef]
  25. Ban, C. A Camera-LiDAR-IMU Fusion Method for Real-Time Extraction of Navigation Line between Maize Field Rows. Comput. Electron. Agric. 2024, 223, 109114. [Google Scholar] [CrossRef]
  26. Hu, X.; Zhang, X.; Chen, X.; Zheng, L. Research on Corn Leaf and Stalk Recognition and Ranging Technology Based on LiDAR and Camera Fusion. Sensors 2024, 24, 5422. [Google Scholar] [CrossRef] [PubMed]
  27. Oppenheim, A.V.; Schafer, R.W. Discrete-Time Signal Processing, 3rd ed.; Pearson: Upper Saddle River, NJ, USA, 2009; pp. 140–153. [Google Scholar]
  28. Sunkara, R.; Luo, T. No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects. In Machine Learning and Knowledge Discovery in Databases; Springer Nature Switzerland: Cham, Switzerland, 2023; Volume 13715, pp. 443–459. [Google Scholar]
  29. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-Designing and Scaling ConvNets with Masked Autoencoders. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 16133–16142. [Google Scholar]
  30. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification, 2nd ed.; John Wiley & Sons: New York, NY, USA, 2000. [Google Scholar]
  31. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
  32. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  33. Sun, S.; Mo, B.; Xu, J.; Li, D.; Zhao, J.; Han, S. Multi-YOLOv8: An Infrared Moving Small Object Detection Model Based on YOLOv8 for Air Vehicle. Neurocomputing 2024, 588, 127685. [Google Scholar] [CrossRef]
  34. Zhang, Z. Flexible Camera Calibration by Viewing a Plane from Unknown Orientations. In Proceedings of the Seventh IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 1999; Volume 1, pp. 666–673. [Google Scholar]
  35. Yuan, C.; Liu, X.; Hong, X.; Zhang, F. Pixel-Level Extrinsic Self Calibration of High Resolution LiDAR and Camera in Targetless Environments. IEEE Robot. Autom. Lett. 2021, 6, 7517–7524. [Google Scholar] [CrossRef]
  36. Liu, X.; Jing, X.; Jiang, H.; Younas, S.; Wei, R.; Dang, H.; Wu, Z.; Fu, L. Performance Evaluation of Newly Released Cameras for Fruit Detection and Localization in Complex Kiwifruit Orchard Environments. J. Field Rob. 2024, 41, 881–894. [Google Scholar] [CrossRef]
  37. Coll-Ribes, G.; Torres-Rodríguez, I.J.; Grau, A.; Guerra, E.; Sanfeliu, A. Accurate Detection and Depth Estimation of Table Grapes and Peduncles for Robot Harvesting, Combining Monocular Depth Estimation and CNN Methods. Comput. Electron. Agric. 2023, 215, 108362. [Google Scholar] [CrossRef]
  38. Abeyrathna, R.M.R.D.; Nakaguchi, V.M.; Minn, A.; Ahamed, T. Recognition and Counting of Apples in a Dynamic State Using a 3D Camera and Deep Learning Algorithms for Robotic Harvesting Systems. Sensors 2023, 23, 3810. [Google Scholar] [CrossRef]
  39. Fan, Z.; Sun, N.; Qiu, Q.; Li, T.; Zhao, C. Depth Ranging Performance Evaluation and Improvement for RGB-D Cameras on Field-Based High-Throughput Phenotyping Robots. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: Piscataway, NJ, USA, 2021; pp. 3299–3304. [Google Scholar]
  40. Neupane, C.; Koirala, A.; Wang, Z.; Walsh, K.B. Evaluation of Depth Cameras for Use in Fruit Localization and Sizing: Finding a Successor to Kinect V2. Agronomy 2021, 11, 1780. [Google Scholar] [CrossRef]
Figure 1. Overview of the experimental site location in Jiangyan District, Taizhou, and a representative field scenario of heading-stage cabbage under challenging strong direct sunlight.
Figure 2. Visual examples of the 11 data augmentation techniques applied to the cabbage dataset. (a) The original image; (b–d) Photonic and sensor noise simulations, corresponding to salt-and-pepper, Poisson, and Laplacian noise, respectively; (e) Motion blur; (f,g) Illumination variations (overexposure and high contrast); (h–k) Adverse weather simulations including cloudy, rainy, foggy, and snowy conditions; (l) CoarseDropout to simulate severe leaf-level occlusions.
Figure 3. Dataset annotation strategy and morphological categories of cabbage across the whole growth cycle. (A) Examples of bounding box annotations for ripe cabbages, illustrating the challenge of severe leaf occlusion in natural field environments. (B) Definitions of the two label categories based on maturity: unripe (seedling and rosette stages) and ripe (heading stage).
Figure 4. Overall architecture of the SCEW-YOLOv8.
Figure 5. Schematic diagram of SPD-Conv when P = 2. The different colors represent the sub-feature maps obtained by downsampling at different pixel offsets; the plus symbol denotes channel concatenation, and the cross symbol denotes the subsequent convolution operation.
Figure 6. Architecture of the designed C2f-CX module. (a) ConvNeXt V2 unit, which incorporates an internal residual connection to ensure gradient stability during training; (b) Overall structure of C2f-CX, where n ConvNeXt V2 units with large-kernel perception are employed as the bottleneck components to enhance feature modeling capabilities in complex field environments.
Figure 7. Architecture of the Efficient Multi-Scale Attention (EMA) module. The module groups input features and extracts multi-scale spatial information through parallel 1D and 3D branches. The dashed box highlights the cross-space learning mechanism, where information from different dimensions is aggregated to reconstruct significant features for cabbage detection.
Figure 8. Physical implementation of the integrated camera-LiDAR fusion sensing system hardware.
Figure 9. Logical block diagram of multi-sensor temporal synchronization mechanism based on the ROS2 framework. The arrows represent the data flow between ROS2 nodes.
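The temporal synchronization shown in Figure 9 pairs each camera frame with the LiDAR packet closest in time; in ROS2 this role is typically filled by message_filters' ApproximateTimeSynchronizer. The pairing rule itself can be sketched without ROS, assuming an illustrative 25 ms slop window (the window size is not a parameter reported here):

```python
# Greedy approximate-time pairing: accept a (frame, scan) pair only if the
# timestamps differ by less than a slop window. The slop of 25 ms and the
# example stamp lists are illustrative assumptions.
def pair_by_timestamp(cam_stamps, lidar_stamps, slop=0.025):
    """Pair each camera stamp with the nearest LiDAR stamp within `slop` seconds."""
    pairs = []
    for t_cam in cam_stamps:
        t_lidar = min(lidar_stamps, key=lambda t: abs(t - t_cam))
        if abs(t_lidar - t_cam) <= slop:
            pairs.append((t_cam, t_lidar))
    return pairs

cam = [0.00, 0.05, 0.10, 0.15]   # ~20 FPS camera stamps (s)
lidar = [0.01, 0.11, 0.21]       # ~10 Hz LiDAR stamps (s)
print(pair_by_timestamp(cam, lidar))  # only closely matched pairs survive
```

Frames without a sufficiently close scan are simply dropped, which trades throughput for the guarantee that every fused measurement describes one instant in time.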
Figure 11. Visualization of the targetless spatial extrinsic calibration between the monocular camera and LiDAR. The process shows the iterative spatial matching algorithm, where the 3D LiDAR point cloud edge features are continuously aligned with the 2D image edges extracted via the Canny operator, ultimately solving the rigid transformation matrix without a calibration target.
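Once the rigid transformation in Figure 11 is solved, fusion reduces to projecting each LiDAR point through the pinhole model, u = K(RX + t), so that depth can be attached to pixels inside a detected bounding box. A minimal sketch with made-up intrinsics and extrinsics (placeholder values, not the calibrated parameters of this system):

```python
import numpy as np

# Project LiDAR points into the image once extrinsics [R|t] are known.
# K, R, and t below are illustrative placeholders for the calibrated values.
K = np.array([[800.0,   0.0, 812.0],
              [  0.0, 800.0, 620.0],
              [  0.0,   0.0,   1.0]])   # assumed pinhole intrinsics
R = np.eye(3)                            # assumed extrinsic rotation
t = np.array([0.0, 0.05, 0.0])           # assumed lever arm (m)

def project_points(pts_lidar: np.ndarray) -> np.ndarray:
    """Project Nx3 LiDAR points (camera looks along +Z) to Nx2 pixel coords."""
    pts_cam = pts_lidar @ R.T + t        # LiDAR frame -> camera frame
    uvw = pts_cam @ K.T                  # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]      # perspective divide

pts = np.array([[0.0, 0.0, 1.5]])        # a point 1.5 m ahead of the sensor
print(project_points(pts))
```

Points that land inside a SCEW-YOLOv8 bounding box can then be pooled (after outlier filtering) to yield the 3D position of the detected cabbage.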
Figure 12. Node interaction and data flow block diagram of the fusion ranging system based on ROS2. The arrows indicate the direction of data flow.
Figure 13. Performance comparison across different object recognition frameworks. Bars indicate accuracy (mAP@0.5), and the line denotes parameters (M). Our model achieves the optimal accuracy-efficiency trade-off.
Figure 14. Training convergence dynamics of various identification models over 200 epochs. (a) precision; (b) recall; (c) mAP@0.5; (d) mAP@0.5:0.95.
Figure 15. Comparison of cabbage identification results between the baseline YOLOv8n and the enhanced SCEW-YOLOv8 model.
Figure 16. Physical image of the fusion ranging experimental site in an indoor controlled environment.
Figure 17. Visualization interface of the fusion ranging results. The colors in the LiDAR point cloud represent the height (X-axis) of the objects, with warmer colors (orange/red) indicating higher positions relative to the ground.
Table 1. Key Specifications of the Dual-Sensor Fusion Perception System.

| Device | Type | Parameter | Value |
|---|---|---|---|
| Solid-state LiDAR | Mid-40 (Livox, Shenzhen, China) | Maximum detection range/m | 260 (@ 80% target reflectivity) |
| | | Range precision/cm | <2 @ 20 m, 80% target reflectivity |
| | | Circular field of view (FoV)/° | 38.4 |
| | | Point rate/points·s⁻¹ | 100,000 |
| | | Frame rate/frame·s⁻¹ | 10 (typical) |
| | | Dimension (L × W × H)/mm | 88 × 69 × 76 |
| RGB industrial camera | MV-CS020-10UC (HIKROBOT, Hangzhou, China) | Resolution/pixel | 1624 × 1240 |
| | | Lens focal length/mm | 8 |
| | | Maximum frame rate/frame·s⁻¹ | 90 |
| | | Dimension (L × W × H)/mm | 29 × 29 × 30 |
Table 2. Performance comparison of different detection models.

| Model | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Parameters/M | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|
| Faster R-CNN | 91.4 | 82.0 | 90.8 | 69.3 | 41.13 | 70.2 | 13.5 |
| SSD | 81.9 | 72.2 | 80.5 | 41.2 | 23.7 | 35.0 | 27.8 |
| YOLOv5n | 88.7 | 84.2 | 90.5 | 68.8 | 2.5 | 7.2 | 61.3 |
| YOLOv8n | 90.1 | 85.7 | 93.5 | 72.4 | 3.0 | 8.2 | 65.8 |
| YOLOv8s | 91.2 | 87.1 | 94.2 | 73.5 | 11.2 | 28.6 | 48.5 |
| YOLOv9t | 90.7 | 86.3 | 94.0 | 72.0 | 2.5 | 7.5 | 70.2 |
| RT-DETR-R18 | 90.4 | 87.8 | 93.9 | 71.8 | 11.4 | 12.1 | 22.5 |
| EfficientDetV2-Lite0 | 91.2 | 87.0 | 94.1 | 72.2 | 18.5 | 20.7 | 18.2 |
| YOLOv10n | 87.3 | 87.0 | 92.9 | 69.7 | 3.2 | 8.4 | 68.5 |
| YOLO11n | 92.2 | 88.1 | 95.1 | 73.0 | 3.5 | 6.4 | 67.3 |
| Ours | 92.6 | 90.8 | 95.8 | 73.8 | 3.8 | 8.5 | 64.2 |
Table 3. Results of the ablation experiments.

| Group | SPD-Conv | C2f-CX | EMA | WIoU v3 | Precision/% | Recall/% | mAP@0.5/% | FPS |
|---|---|---|---|---|---|---|---|---|
| 1 (Baseline) | - | - | - | - | 90.1 | 85.7 | 93.5 | 65.8 |
| 2 | ✓ | - | - | - | 90.9 | 88.2 | 94.2 | 65.4 |
| 3 | - | ✓ | - | - | 90.7 | 87.5 | 94.0 | 65.5 |
| 4 | - | - | ✓ | - | 90.5 | 86.9 | 93.8 | 65.6 |
| 5 | - | - | - | ✓ | 90.3 | 86.3 | 93.6 | 65.7 |
| 6 | ✓ | ✓ | - | - | 91.5 | 88.9 | 94.8 | 65.0 |
| 7 | ✓ | - | ✓ | - | 91.2 | 88.5 | 94.5 | 65.2 |
| 8 | ✓ | - | - | ✓ | 91.1 | 88.3 | 94.4 | 65.3 |
| 9 | - | ✓ | ✓ | - | 91.0 | 88.1 | 94.3 | 65.3 |
| 10 | - | ✓ | - | ✓ | 90.9 | 87.9 | 94.2 | 65.4 |
| 11 | - | - | ✓ | ✓ | 90.8 | 87.7 | 94.1 | 65.4 |
| 12 | ✓ | ✓ | ✓ | - | 92.1 | 90.2 | 95.3 | 64.5 |
| 13 | ✓ | ✓ | - | ✓ | 92.0 | 90.0 | 95.2 | 64.6 |
| 14 | ✓ | - | ✓ | ✓ | 91.8 | 89.7 | 95.0 | 64.7 |
| 15 | - | ✓ | ✓ | ✓ | 91.7 | 89.5 | 94.9 | 64.8 |
| 16 (Full) | ✓ | ✓ | ✓ | ✓ | 92.6 | 90.8 | 95.8 | 64.2 |

Note: ✓ indicates the module is included, and - indicates the module is not included.
Table 4. Three-dimensional coordinate measurements of cabbage.

| ID | Real Coordinate (mm) | Detection Coordinate (mm) | Coordinate Error (mm) |
|---|---|---|---|
| 1 | (1000, −45, 75) | (1001.8, −47.7, 72.2) | (1.8, 2.7, 2.8) |
| 2 | (1050, 30, −20) | (1052.9, 32.1, −17.2) | (2.9, 2.1, 2.8) |
| 3 | (1100, −80, 120) | (1100.7, −82.8, 118.1) | (0.7, 2.8, 1.9) |
| 4 | (1150, 55, −60) | (1154.2, 52.2, −63.1) | (4.2, 2.8, 3.1) |
| 5 | (1200, −110, 90) | (1201.9, −112.2, 88.2) | (1.9, 2.2, 1.8) |
| 6 | (1250, 70, −95) | (1253.1, 73.8, −92.2) | (3.1, 3.8, 2.8) |
| 7 | (1300, −35, 150) | (1300.8, −37.1, 147.2) | (0.8, 2.1, 2.8) |
| 8 | (1350, 95, −40) | (1354.9, 92.2, −42.1) | (4.9, 2.8, 2.1) |
| 9 | (1400, −140, 60) | (1401.7, −142.9, 58.1) | (1.7, 2.9, 1.9) |
| 10 | (1450, 25, −130) | (1453.0, 27.9, −127.1) | (3.0, 2.9, 2.9) |
| 11 | (1500, −70, 180) | (1500.6, −72.1, 178.2) | (0.6, 2.1, 1.8) |
| 12 | (1550, 110, −75) | (1554.1, 107.2, −77.9) | (4.1, 2.8, 2.9) |
| 13 | (1600, −100, 110) | (1601.8, −102.9, 108.1) | (1.8, 2.9, 1.9) |
| 14 | (1650, 45, −160) | (1653.0, 47.9, −157.1) | (3.0, 2.9, 2.9) |
| 15 | (1700, −160, 85) | (1700.7, −162.1, 83.2) | (0.7, 2.1, 1.8) |
| 16 | (1750, 135, −20) | (1754.9, 132.2, −22.9) | (4.9, 2.8, 2.9) |
| 17 | (1800, −50, 210) | (1801.8, −52.9, 208.1) | (1.8, 2.9, 1.9) |
| 18 | (1850, 65, −190) | (1853.0, 67.9, −187.1) | (3.0, 2.9, 2.9) |
| 19 | (1900, −120, 140) | (1900.6, −122.1, 138.2) | (0.6, 2.1, 1.8) |
| 20 | (2000, 85, −110) | (2004.1, 82.2, −112.9) | (4.1, 2.8, 2.9) |
Table 5. Quantitative comparison of different 3D perception schemes.

| Hardware Representation | 3D Positioning Accuracy (Range: 1.0–2.0 m) | Robustness to Field Environment | Hardware Cost (USD) |
|---|---|---|---|
| Depth camera (e.g., RealSense D415/D435i/D455) | Error escalates to 30–50+ mm RMSE | Medium (prone to failure under strong direct sunlight or textureless leaves) | ~$350–540 |
| Monocular camera + solid-state LiDAR | 1.45 mm MAE (stable across range) | Excellent (physical ToF ranging, immune to ambient light) | $853.26 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Han, J.; Lyu, D.; Xia, C. SCEW-YOLOv8 Detection Model and Camera-LiDAR Fusion Positioning System for Whole-Growth-Cycle Management of Cabbage. Appl. Sci. 2026, 16, 3510. https://doi.org/10.3390/app16073510

