1. Introduction
Bok choy (Brassica rapa subsp. chinensis) is a nutritionally vital vegetable crop, yet its productivity is severely constrained by weed infestation [1]. Weeds compete with bok choy for critical resources, including light, water, and soil nutrients, while also serving as reservoirs for pests and pathogens [2], collectively diminishing crop yield and quality [3]. Conventional weed control methods face significant challenges: manual weeding is labor-intensive and inefficient [4], whereas chemical herbicides, often applied non-selectively, contribute to environmental contamination, herbicide resistance, and excessive agrochemical waste [5]. In organic production systems, stringent regulations prohibit the use of synthetic herbicides to align with ecological farming principles [6]. Advances in agricultural mechanization, automation, and intelligent technologies have facilitated the adoption of mechanical weeding in vegetable production systems, where its efficiency and precision help mitigate labor and sustainability challenges [7]. Precise weed detection is the key prerequisite for mechanical weeding.
Researchers have extensively explored weed detection methods [8]. Earlier studies primarily depended on manually engineered features such as the color, texture, morphology, and multi-spectral characteristics of weeds and crops [9]. However, due to the high visual similarity between weeds and crops, such approaches often suffer from limited detection accuracy and robustness [10]. With the rapid advancement of artificial intelligence, particularly deep learning (DL), automated extraction of hierarchical features from large datasets has become feasible, leading to widespread applications in image recognition, speech processing, natural language understanding, and autonomous driving [11,12]. Consequently, DL-based weed detection has gained increasing attention [13,14], and a growing number of researchers are employing various DL models for weed detection in crop fields [15,16,17,18].
While deep learning models have demonstrated remarkable success in weed detection for agricultural applications, significant challenges persist in achieving consistently high recognition rates and computational efficiency for specific crop varieties [19,20,21]. Recent advances have focused on architecture-level optimizations to overcome these limitations [8,22]. Chen et al. [23] developed the YOLOv8-EGC-Fusion (YEF) model through integration of Efficient Graph Convolution (EGC) and Group Context Anchor Attention (GCAA) modules, achieving a 3.2 percentage-point improvement in mAP (97.2% vs. 94.0%) over baseline YOLOv8 in vegetable fields. In another study, Kong et al. [24] proposed an attention-enhanced YOLO variant for cornfields, which reduced GPU memory consumption by 35% while maintaining 94.5% Precision through backbone replacement and feature fusion optimization. For rice paddies, Peng et al. [25] reconstructed RetinaNet’s backbone with multi-scale fusion techniques, demonstrating a 15% reduction in inference time with a negligible accuracy drop (<1.2%).
While DL has become a prominent tool for automated weed detection, offering alternatives to labor-intensive manual weeding and non-selective herbicide use, its prevailing direct detection paradigm—which requires models to recognize and classify weed species—encounters fundamental, application-specific bottlenecks in the context of seedling-stage bok choy fields. In this specific agronomic scenario, two intertwined bottlenecks render the direct detection approach particularly challenging. One is the data annotation bottleneck, as the prevailing direct detection paradigm requires training datasets exhaustively annotated with diverse weed species, which is prohibitively costly and often impractical to acquire for real-world fields. The other is the visual discrimination bottleneck. At the seedling stage, bok choy and weeds exhibit high visual similarity in color, texture, and morphology. This intrinsic ambiguity, combined with their random distribution, fundamentally limits the accuracy and robustness of models that rely on directly distinguishing crops from weeds based on visual features.
To circumvent the dependency on exhaustive weed species annotation and to overcome the inherent limitation of visual discrimination in this specific agronomic scenario, this study proposes a novel indirect weed detection strategy. Instead of directly identifying diverse weed species, which is hampered by the aforementioned bottlenecks, the approach re-frames the problem. This study aimed to: (1) achieve indirect weed detection through bok choy identification using the enhanced RCW-YOLOv10 model and ExG-based segmentation of residual vegetation; and (2) automatically generate weed distribution maps to facilitate precision weeding. Based on this, weed localization performance can be achieved that is comparable to direct, annotation-intensive methods, while completely eliminating the need for annotated weed training data.
2. Materials and Methods
2.1. Dataset
This study focused on developing the RCW-YOLOv10 module for detecting bok choy at the seedling stage. Image acquisition was conducted across 3 bok choy fields in Qixia District, Nanjing (32.120° N, 118.480° E), China, during typical growing seasons (May and October 2023). A Canon EOS600D digital camera (Canon Inc., Tokyo, Japan) was adopted to capture 1050 original images at 60 cm above ground level under varying illumination conditions (sunny, cloudy, and overcast) to ensure sample diversity. To ensure sample independence and enhance model generalization, each original image was acquired from a distinct, non-overlapping plot, and the training set was subsequently augmented via translation, rotation, mirroring, scaling, and brightness adjustment to mitigate overfitting. Following augmentation, the final dataset comprised 3000 images, with all samples retaining the original resolution of 1792 × 1344 pixels.
All 3000 images were manually annotated using LabelImg (v1.8.6) [26], generating XML annotation files. During annotation, when the bounding boxes of two distinct bok choy plants overlapped, they were kept as separate annotations if the Intersection over Union (IoU) was less than 20%. Each visually distinguishable bok choy plant, regardless of its proximity to others, was annotated with a single, dedicated bounding box to ensure that the model learns to detect individual plant instances. The dataset was then randomly split into training (60%), validation (20%), and testing (20%) subsets, ensuring balanced class distribution across all partitions.
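The 60/20/20 random split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the helper name `split_dataset`, the fixed seed, and the file names are assumptions introduced here.

```python
import random


def split_dataset(items, ratios=(0.6, 0.2, 0.2), seed=0):
    """Randomly partition items into train/validation/test subsets.

    The seed and the ratios default are illustrative assumptions.
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # reproducible shuffle
    n_train = round(len(items) * ratios[0])
    n_val = round(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test subset
    return train, val, test


# Hypothetical file names standing in for the 3000 annotated images.
image_ids = [f"img_{i:04d}.jpg" for i in range(3000)]
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))
```

With 3000 images this yields subsets of 1800, 600, and 600 images that are mutually disjoint.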
2.2. The RCW-YOLOv10 Module
The YOLOv10 architecture was selected as the foundation due to its favorable balance between accuracy and inference efficiency [27,28,29,30]. However, for detecting seedling-stage crops in dense, weed-infested fields, a scenario characterized by small objects and cluttered backgrounds, the standard backbone network presents limitations: conventional down-sampling operations may lose fine-grained details of small crops, and the feature fusion mechanisms may be suboptimal for distinguishing crops from visually similar weeds. To address these specific issues, two targeted optimizations were introduced to the backbone network: (1) replacing the original down-sampling modules with Robust Feature Down-sampling (RFD) blocks to enhance small-object detection capability; and (2) substituting the standard C2f modules (the faster CSP bottleneck with two convolutions) with Wide Dense Bottleneck Block variants (C2f-WDBBs) to improve feature representation while maintaining computational efficiency. The resulting RCW-YOLOv10 module, illustrated in Figure 1, is specifically designed to overcome the challenges posed by randomly distributed seedling-stage bok choy.
2.3. The RFD Module
The RFD module was designed to replace standard down-sampling layers in the YOLOv10 backbone, specifically to mitigate the loss of fine-grained features in small, sparse targets like seedling-stage bok choy. Its core mechanism employs multi-branch, multi-scale feature preservation at the point of resolution reduction. By capturing and combining complementary contextual information through parallel pathways, the module aims to retain the discriminative features of small crop seedlings and pass them on to deeper network layers [31]. The RFD module operates in two specialized variants tailored to different network depths. The Shallow Robust Feature Down-sampling (SRFD) variant, deployed early in the backbone, employs parallel 1 × 1 convolutions and residual connections to preserve the fine-grained details and edge information of seedling leaves that are susceptible to loss during initial down-sampling. The Deep Robust Feature Down-sampling (DRFD) variant, located in deeper stages, employs dilated convolutions and channel attention to capture higher-level abstract contextual features, enhancing the model’s ability to distinguish crops from complex, weedy backgrounds. Together, the two variants are intended to improve feature extraction for this specific task. The architecture of RFD, including details of the SRFD and DRFD variants, is illustrated in Figure 2.
2.4. C2f-WDBB Module
The C2f module is a fundamental feature-propagation component in YOLO architectures [32]. For the specific task of segmenting seedling-stage bok choy from weeds, the standard C2f structure was hypothesized to be suboptimal. The primary challenges are: (1) the detection targets (bok choy) are small, densely distributed, and exhibit high visual similarity to surrounding weeds; and (2) the system must be computationally efficient for potential field deployment. The fixed branching and single-path aggregation in the standard C2f module may not adequately model the complex, fine-grained feature interactions required to distinguish crops from weeds in such dense, visually ambiguous scenes. This could lead to a loss of discriminative detail that is crucial for high-precision segmentation.
To address these specific challenges—preserving critical detail for small, similar targets while maintaining a low computational footprint—the Wide Dense Bottleneck Block (WDBB) was designed to replace the core operations within C2f, resulting in the C2f-WDBB module. Its multi-branch design architecture, combined with structural re-parameterization, targets distinct aspects of the detection challenge: (1) A multi-branch feature extractor where each branch targets a specific visual characteristic: (a) The standard 3 × 3 convolution serves as the foundational feature learner; (b) the 1 × 1 convolution branch is intended to enhance local discriminative features critical for distinguishing bok choy from visually similar weed textures; (c) the average pooling branch promotes translation invariance, increasing robustness to minor positional shifts in plants in the field; and (d) the WDBB-specific horizontal and vertical convolutional branches are designed to capture the oriented, elongated stem and leaf structures typical of seedling bok choy, which are often lost in isotropic convolution operations. (2) Width expansion increases channel dimensions to boost feature diversity, allowing the network to model the broader set of visual patterns present in complex field scenes. (3) Structural re-parameterization is employed to decouple training-time capacity from inference-time efficiency. During training, the multi-branch structure provides a richer gradient flow and learning capacity, enabling the network to better fit the complex visual patterns of crops and weeds. During inference, these branches are linearly fused into a single, efficient convolutional layer. This ensures the robustness gained from multi-branch training is preserved without sacrificing inference speed or accuracy, as the fused layer equivalently represents the learned feature transformations. The re-parameterized weights are computed by:
while absorbing batch normalization parameters for efficiency. The variables in the formula are defined in Table 1, and the architecture of C2f-WDBB is depicted in Figure 3.
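The general idea behind structural re-parameterization can be illustrated with a small numerical check. The sketch below is not the WDBB implementation: it uses a single channel, made-up kernel values and batch normalization statistics, and hypothetical helper names (`conv3x3`, `embed`, `bn_scale_shift`). It only shows that parallel linear branches (3 × 3, 1 × 1, average pooling, horizontal, and vertical kernels), each followed by batch normalization, collapse into one 3 × 3 kernel plus a bias.

```python
def conv3x3(img, k, bias=0.0):
    """Single-channel 3x3 cross-correlation with zero padding (stride 1)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = bias
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += k[dy + 1][dx + 1] * img[yy][xx]
            out[y][x] = acc
    return out


def embed(kernel):
    """Zero-pad a 1x1, 1x3, 3x1 or 3x3 kernel into the centre of a 3x3 kernel."""
    rows, cols = len(kernel), len(kernel[0])
    r0, c0 = (3 - rows) // 2, (3 - cols) // 2
    k3 = [[0.0] * 3 for _ in range(3)]
    for i in range(rows):
        for j in range(cols):
            k3[r0 + i][c0 + j] = kernel[i][j]
    return k3


def bn_scale_shift(gamma, beta, mean, var, eps=1e-5):
    """Batch norm at inference is an affine map: y = scale * x + shift."""
    scale = gamma / (var + eps) ** 0.5
    return scale, beta - mean * scale


# Illustrative branch kernels and batch-norm statistics (made-up values).
branches = [
    (embed([[0.2, -0.1, 0.0], [0.5, 1.0, -0.3], [0.1, 0.0, 0.4]]), (1.1, 0.05, 0.2, 0.9)),  # 3x3 conv
    (embed([[0.7]]), (0.9, -0.02, 0.1, 1.2)),                        # 1x1 conv
    (embed([[1 / 9] * 3 for _ in range(3)]), (1.0, 0.0, 0.0, 1.0)),  # 3x3 average pooling
    (embed([[0.3, -0.2, 0.1]]), (0.8, 0.01, -0.1, 0.7)),             # horizontal 1x3 conv
    (embed([[0.4], [0.1], [-0.5]]), (1.2, 0.03, 0.05, 1.1)),         # vertical 3x1 conv
]
x = [[float((3 * i + 5 * j) % 7) for j in range(5)] for i in range(4)]

# Training-time form: run every branch, apply its batch norm, sum the outputs.
multi = [[0.0] * 5 for _ in range(4)]
for k, bn in branches:
    s, t = bn_scale_shift(*bn)
    y = conv3x3(x, k)
    multi = [[m + s * v + t for m, v in zip(mr, yr)] for mr, yr in zip(multi, y)]

# Inference-time form: fold each batch norm into its kernel, then add kernels and biases.
fused_k = [[0.0] * 3 for _ in range(3)]
fused_b = 0.0
for k, bn in branches:
    s, t = bn_scale_shift(*bn)
    fused_k = [[f + s * v for f, v in zip(fr, kr)] for fr, kr in zip(fused_k, k)]
    fused_b += t
single = conv3x3(x, fused_k, fused_b)

max_diff = max(abs(a - b) for ra, rb in zip(multi, single) for a, b in zip(ra, rb))
print(max_diff)
```

The two outputs agree up to floating-point rounding because every branch is linear, which is why the fused layer can replace the multi-branch structure at inference without changing the result.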
2.5. Weed Detection
Weed species exhibit remarkable morphological diversity owing to different crop types, growth stages, and population densities, posing significant challenges for direct weed detection and compromising system robustness [33]. The proposed framework employs an indirect weed detection strategy, which reframes the problem from direct species recognition to a sequential pipeline of crop localization followed by residual vegetation processing. The workflow begins with the bounding boxes of bok choy detected by the RCW-YOLOv10 model, from which a binary crop mask was generated. To conservatively handle potential localization inaccuracies at crop leaf boundaries and to prioritize the avoidance of misclassifying crop pixels as weeds, this initial mask was morphologically dilated, establishing a buffer zone around each detected plant. Subsequently, vegetation outside this exclusion zone was segmented. An optimized Excess Green (ExG) index was applied to the RGB image, using normalized channel values for robustness, and the result was thresholded to obtain an initial binary vegetation mask. This mask was then refined through post-processing: a morphological opening with a 3 × 3 kernel removed small, isolated noise pixels; a morphological closing with the same kernel filled small holes within potential weed regions; and finally, connected components with an area smaller than a defined threshold (150 pixels) were filtered out as agronomically insignificant residues. The centroids of the remaining connected components were calculated, yielding the image-coordinate locations of identified weed patches. To translate this into an actionable format for field operations, the image was partitioned into a uniform 6 × 8 grid (48 cells of 224 × 224 pixels). A grid cell was designated as a candidate treatment zone if it contained one or more weed centroids, thereby generating the final weed distribution map, which spatially discretizes weed presence for potential guidance of targeted intervention. The full system workflow is illustrated in Figure 4.
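The final two steps of this pipeline, area filtering and grid assignment, can be sketched as follows. The sketch assumes that connected components have already been extracted and are given as (area, centroid) pairs; the function name `weed_grid` and the example components are hypothetical, while the cell size, grid shape, and area threshold are taken from the text.

```python
CELL = 224          # cell side in pixels (from the text)
ROWS, COLS = 6, 8   # 1344 / 224 rows and 1792 / 224 columns
MIN_AREA = 150      # minimum component area in pixels (from the text)


def weed_grid(components):
    """Mark grid cells that contain at least one retained weed centroid.

    components: iterable of (area_px, (cx, cy)) tuples in image coordinates.
    """
    grid = [[False] * COLS for _ in range(ROWS)]
    for area, (cx, cy) in components:
        if area < MIN_AREA:
            continue  # discard small residues
        # Clamp so a centroid on the last pixel row/column stays inside the grid.
        row = min(int(cy // CELL), ROWS - 1)
        col = min(int(cx // CELL), COLS - 1)
        grid[row][col] = True
    return grid


# Made-up components for illustration only.
components = [
    (400, (100.0, 50.0)),
    (90, (500.0, 500.0)),      # below the area threshold, ignored
    (220, (1700.5, 1300.2)),
    (300, (450.0, 230.0)),
]
grid = weed_grid(components)
zones = [(r, c) for r in range(ROWS) for c in range(COLS) if grid[r][c]]
print(zones)
```

Each (row, column) pair in `zones` corresponds to one candidate treatment zone in the weed distribution map.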
2.6. The Experimental Platform
The experimental setup utilized PyTorch (version 1.13.0 with CUDA 11.6, https://pytorch.org, accessed 10 December 2024) on a high-performance computing platform featuring 128 GB RAM, an Intel Core i9-10920X CPU (3.50 GHz), and an NVIDIA RTX 3080 Ti GPU running Ubuntu 20.04.1. The RCW-YOLOv10 model was pre-trained on ImageNet, a large dataset with more than 14 million labeled images [34], to initialize the weights via transfer learning. A cosine annealing scheduler was employed to adjust the learning rate, decaying from the initial value (0.01) to 1 × 10⁻⁵ over the 100 training epochs, with a batch size of 16. The core hyperparameters followed the recommended configurations for YOLOv10, with optimization by Stochastic Gradient Descent (SGD) using an initial learning rate of 0.01, momentum of 0.937, and weight decay of 0.0005.
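For reference, the standard closed form of a cosine annealing schedule with the stated endpoints can be written as a short function. Whether the authors stepped the schedule per epoch or per iteration, and whether any warm-up was used, is not stated, so the per-epoch form below is an assumption.

```python
import math


def cosine_lr(epoch, total_epochs=100, lr_max=0.01, lr_min=1e-5):
    """Cosine annealing from lr_max (epoch 0) to lr_min (epoch total_epochs)."""
    cos_term = 1 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


for epoch in (0, 25, 50, 75, 100):
    print(epoch, cosine_lr(epoch))
```

The learning rate starts at 0.01, passes through the midpoint of the two endpoints at epoch 50, and reaches 1 × 10⁻⁵ at epoch 100.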
2.7. Evaluation Metrics
The detection performance was comprehensively evaluated across both accuracy and efficiency using six key metrics: Precision, Recall, mAP50, mAP50-95, Parameters, and Giga Floating-point Operations (GFLOPs).
Precision measures the accuracy of positive predictions:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall measures the model’s ability to identify all positive instances:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
In these metrics, true positives (TP) are samples correctly identified as positive (actual positives predicted as positive), false positives (FP) are samples incorrectly classified as positive (actual negatives predicted as positive), and false negatives (FN) are samples wrongly rejected as negative (actual positives predicted as negative), where all terms are defined by comparing the model predictions with the ground truth labels.
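Intersection over Union (IoU) quantifies the spatial overlap between a predicted bounding box and its ground truth box, calculated as:
$$\mathrm{IoU} = \frac{|B_{p} \cap B_{gt}|}{|B_{p} \cup B_{gt}|}$$
where $B_{p}$ and $B_{gt}$ denote the predicted and ground truth bounding boxes, respectively, and $|\cdot|$ denotes area.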
For the evaluation, a predicted bounding box is considered a TP if its IoU with a ground truth box is greater than 0.5, following a common threshold in object detection for practical applications.
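A minimal sketch of how IoU, Precision, and Recall fit together is given below. It assumes boxes in (x1, y1, x2, y2) format and a greedy, score-ordered, one-to-one matching between predictions and ground truth, which is a common convention but not necessarily the exact procedure used by the evaluation toolkit; the boxes are made up for illustration.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, gts, thr=0.5):
    """preds: list of (score, box); gts: list of boxes.

    Each ground truth box can be matched by at most one prediction.
    """
    matched = set()
    tp = 0
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_j, best_iou = -1, thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            v = iou(box, gt)
            if v > best_iou:  # strictly greater than the threshold
                best_j, best_iou = j, v
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp
    fn = len(gts) - tp
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall


# Made-up boxes: two plants, three predictions (one spurious).
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (1, 1, 10, 10)), (0.8, (21, 19, 31, 29)), (0.3, (50, 50, 60, 60))]
precision, recall = precision_recall(preds, gts)
print(precision, recall)
```

In this toy case two of the three predictions match a ground truth box, so Precision is 2/3 and Recall is 1.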
Average Precision (AP) quantifies the model’s detection quality across all Recall levels by integrating the Precision–Recall curve. For object detection, it is computed by:
$$\mathrm{AP} = \int_{0}^{1} p(r)\,dr$$
where p(r) is the Precision at Recall r, with the curve interpolated to ensure monotonicity.
Mean Average Precision (mAP) extends AP to multi-class scenarios:
$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_{i}$$
with N being the number of classes.
In object detection evaluation, mAP50 and mAP50-95 are standardized variants of mAP. mAP50 calculates the mean AP across all categories at a fixed IoU threshold of 0.5, considering predictions with IoU exceeding 0.5 as true positives, and reflects baseline performance under moderate localization requirements. mAP50-95 rigorously evaluates both classification and localization by averaging mAP over 10 equidistant IoU thresholds from 0.5 to 0.95 in 0.05 increments:
$$\mathrm{mAP}_{50\text{-}95} = \frac{1}{10}\sum_{t \in \{0.50,\,0.55,\,\ldots,\,0.95\}} \mathrm{mAP}_{t}$$
where $\mathrm{mAP}_{t}$ denotes the mAP computed at IoU threshold $t$. The upper end of this range (IoU = 0.95) demands precise bounding box alignment for a positive identification.
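The AP integral and the mAP50-95 average can be sketched numerically as follows. The sketch uses all-point interpolation with a non-increasing Precision envelope, which is one common way to enforce monotonicity; evaluation toolkits may sample the curve differently. The Precision–Recall points and the per-threshold AP function are made-up placeholders, not results from this study.

```python
def average_precision(recalls, precisions):
    """Area under the Precision-Recall curve with a monotone Precision envelope.

    recalls must be sorted in ascending order.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make Precision non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def map50_95(ap_at):
    """Average a per-threshold AP function over IoU thresholds 0.50 to 0.95."""
    thresholds = [0.5 + 0.05 * i for i in range(10)]
    return sum(ap_at(t) for t in thresholds) / len(thresholds)


# Toy curve: Precision 1.0 up to Recall 0.5, then 0.5 up to Recall 1.0.
ap = average_precision([0.5, 1.0], [1.0, 0.5])
# Placeholder AP-versus-threshold relation, for illustration only.
toy = map50_95(lambda t: 1.0 - t)
print(ap, toy)
```

For a single-class problem such as bok choy detection, N = 1, so mAP at a given threshold equals the AP of that class.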
For real-time weeding applications, model complexity directly impacts field deployment viability, which can be evaluated through two quantifiable metrics: (1) Parameters, the total number of trainable elements (weights and biases), with a higher parameter count increasing training time and computational resource demands; and (2) GFLOPs, a measure of the computational cost of a single forward pass, with lower values indicating reduced computational load and enabling faster real-time performance.
2.8. Image Processing
Building upon the successful detection of bok choy, the crop regions were localized and masked in the original RGB image, while residual green vegetation in the background was algorithmically classified as potential weeds.
For automated weed segmentation in bok choy fields, the color-based pipeline employed an optimized ExG index. The ExG index is predicated on the distinctive spectral reflectance of photosynthetically active vegetation, which exhibits high reflectance in the green channel. Applying ExG after crop masking reduces the scene primarily to soil and residual plants, which mitigates some misclassification risks compared to its application on a full image. Leveraging insights from Morid et al.’s foundational work [35] and Jin et al.’s refinements [33], the ExG index was further enhanced by the following formula to improve segmentation accuracy:
To mitigate sensitivity to ambient lighting conditions, the r, g, and b values in the above equation were normalized R, G, and B channel values, and calculated by:
The ExG index was selected for its computational efficiency and proven effectiveness in segmenting green vegetation from soil backgrounds under normal field lighting. Its application after crop masking provides a cost-effective means to identify residual vegetation as operational weeds. However, the performance of this color index can be compromised under conditions of severe shadow, specular highlights, or when targeting senescent (non-green) vegetation.
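As a hedged illustration of why normalized channels reduce sensitivity to brightness, the sketch below uses the classical Excess Green definition, ExG = 2g − r − b, with chromatic coordinates r = R/(R + G + B), g = G/(R + G + B), and b = B/(R + G + B). This is the textbook form and not necessarily the optimized variant used in this study; the threshold of 0.1 and the pixel values are assumptions chosen for illustration.

```python
def exg(R, G, B):
    """Classical Excess Green index on chromatic (sum-normalized) coordinates."""
    total = R + G + B
    if total == 0:
        return 0.0  # black pixel: no color information
    r, g, b = R / total, G / total, B / total
    return 2 * g - r - b


def vegetation_mask(image, threshold=0.1):
    """image: rows of (R, G, B) tuples. The threshold is an illustrative assumption."""
    return [[exg(*px) > threshold for px in row] for row in image]


leaf = (60, 140, 50)          # made-up green pixel
leaf_in_shade = (30, 70, 25)  # same hue at half the brightness
soil = (120, 100, 80)         # made-up brownish pixel

print(exg(*leaf), exg(*leaf_in_shade), exg(*soil))
mask = vegetation_mask([[leaf, leaf_in_shade, soil]])
print(mask)
```

Because the index depends only on channel ratios, the leaf pixel and its darker copy receive the same ExG value, while the soil pixel falls below the threshold.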
3. Results
3.1. Ablation Experiment
Ablation experiments were conducted to quantitatively evaluate the contributions of the proposed RFD and C2f-WDBB modules. All variants were trained for 100 epochs under identical hyperparameters. The results are summarized in Table 2 and visualized in Figure 5.
Experimental results confirmed that the RFD module and the C2f-WDBB module each independently improve YOLOv10’s performance. The RFD module improved detection accuracy, increasing Precision, Recall, mAP50, and mAP50-95 by 1.3, 0.9, 0.5, and 2.2 percentage points, respectively, as shown in Table 2, while leaving the computational load unchanged at 8.4 GFLOPs, in exchange for a marginal 1.3% increase in parameters. In contrast, the C2f-WDBB module delivered the dual advantages of enhanced recognition accuracy and lower computational cost, achieving a 7.2% reduction in parameters and a reduction in GFLOPs from 8.4 G to 6.5 G. Comparative analysis revealed that the structural complexity of the baseline C2f module, involving multiple convolutional layers and residual connections, resulted in a higher parameter count. The proposed C2f-WDBB module instead employs architectural refinements such as parameter sharing and re-parameterization to achieve greater parameter efficiency. Together, these results show how specialized modifications can simultaneously advance accuracy and efficiency in object detection systems, with the RFD module improving feature extraction quality and the C2f-WDBB module reducing computational cost.
Introducing the RFD module improves detection accuracy while maintaining identical computational complexity (GFLOPs) compared to the baseline, which indicates that the module repurposes the existing computational budget for more effective feature extraction. The accuracy gain comes at the cost of a marginal increase in model parameters, a favorable trade-off for applications where accuracy is prioritized and the parameter overhead is acceptable. The integration of both modules yielded the best overall balance, increasing mAP50 by 1.1 percentage points, reducing GFLOPs by 1.1 G, and decreasing the total number of parameters by 38.5% compared to the baseline YOLOv10.
Comparative analysis of Table 2 (rows 2–3) indicated distinct optimization characteristics: the RFD module exhibited marginally superior accuracy enhancement, while the C2f-WDBB module demonstrated more substantial gains in model compactness (parameter reduction). Their integration yielded balanced improvements, simultaneously increasing mAP50 by 1.1 percentage points and mAP50-95 by 0.4 percentage points while significantly reducing Parameters and GFLOPs. The result suggests that the RFD and C2f-WDBB modules play complementary roles, enhancing feature representation and improving structural efficiency, respectively, and that their combined use can lead to a model that is both more accurate and more efficient than the baseline.
The achieved reduction in parameters and GFLOPs for the final RCW-YOLOv10 model represents a necessary condition and a strong theoretical indicator for improved operational efficiency, which is critical for real-time applications. However, definitive confirmation of real-time viability on specific hardware requires direct measurement of inference latency, which is beyond the scope of this architectural study but is a logical and necessary focus for subsequent deployment-oriented work.
Figure 5 illustrates the trend in mIoU of the predicted versus ground truth bounding boxes across the ablation variants. While mAP is the primary detection metric, mIoU provides a direct, threshold-agnostic measure of average localization quality, complementing the mAP analysis by showing the pure geometric overlap improvement.
Comparative evaluation in Table 3 delineates a clear efficiency–accuracy profile for RCW-YOLOv10. While its mAP50 of 98.0% is highly competitive, it does not lead all accuracy metrics, particularly the stringent mAP50-95, when compared to more complex detectors such as DETR and Faster R-CNN. The primary advantage of RCW-YOLOv10 is its exceptional computational and parametric efficiency, requiring up to 84.7% fewer parameters and 76.5% fewer GFLOPs than these benchmarks. The substantial reduction in GFLOPs is a key prerequisite and a strong indicator for the potential of high-throughput, real-time processing, although definitive confirmation requires latency measurement on deployment hardware, a logical focus for future work.
3.2. Detection Results
Figure 6 visually validates the system’s performance under typical field conditions. The left column shows a challenging, dense planting scenario in which bok choy and weeds are clustered together, and the right column shows a case in which bok choy is well separated from weeds. The RCW-YOLOv10 model achieved a consistently high Precision of 95% across diverse field conditions, including variable lighting and occlusion scenarios. This reliable crop detection enables precise weed segmentation, with Precision improvements over conventional color-thresholding methods.
3.3. Weed Mapping
After bok choy detection, crop regions were masked from the original images to isolate the remaining vegetation. These residual green areas were segmented as weeds using an optimized ExG index, followed by morphological filtering to reduce noise and refine the weed mask. Based on the segmentation results, a weed distribution map was generated, as shown in the second row of Figure 6, which enables localization of weed regions in image coordinates.
For operational planning, each image was divided into a uniform 6 × 8 grid, where the cell dimensions (224 × 224 pixels) were designed to correspond to a physical area of approximately 60 mm × 60 mm on the ground, following the spatial calibration. Grid cells containing weed centroids were designated as candidate treatment zones, as illustrated in the third row of Figure 6.
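The stated calibration implies a simple scale between pixels and ground distance, which can be sketched as below. The sketch assumes a uniform scale across the image, an origin at the top-left image corner, and the approximate 60 mm cell footprint given above; the function names are hypothetical, and lens distortion and camera tilt are ignored.

```python
CELL_PX = 224    # grid-cell side in pixels (from the text)
CELL_MM = 60.0   # approximate ground footprint of one cell side (from the text)
MM_PER_PX = CELL_MM / CELL_PX


def pixel_to_ground_mm(cx, cy):
    """Convert an image coordinate to millimetres from the top-left corner."""
    return cx * MM_PER_PX, cy * MM_PER_PX


def cell_center_mm(row, col):
    """Ground position (x, y) of the centre of grid cell (row, col)."""
    return (col + 0.5) * CELL_MM, (row + 0.5) * CELL_MM


print(MM_PER_PX)
print(pixel_to_ground_mm(1792, 1344))
print(cell_center_mm(1, 2))
```

Under these assumptions each pixel covers roughly 0.27 mm, and a full 1792 × 1344 image spans about 480 mm × 360 mm on the ground.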
4. Discussion
While deep convolutional neural networks (DCNNs) have achieved reliable accuracy, their conventional implementation requires exhaustively annotated datasets, which creates a significant data annotation bottleneck. This study addresses the bottleneck by proposing an indirect strategy that circumvents the need for weed-species labels entirely. Instead of recognizing weeds, the framework first localizes bok choy and then segments residual green vegetation as operational weeds. The primary contribution is thus this data-efficient paradigm shift. The method’s effectiveness is contingent on two factors: the Recall of the crop detector and the performance of the ExG-based segmentation under field conditions. While the ExG index is efficient for green vegetation, its limitation in detecting senescent (non-green) weeds explicitly defines the operational boundary of the current framework. Therefore, the method’s utility lies in its simplified data requirement and functional output for targeted intervention, as validated in this study, with performance inherently tied to the reliability of its sequential steps.
The primary contribution of this approach is its ability to generate useful weed segmentation maps without weed-labeled data, circumventing the data bottleneck rather than demonstrating invariant performance across all unseen weed species. The quality and robustness of the final weed map are inherently dependent on two factors: the Precision and Recall of the initial bok choy detector and the effectiveness of the ExG-based segmentation and subsequent morphological filtering on the specific field conditions encountered. While the ExG index is generally effective for segmenting green vegetation, its performance can vary with plant pigmentation, lighting, and the presence of senescent (non-green) weeds. Therefore, the method’s practical utility lies in its simplified data requirement and functional output for targeted intervention, with the understanding that its performance is contingent on the reliability of its sequential steps as validated in this study on bok choy fields.
The YOLO series performs object detection through its single-stage regression architecture and multi-scale feature fusion, yet faces inherent deficiencies: grid-based detection restricts small-target sensitivity to below 30% AP for objects smaller than 32 pixels, while computational complexity challenges edge deployment [36]. Although YOLOv10 improves efficiency, its standard down-sampling may still contribute to information loss for small objects. The RCW-YOLOv10 model was designed to address these potential shortcomings through two backbone innovations: (1) RFD modules enhance multi-scale feature extraction by replacing standard down-sampling layers, a design intended to better preserve features of small, sparse targets; and (2) C2f-WDBBs optimize cross-level feature fusion efficiency. Given that the primary detection targets in this study (seedling-stage bok choy) are characteristic small objects, the increase in overall mAP50 supports the effectiveness of these architectural improvements for small-target recognition. The model achieves a superior efficiency–accuracy trade-off, as evidenced by the higher mAP50 with a 38.5% reduction in parameters and maintained computational load (GFLOPs), making it suitable for bok choy detection in agricultural settings.
Most existing weed detection studies have primarily concentrated on evaluating model performance in terms of accuracy, such as Precision, Recall, or mAP, with the challenges of real-world field-to-action mapping often receiving less emphasis. This study advances beyond pure detection metrics by translating results into an operationally structured weed distribution map. The fixed grid system provides a spatial framework that can serve as the foundational data layer for potential downstream functions, such as guiding targeted intervention or informing path planning. It is crucial to distinguish this computational proof-of-concept from a validated field system. The conversion of grid coordinates to real-world action, real-time latency validation, and agronomic testing for variable-rate application are critical next steps. This work establishes the essential perceptual and mapping foundation upon which such future integrated systems can be built.
The resulting maps demonstrate the potential to support several operational functions in a future integrated system: (1) localization of weed clusters with a grid-based coordinate system, which could, with accurate calibration, support targeted intervention; (2) a modular spatial representation that could inform robotic path-planning algorithms; and (3) spatial zoning of weed presence at the image level, which forms a foundational data layer that could, in principle, support variable-rate decisions based on weed density. Collectively, these components establish a scalable computational foundation for translating perception into potential action within automated weeding systems. The framework, by generating spatially explicit weed maps, holds promise for enabling more precise, crop-specific interventions in future precision agriculture applications that integrate reliable actuation and control systems.
It is important to note that while the reported reductions in GFLOPs and parameters indicate improved theoretical efficiency, the actual inference latency on specific hardware for real-time field deployment would require further profiling and optimization, which is a key direction for future applied work.
5. Conclusions
This study proposed RCW-YOLOv10, an enhanced YOLOv10 variant optimized for seedling-stage bok choy detection through two architectural innovations: (1) an RFD module designed to improve multi-scale feature extraction, and (2) C2f-WDBB to enhance feature fusion efficiency. The improved model achieved a 1.1 percentage-point increase in mAP50 and a 38.5% reduction in parameters, while maintaining GFLOPs, compared to the baseline YOLOv10. Furthermore, the indirect detection strategy, which first localizes bok choy and then segments residual green vegetation via an optimized ExG index, eliminated the need for species-specific weed annotations, addressing a key data bottleneck. The framework successfully generated operationally structured weed distribution maps, which provide a spatially discrete representation of weeds. Collectively, this work delivers a practical and extensible computational framework that establishes a critical data foundation, comprising an efficient detector, an annotation-light strategy, and a structured weed map, for integrating visual perception into future intelligent robotic weeding systems.