Two-Stream Dense Feature Fusion Network Based on RGB-D Data for the Real-Time Prediction of Weed Aboveground Fresh Weight in a Field Environment

Abstract: The aboveground fresh weight of weeds is an important indicator that reflects their biomass and physiological activity and directly affects the criteria for determining the amount of herbicides to apply. In precision agriculture, the development of models that can accurately locate weeds and predict their fresh weight can provide visual support for accurate, variable herbicide application in real time. In this work, we develop a two-stream dense feature fusion convolutional network model based on RGB-D data for the real-time prediction of the fresh weight of weeds. A data collection method is developed for the compilation and production of RGB-D data sets. The acquired images undergo data enhancement, and a depth transformation data enhancement method suitable for depth data is proposed. The main idea behind the approach in this study is to use the YOLO-V4 model to locate weeds and use the two-stream dense feature fusion network to predict their aboveground fresh weight. In the two-stream dense feature fusion network, DenseNet and NiN methods are used to construct a Dense-NiN-Block structure for deep feature extraction and fusion. The Dense-NiN-Block module was embedded in five convolutional neural networks for comparison, and the best results were achieved with DenseNet201. The test results show that the predictive ability of the convolutional network using RGB-D as the input is better than that of the network using RGB as the input without the Dense-NiN-Block module. The mAP of the proposed network is 75.34% (IoU value of 0.5), the IoU is 86.36%, the detection speed of the fastest model with an NVIDIA RTX 2080Ti graphics card is 17.8 fps, and the average relative error is approximately 4%. The model proposed in this paper can provide visual technical support for precise, variable herbicide application. The model can also provide a reference method for the non-destructive prediction of crop fresh weight in the field and can contribute to crop breeding and genetic improvement.


Introduction
In the process of weed management, uniform herbicide spraying is currently the most commonly applied weeding method [1]. However, the large-scale use of herbicides has led to the pollution of the natural environment, increased weed resistance, hidden dangers to food security and biodiversity, and many other agricultural and ecological problems [2,3]. This has led to more focused research on precision weed management strategies. In precision weed management, the most commonly used method is to determine the location of weeds through computer vision technology and to apply herbicides to individual weeds [4,5]. However, from the perspective of plant physiology, the dosage of herbicides is closely related to the type of weed and its physiological parameters [6][7][8]. Applying a uniform herbicide dosage does not maximize the use of herbicides, and there is therefore still room for optimization. Herbicides in weeds directly act on the cells of the weeds, affecting cell metabolism and transport functions and eventually killing the weeds [9,10]. The size of weed cells and tissues directly determines the herbicide dosage [6,11]. The aboveground fresh weight is an index that best reflects the size and cell content of plants [12,13], and it is suitable for use as a quantitative index to provide a basis for determining herbicide dosages for real-time variable herbicide spraying. Therefore, the development of a vision system that can detect weeds in real time in complex farmland environments and obtain fresh weight data can change the process of determining precise application doses and has important guiding significance for precise weed management.
The use of visual technology is a rapid and effective method for evaluating the fresh weight of plants. Jiang et al. [14] developed a lettuce weight monitoring system in a plant factory that segmented RGB images and used the number of pixels and the plant weight data to establish a regression equation. Arzani et al. [15] established a regression relationship between fruit diameter and fresh weight. Reyes et al. [16] used Mask R-CNN to segment RGB images to obtain plant characteristics and establish a regression equation between fresh weight and characteristics to obtain the fresh weight of lettuce. The experiment was carried out on a hydroponic growth bed. Mortensen et al. [17] performed 3D point cloud segmentation and obtained the surface area parameters of lettuce for fresh weight prediction. Lee et al. [18] used the 3D point cloud obtained by Kinect for 3D printing and correlated the weight of the cabbage with the amount of material consumed by the 3D printer. The main method used in the current research is to first extract the plant to be predicted from the background, extract the characteristics of the plant and then establish an association with the fresh weight. However, farmland scenes are complex and changeable; the soil background is uneven, the light fluctuates, the weeds in the field overlap each other, and the types, spatial shapes, and growth positions of the weeds differ. It is therefore difficult to extract weeds from complex backgrounds [19,20]. At the same time, the phenotypic information for each plant is extracted as a predictive factor (e.g., measured value [21], leaf area index [22], pixel number [23]) to establish a single linear regression relationship with the fresh weight of the plant; this single-factor approach does not include sufficient information. The fresh weight of weeds is determined not by a single characteristic parameter but by a combination of multiple characteristics. 
Therefore, it is still difficult to correlate the multidimensional characteristics of weeds with their fresh weight.
Current methods for extracting weeds from the background follow two main directions: spectral features and computer vision techniques. Weeds can be distinguished effectively if there is a significant difference in spectral reflectance between them [24][25][26][27]. However, spectral cameras are often expensive and demanding with respect to illumination, and it is difficult to distinguish between weeds with similar spectral features. On the other hand, weed identification using visible light images is mainly based on color, shape, or texture features [28][29][30][31][32]. Weed detection based on color features uses different color thresholds for weeds and crops to differentiate them. However, when weeds and crops are similarly colored, it is difficult to distinguish them even with color space conversion [33]; this is especially true for large fields with a relatively large number of weed species, where it is difficult to identify each weed species at a granular level. Identification based on shape and texture is also difficult when leaves overlap and weed shapes are similar, because shape feature templates are susceptible to interference [34][35][36]. The similarity among weed species and the similarity between weeds and crops make it difficult to perform multi-species weed detection using single-feature computer vision methods. In this context, deep learning techniques have performed well in image target detection and recognition [37][38][39][40]. CNNs can automatically acquire multiple features from visible images that are effective for target recognition, are robust for multi-species target detection in complex environments, and have been applied to weed recognition research [41][42][43][44]. The use of CNN technology for weed identification is therefore promising.
Weeds are polymorphic, and different types of weeds exhibit different spatial scale information; the same weed may even exist at different spatial scales at different life stages. The use of 2D plane information obtained as RGB images for estimation has limitations, and it is difficult to use this information to accurately describe the spatial stereo information of weeds. As 3D point cloud technology has developed, it has begun to be applied for plant spatial detection. Zhou et al. [45] used 3D point cloud technology to segment soybean plants. Li et al. [46] developed a low-cost 3D plant morphological characterization system. Chaivivatrakul et al. [47] used 3D reconstruction to characterize the morphology of corn plants. Unlike RGB information, 3D point cloud information can provide spatial scale information [48], and it is obviously more advantageous for describing the spatial structure of weeds. 2D or 3D information essentially obtains phenotypic parameters as a single predictor variable for linear regression. Sapkota et al. [49] used canopy cover obtained from UAV imagery to build a regression model with ryegrass biomass. However the single linear model is weakly expressive and may ignore other potential information in the imagery that has an impact on above ground biomass. The convolutional neural network model has unique advantages for addressing nonlinear relationships. Such models can recognize the complex and nonlinear relationship between the input and output of the modelling process [50,51] and automatically learn implicit characteristic information to directly perform nonlinear regression predictions of fresh weight. At present, it is relatively rare to use convolutional neural networks to directly associate 3D information from weeds with their fresh weight.
In this work, to accurately locate weeds and to predict the fresh weight of weeds with different shapes and positions against a complex farmland background, a combination of 3D point cloud and deep learning techniques is explored.
The contributions of this article are as follows:
• A method of data collection and preprocessing for constructing a fresh weight data set for different kinds of weeds is proposed.
• A YOLO-V4 model and a two-stream dense feature fusion network are established for weed detection and fresh weight estimation.
• The proposed method is tested and analyzed.

Research Area and Objects
To ensure the practical relevance of the study, the study area was a pristine arable field at Xiangfang Farm, Harbin, China (126°43′34.31″ E, 45°44′29.98″ N). Maize was sown on 4 May 2020 using flat planting. Herbicides were applied in two main stages: the first stage was a pre-sowing closure treatment, and the second stage was a herbicide application at the stalk stage. Precision target application techniques are mainly geared towards the second stage. Accordingly, a closure treatment with acetochlor herbicide was applied after sowing. After the closure treatment, we selected three weeds with large population sizes in the field for our study. Sonchus arvensis is a perennial herb of the genus Sonchus in the Compositae family. Solanum nigrum is an annual herb in the Solanaceae family. Abutilon theophrasti Medicus is an annual subshrub herb in the Malvaceae family. While collecting these three kinds of weeds, we did not treat any other weeds, in order to maintain the natural state of the farmland. As depth data were used in this study and weed height was an important factor, three weeds with different growing heights were selected. Sonchus arvensis grew at relatively low heights, Abutilon theophrasti Medicus at higher heights, and Solanum nigrum in between the two.

Platform and Equipment
The aim of this study was to provide visual support for a precise variable-rate target spraying system. Therefore, an acquisition platform was used for data acquisition to simulate the field application process. The platform is driven by two wheels with adjustable travel speed, and differential steering is used to make steering in the field easier [52]. The wheel spacing is adjustable so that it can be matched to the row spacing. Two Kinect v2 sensors are mounted under the platform at a height of 0.8 m above the ground. The centers of the two Kinect v2 sensors are 0.7 m apart.
Remote Sens. 2021, 13, 2288
The Kinect v2 sensor is equipped with a 1920 × 1080 resolution color camera and a 512 × 424 depth sensor. The field of view of the camera is 70° × 60°, and the detection range is 0.5 to 4.8 m. The distance between an object and the camera plane is determined from the reflection time of a projected infrared pulse (time of flight, ToF). Because infrared pulses are disturbed by natural light in an outdoor environment, sunlight interferes with the operation of the Kinect v2 and reduces the quality of the depth data. We set up a light shield to weaken the effects of glare while still admitting enough light for the RGB image to be captured clearly, so that the depth camera of the Kinect v2 works consistently. When selecting the installation height of the equipment, it is necessary not only to ensure the clear imaging of a single weed but also to stay within the detection range of the Kinect v2 sensor. Chemical weeding is usually performed in the corn field at the 3- to 5-leaf stage. At this time, the height of the corn and weeds is generally not more than 0.3 m, so 0.8 m was selected as the installation height of the Kinect v2 sensor in this study.
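The reasoning behind the installation height can be checked with a short calculation. The following is a minimal sketch, not part of the original method: it assumes a simple pinhole model over flat ground, and the function names are hypothetical. Only the numbers (0.8 m mounting height, 0.3 m maximum plant height, 0.5 to 4.8 m detection range, 70° × 60° field of view) come from the text.

```python
import math


def top_of_canopy_distance(mount_height_m, max_plant_height_m):
    """Distance from the sensor to the tallest expected plant top."""
    return mount_height_m - max_plant_height_m


def within_detection_range(distance_m, min_range_m=0.5, max_range_m=4.8, tol=1e-9):
    """True if a distance lies inside the stated Kinect v2 detection range."""
    return (min_range_m - tol) <= distance_m <= (max_range_m + tol)


def ground_footprint(mount_height_m, fov_deg):
    """Length of flat ground covered by one field-of-view angle.

    Assumption: pinhole camera looking straight down at flat ground.
    """
    return 2.0 * mount_height_m * math.tan(math.radians(fov_deg) / 2.0)


if __name__ == "__main__":
    canopy = top_of_canopy_distance(0.8, 0.3)
    print("distance to plant tops (m):", round(canopy, 3))
    print("plant tops in range:", within_detection_range(canopy))
    print("ground in range:", within_detection_range(0.8))
    print("footprint (m):",
          round(ground_footprint(0.8, 70.0), 3),
          round(ground_footprint(0.8, 60.0), 3))
```

With the values given in the text, plant tops at 0.3 m lie about 0.5 m from the sensor, which coincides with the lower bound of the stated detection range, and under the pinhole assumption the 70° × 60° field of view covers roughly 1.12 m × 0.92 m of ground.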
The weighing equipment is an analytical balance produced by Shanghai Hochoice Company (China) that is accurate to the milligram.

Collection Method
To establish a one-to-one relationship between a single weed in an image and its fresh weight, the captured images must fully conform to the working state of the weeding robot in the natural environment. The camera height and the light fluctuations caused by the platform also need to be considered. Therefore, static single-frame shooting is not suitable, and the walking state of the platform must be simulated on site for dynamic acquisition. In this study, an efficient method for collecting weed fresh weight data was developed. Figure 1 shows the collection process, which we summarize in four steps.
Step A: Before shooting, the staff first determine the camera's field of view and mark it on the ground with lines. Two lines are established: one at the edge of the camera's field of view and one at the position 400 pixels from that edge. The area between the two lines is called the label establishment area. After a weed is found, it is associated with a label, and the weed type and serial number are recorded on the label, which is placed in the label establishment area. If several weeds are in the same row, their labels are ordered from farthest to closest to the camera. Weeds that are on a line or inside the label establishment area are not recorded, as shown by the red cross in the picture. After the marking is completed, the lines are removed to avoid affecting the subsequent shooting; from this point on, the label establishment area can be identified from the pixel coordinates of the captured image. Weeds on the lines or in the label area need not be marked because only the middle area is eventually cropped and used to build the data set. This data collection method does not cause human interference with the content of the middle area, which ensures that the constructed data set conforms to the natural state. This collection method is also more effective than other methods.
Step B: The collection platform contains two Kinect v2 devices that can collect data from two rows at the same time. The movement speed of the platform is 0.3 meters per second, and the Kinect v2 shooting speed is set to 30 fps. The platform moves straight along the trajectory of the established line. The weeds and tags are photographed at the same time to obtain the RGB-D information for the weeds.
Step C: After the platform passes, the staff harvest the aboveground parts of the weeds destructively, weigh them on the electronic balance, and record the weight on the label. The robot stops after travelling 60 meters and waits for the collector to finish before continuing. This avoids, as far as possible, any increase in the aboveground fresh weight of the weeds over time.
Step D: The MapColorFrameToDepthSpace function in the Kinect for Windows SDK 2.0 is used to match the depth data and image data, which have different resolutions (512 × 424 and 1920 × 1080, respectively). The depth data are converted to 1920 × 1080 resolution, the same size as the RGB image, to form RGB-D data. The corresponding weeds are cropped from the RGB-D data, yielding a data set that pairs RGB-D data with aboveground fresh weight labels. In this process, highly overlapping frames are eliminated, and blurred images are filtered out. It is worth noting that, because of the different viewing angles and resolutions of the two cameras on the Kinect v2, the depth image lacks data along its edges after it is registered with the color image, as shown in the detailed view of step D in Figure 1. Since only the 1080 × 1080 area in the middle of the image is used, this missing edge does not affect the data.
Table 1 shows the date, number of weeds, and weather information for the data set. To ensure the diversity of the data set, the data were collected over half a month, and the spatial range of data collection covered almost the entire test area (60 × 60 meters). Because herbicides are used mainly in sunny weather, data collection was not carried out on rainy days. The data were collected between 7 am and 10 am BST. A total of 20,274 images were collected, and 1200 weeds of each species had associated aboveground fresh weight data.
Figure 1. The different colored circles represent the respective categories of weeds. Weeds marked with a red cross in the picture are not recorded or collected. In (D), weeds are sorted from left to right according to leaf age and growth size. A detailed enlarged image is shown in Figure A1 in Appendix A.
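The need to eliminate highly overlapping frames follows from the platform speed and the frame rate given in Step B. The sketch below is illustrative only: the along-track ground footprint of 0.92 m is an assumption (it is not stated in the text), and the stride helper is a hypothetical way of thinning frames, not the selection procedure actually used.

```python
import math


def frame_spacing_m(speed_m_per_s, frames_per_s):
    """Ground distance travelled by the platform between two frames."""
    return speed_m_per_s / frames_per_s


def overlap_fraction(spacing_m, footprint_m):
    """Fraction of the along-track footprint shared by two frames."""
    return max(0.0, 1.0 - spacing_m / footprint_m)


def stride_for_overlap(max_overlap, spacing_m, footprint_m):
    """Smallest frame stride whose kept frames overlap by at most max_overlap.

    Hypothetical helper; the small tolerance guards against float noise.
    """
    exact = (1.0 - max_overlap) * footprint_m / spacing_m
    return max(1, math.ceil(exact - 1e-9))


if __name__ == "__main__":
    spacing = frame_spacing_m(0.3, 30.0)       # values from Step B
    footprint = 0.92                           # assumed along-track footprint (m)
    print("spacing between frames (m):", round(spacing, 4))
    print("overlap of consecutive frames:", round(overlap_fraction(spacing, footprint), 4))
    print("stride for <= 50% overlap:", stride_for_overlap(0.5, spacing, footprint))
```

At 0.3 m/s and 30 fps the platform advances about 1 cm per frame, so under the assumed footprint consecutive frames share almost 99% of their content, which is why most frames carry little new information.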

Technical Route
Our approach divides the prediction of weed fresh weight into two tasks. First, a target detection network is trained to determine the location of the weeds in the field; then, a regression model is built to predict the fresh weight of the detected weeds. Figure 2 shows the technical route for predicting the fresh weight of weeds in the natural environment. First, the RGB-D data obtained by the Kinect v2 are divided into three-channel RGB image data and single-channel depth (D) data. Second, the RGB data are input into the trained YOLO-V4 model to obtain the weed classification and the bounding box (the location coordinates of the target weed). Third, the missing values of the single-channel depth data are filled in using the k-nearest neighbor (KNN) method, and the data are then normalized. Then, the D image and RGB image are cropped according to the bounding box coordinates obtained by YOLO-V4. Finally, the RGB-D data for each weed are input into the trained two-stream dense feature fusion model to obtain the aboveground fresh weight of the weed. At this point, target (weed) detection and the estimation of aboveground fresh weight have been completed.
The development language for this research is Python 3.7, and TensorFlow 2.0 is the CNN construction framework. The model was trained and tested on an NVIDIA 2080Ti GPU.
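The technical route can be summarized as a short data-flow sketch. This is a minimal illustration under stated assumptions rather than the authors' implementation: `detect`, `fill_depth`, and `regress` are hypothetical callables standing in for the trained YOLO-V4 model, the KNN filling step, and the two-stream network, and the min-max normalization over the 0.5 to 4.8 m detection range is an assumption, since the text does not specify how the depth data are normalized.

```python
import numpy as np


def split_rgbd(rgbd):
    """Split an (H, W, 4) RGB-D array into RGB (H, W, 3) and depth (H, W)."""
    return rgbd[..., :3], rgbd[..., 3]


def normalize_depth(depth, d_min=0.5, d_max=4.8):
    """Assumed min-max normalization over the sensor's detection range."""
    return np.clip((depth - d_min) / (d_max - d_min), 0.0, 1.0)


def predict_fresh_weights(rgbd, detect, fill_depth, regress):
    """Run detection, depth preprocessing, cropping, and regression.

    detect(rgb)      -> list of (label, (x0, y0, x1, y1)) in pixel coordinates
    fill_depth(d)    -> depth map with missing values filled
    regress(rgb, d)  -> predicted fresh weight for one cropped weed
    All three are placeholders for the trained components.
    """
    rgb, depth = split_rgbd(rgbd)
    detections = detect(rgb)
    depth = normalize_depth(fill_depth(depth))
    results = []
    for label, (x0, y0, x1, y1) in detections:
        rgb_crop = rgb[y0:y1, x0:x1]
        depth_crop = depth[y0:y1, x0:x1]
        results.append((label, (x0, y0, x1, y1), regress(rgb_crop, depth_crop)))
    return results
```

Passing the three stages in as callables keeps the sketch independent of any particular detector or regression model; each stage can be replaced without changing the order of operations described above.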

KNN Missing Value Filling
Under actual operating conditions, the Kinect v2 is disturbed by the environment, and the depth information obtained contains missing values. To weaken the influence of missing values, we use the KNN method to process the missing depth values. The main idea of this approach is to replace each missing value with the average value of the several points closest to it in Euclidean distance. The missing depth values are usually closely related to the nearby spatial information, so the KNN method is well suited to filling them in. Figure 3 shows a comparison of the original and filled-in depth data. This method can cope with a large range of missing values.
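The filling rule described above can be sketched as follows. This is a brute-force illustration, not the implementation used in the study: it assumes that missing depth pixels are encoded as 0, that k = 4 neighbors are used, and that only originally valid pixels serve as neighbors; none of these details are given in the text. For full-resolution frames a spatial index would be needed for speed.

```python
import numpy as np


def knn_fill_depth(depth, k=4, missing_value=0):
    """Replace missing depth pixels with the mean of the k nearest valid pixels.

    Nearness is Euclidean distance in pixel coordinates. Assumptions:
    missing pixels equal `missing_value`, and neighbors are taken only
    from pixels that were valid in the original map.
    """
    depth = np.asarray(depth, dtype=np.float64)
    missing = depth == missing_value
    if not missing.any():
        return depth.copy()

    valid_coords = np.argwhere(~missing)   # (N, 2) row/col of valid pixels
    if valid_coords.shape[0] == 0:
        raise ValueError("depth map contains no valid pixels")
    valid_values = depth[~missing]         # same row-major order as valid_coords
    k_eff = min(k, valid_coords.shape[0])

    filled = depth.copy()
    for r, c in np.argwhere(missing):
        # Squared distances preserve the nearest-neighbor ordering.
        d2 = (valid_coords[:, 0] - r) ** 2 + (valid_coords[:, 1] - c) ** 2
        nearest = np.argpartition(d2, k_eff - 1)[:k_eff]
        filled[r, c] = valid_values[nearest].mean()
    return filled
```

Because every missing pixel is filled from valid measurements only, large holes are filled from their borders, which is consistent with the observation that the method can cope with a large range of missing values.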

YOLO-V4 Weed Detection Model
The goal of this stage is to train a model to obtain the area of interest of weeds in the actual environment and to lay the foundation for the subsequent estimation of the fresh weight of individual weeds.
YOLO is a single-stage target detection algorithm whose detection speed is faster than that of a two-stage network such as Faster R-CNN [53]. YOLO-V4 [54] introduces mosaic data enhancement on the basis of YOLO-V3 [55] and optimizes the backbone network, network training, activation function, and loss function, which makes YOLO-V4 faster and more accurate and gives it the best balance of speed and accuracy among the existing target detection frameworks. The network uses CSPDarknet53 as the feature extractor, the Path Aggregation Network (PANet) to aggregate the features extracted by the backbone, and the YOLO-V3 head for detection.
The main steps for weed detection with YOLO-v4 are as follows.
(1) Data processing. The image acquisition process yielded 20,274 images. Images for detection were selected through visual observation, blurry images were deleted, and a final set of 7000 images was obtained. The 1920 × 1080 RGB-D data were first cropped along the label line established during data collection to 1080 × 1080 and then scaled to 540 × 540. Labeling [56] was then used to annotate the RGB images. In total, 12,116 Solanum nigrum, 12,623 Abutilon theophrasti Medicus, and 7332 Sonchus arvensis instances were tagged in the data set. To distinguish this data set from the one built from individual weed RGB-D samples and fresh weight labels, this data set is referred to as dataset 1, with training set 1 and test set 1; the other is referred to as dataset 2, with training set 2 and test set 2. Dataset 1 was divided into a training set (6300 images) and a test set (700 images) at a ratio of 9:1.

(2) Training parameters. Considering the limitations on server memory, the batch size was set to 8. The learning rate was set to 0.001, the number of classes was set to 3, and the number of iterations was set to 40,000. Figure 4 shows the loss curve during training. The weed detection model learns efficiently, and the training curve converges quickly. As training continues, the slope of the training curve gradually decreases. Finally, when the number of training iterations reaches approximately 35,000, the learning of the model gradually saturates, and the loss fluctuates in the interval of 0~1.
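The data processing in step (1) can be sketched with NumPy. This is an illustration under assumptions: the crop is taken as the horizontally centered 1080 × 1080 window (the text states that the middle area is used), the scaling to 540 × 540 is done by 2 × 2 block averaging (the interpolation method is not stated), and the random split with a fixed seed is a hypothetical choice.

```python
import numpy as np


def center_crop_width(image, out_width=1080):
    """Keep the horizontally centered window of an (H, W, C) frame."""
    width = image.shape[1]
    left = (width - out_width) // 2        # 420 px for a 1920-px-wide frame
    return image[:, left:left + out_width]


def downscale_by_2(image):
    """Halve both spatial dimensions by averaging 2 x 2 blocks (assumed method)."""
    h, w = image.shape[:2]
    img = image[:h - h % 2, :w - w % 2].astype(np.float64)
    return (img[0::2, 0::2] + img[1::2, 0::2]
            + img[0::2, 1::2] + img[1::2, 1::2]) / 4.0


def split_indices(n_items, train_fraction=0.9, seed=0):
    """Shuffle item indices and split them into training and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_train = int(round(n_items * train_fraction))
    return order[:n_train], order[n_train:]


if __name__ == "__main__":
    frame = np.zeros((1080, 1920, 4), dtype=np.float32)   # one RGB-D frame
    small = downscale_by_2(center_crop_width(frame))
    train_idx, test_idx = split_indices(7000)
    print(small.shape, len(train_idx), len(test_idx))
```

Block averaging is applied here to all four channels for simplicity; for the depth channel it would be preferable to fill missing values first so that invalid pixels do not bias the averages.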

Two-Stream Dense Feature Fusion Network
After YOLO-V4 detection, we obtain RGB-D information for the different types of target weeds. To use this information to predict the fresh weight of weeds, a two-stream dense feature fusion network model is proposed in this study. In this model, the Dense-NiN-Block can be transplanted into different backbone networks according to their depth. The approach was validated and tested on the AlexNet [45], VGG19 [46], Xception [47], ResNet101 [48], and DenseNet201 [49] convolutional neural networks, and the best performance was achieved with DenseNet201. Figure 5 shows the overall architecture of the two-stream dense feature fusion network with DenseNet201 as the main component. In the figure, the Dense-NiN module is used as a feature extractor for the depth data. The Dense module acts as a feature extractor for the RGB information and also receives the feature maps extracted by the Dense-NiN-Block for fusion with the RGB features. The fused features are fed into the global average pooling layer for sample space mapping. Finally, the regression layer outputs the fresh weight of the weed.
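The data flow of the two-stream design can be illustrated with a deliberately small NumPy sketch. It is not the proposed network: each stream is reduced to a single 1 × 1 convolution with ReLU, and the parameter names in `params` are hypothetical. It only shows the order of operations described above, that is, separate feature extraction for RGB and depth, channel-wise concatenation, global average pooling, and a linear regression output.

```python
import numpy as np


def pointwise_conv_relu(x, weights, bias):
    """1 x 1 convolution plus ReLU; x is (H, W, C_in), weights is (C_in, C_out)."""
    return np.maximum(x @ weights + bias, 0.0)


def two_stream_forward(rgb, depth, params):
    """Toy forward pass: extract, fuse by concatenation, pool, regress."""
    depth_feat = pointwise_conv_relu(depth[..., None], params["w_d"], params["b_d"])
    rgb_feat = pointwise_conv_relu(rgb, params["w_rgb"], params["b_rgb"])
    fused = np.concatenate([rgb_feat, depth_feat], axis=-1)   # channel-wise fusion
    pooled = fused.mean(axis=(0, 1))                          # global average pooling
    return float(pooled @ params["w_out"] + params["b_out"])  # scalar fresh weight


if __name__ == "__main__":
    params = {
        "w_d": np.ones((1, 2)), "b_d": np.zeros(2),
        "w_rgb": np.ones((3, 4)), "b_rgb": np.zeros(4),
        "w_out": np.ones(6), "b_out": 0.0,
    }
    print(two_stream_forward(np.ones((5, 7, 3)), 0.5 * np.ones((5, 7)), params))
```

Because global average pooling collapses the spatial dimensions, the regression layer in this sketch always receives a vector whose length equals the number of fused channels.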

The main modules of the two-stream dense feature fusion network are as follows.
(1) Dense Module. The Dense module adopts the structure of DenseNet201, which mainly consists of Dense-Block structures. In a convolutional neural network, repeated convolution and downsampling shrink the feature maps, and feature information is lost during transmission; the Dense-Block structure effectively addresses this problem. The DenseNet structure makes more effective use of the feature information and prevents information loss. Figure 6 shows the Dense-Block structure of DenseNet. It connects each layer to every other layer in a feedforward fashion; thus, layer $l$ receives the feature maps of all preceding layers $x_0, x_1, \ldots, x_{l-1}$ as input:
$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$
where $[x_0, x_1, \ldots, x_{l-1}]$ is the concatenation of the feature maps of layers $0, 1, \ldots, l-1$ and $H_l$ is a function used to process the concatenated feature maps. This allows DenseNet to mitigate gradient vanishing, enhance feature propagation, facilitate feature reuse, and greatly reduce the number of parameters.
Figure 6. Dense-Block structure.
(2) Dense-NiN-Module The information that can be expressed by the depth matrix includes the spatial structure of the weed, the distance of the camera from the weed, and the distance of the camera from the ground. All of this information has a potential impact on the estimation of the aboveground fresh weight of the weed. Therefore, more attention should be given to the global expression of depth information. Depth features are more abstract than RGB features. The convolution filter in a CNN is a generalized linear model (GLM) of the underlying data patch, and the abstraction level of the GLM is very low [57]. Replacing the GLM with a more powerful nonlinear function approximator can improve the abstraction ability of the local model. The use of multilayer perceptron (MLP) convolution instead of ordinary convolution may enhance the abstract feature extraction ability and the nonlinearity of the model [58], which would be very suitable for deep feature extraction. The calculation performed by the MLP convolution layer is as follows:

$$f^{1}_{i,j,k_1} = \max\left({w^{1}_{k_1}}^{T} x_{i,j} + b_{k_1},\ 0\right)$$
$$\vdots$$
$$f^{n}_{i,j,k_n} = \max\left({w^{n}_{k_n}}^{T} f^{n-1}_{i,j} + b_{k_n},\ 0\right)$$

where $(i, j)$ is the pixel index in the feature map, $x_{i,j}$ is the input patch centered at location $(i, j)$, $k$ indexes the channels of the feature map, and $n$ is the number of layers in the MLP. A rectified linear unit is used as the activation function in the MLP.

The Dense-NiN-Module is composed of Dense-NiN-Block structures. Figure 7 shows how the Dense-NiN-Block structure sends the acquired depth feature channel maps to the Dense-Block structure. The structure uses the MLP as a deep feature filter and draws on the idea of DenseNet201, using a dense connection structure to enhance the feature extraction ability of the model and reduce the number of parameters. The basic unit of the model is the MLP convolution layer, followed by the deep concatenated feature fusion layer for feature map concatenation. Then, the features are sent to the RGB deep concatenated feature fusion layer, and other features are sent to the subsequent network for high-dimensional feature extraction. The basic unit can be transplanted into an existing classic convolutional neural network as a deep feature extractor. The depth of the module is consistent with the depth of the network.
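The MLP convolution used in the Dense-NiN-Module can be sketched for a single input patch as follows. This is a minimal illustration of the layer-wise rule (linear map, bias, ReLU), not the trained network; the function name `mlpconv` and the example weights are hypothetical.

```python
def mlpconv(patch, layers):
    """Apply an n-layer perceptron to one flattened input patch x_ij.

    layers is a list of (weights, biases) pairs; weights[k] is the weight
    vector of output channel k. Each layer computes max(w_k^T f + b_k, 0).
    """
    f = patch
    for weights, biases in layers:
        f = [max(sum(w * a for w, a in zip(w_k, f)) + b_k, 0.0)
             for w_k, b_k in zip(weights, biases)]
    return f

# Hypothetical two-layer example on a patch with two values
example_layers = [
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]),  # layer 1: two output channels
    ([[1.0, 1.0]], [-1.0]),                   # layer 2: one output channel
]
output = mlpconv([1.0, -2.0], example_layers)
```

Sliding this function over every location (i, j) of the depth map yields the feature maps of one MLP convolution layer; layers after the first act as 1 × 1 convolutions across channels.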
(3) Output Layer Figure 8 shows the output data stream of YOLO-V4. At this point, the data streams for the three weed species have been obtained. The number of data streams and the data size for each weed are not fixed. Convolutional neural networks are constrained by the fully connected layer and cannot cope with inputs of different sizes. The usual remedy is to scale the data to a uniform size. In this study, however, if the weed images were simply rescaled, smaller weeds would appear as large as larger weeds after enlargement, and the change in the size of the shape outline would affect the estimation of the fresh weight of the weeds.

To address this problem, this paper replaces the fully connected layer of DenseNet201 with a global average pooling layer to handle inputs of different sizes. The regression layer uses the mean-squared-error loss as the loss function:

$$\mathrm{MSE} = \frac{1}{R}\sum_{i=1}^{R}\left(t_i - y_i\right)^2$$

where $R$ is the number of responses, $t_i$ is the target output, and $y_i$ is the network's prediction for response $i$.

The main steps for estimating the weed fresh weight based on the two-stream dense feature fusion network are as follows:

(1) Data enhancement. A new data enhancement method suitable for the depth matrix, called depth transformation enhancement, is proposed. The method simulates the fluctuation of the distance between the camera and the ground in the field, as shown in Figure 9. As shown in Figure 9a, when $l$ is negative, the camera is closer to the ground and the target weed appears larger in the image. When $l$ is positive, the camera is further from the ground and the target weed appears smaller in the image. As shown in Figure 9b, the image size and the depth values can be changed according to the fluctuation of the distance in order to enhance the data. When the depth values increase or decrease as a whole, the image is scaled by the corresponding scale factor. The specific formula is as follows:

$$x = \frac{f_x\, w}{d + l}, \qquad y = \frac{f_y\, h}{d + l}$$

where $x$ is the pixel length of the target, $y$ is the pixel width of the target, $d$ is the installation height of the camera (800 mm in this article), and $l$ is the fluctuation value. The fluctuation range selected in this article is an integer within ±50, $f_x$ and $f_y$ are the two focal lengths of the camera, $w$ is the actual length of the target weed, and $h$ is the actual width of the target weed.

The 1200 RGB-D data for each weed species were divided into datasets, of which 900 were used as training set 2 and 300 as test set 2. The distribution of weights was randomized and is shown in Figure A2 in Appendix A.

• Randomly flip vertically or horizontally.
• To make the data more adaptable to light fluctuations, randomly increase or decrease the brightness of the RGB data by 10%.
• Perform random depth transformation.
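The enhancement operations listed above can be sketched on small list-based images. This is only an illustration under stated assumptions: the scale factor d / (d + l) follows from treating the pixel size as inversely proportional to the camera distance, nearest-neighbour resizing is an arbitrary choice, and all function names are hypothetical rather than the authors' implementation.

```python
import random

CAMERA_HEIGHT_MM = 800.0  # installation height d stated in the text

def resize_nearest(image, scale):
    """Nearest-neighbour resize of an image stored as a list of rows."""
    h, w = len(image), len(image[0])
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    return [[image[min(h - 1, int(r / scale))][min(w - 1, int(c / scale))]
             for c in range(new_w)] for r in range(new_h)]

def depth_transform(rgb, depth, l, d=CAMERA_HEIGHT_MM):
    """Simulate moving the camera by l mm: rescale both images, shift depth by l."""
    scale = d / (d + l)  # assumed pinhole relation: size is proportional to 1 / distance
    new_rgb = resize_nearest(rgb, scale)
    new_depth = [[v + l for v in row] for row in resize_nearest(depth, scale)]
    return new_rgb, new_depth

def random_flip(rgb, depth, rng):
    """Flip RGB and depth together, horizontally or vertically."""
    if rng.random() < 0.5:
        return [row[::-1] for row in rgb], [row[::-1] for row in depth]
    return rgb[::-1], depth[::-1]

def random_brightness(rgb, rng, amount=0.10):
    """Scale RGB values by 1 - amount or 1 + amount, clipped to 255."""
    factor = 1.0 + rng.choice([-amount, amount])
    return [[tuple(min(255, round(ch * factor)) for ch in px) for px in row]
            for row in rgb]

rng = random.Random(0)
rgb = [[(100, 100, 100)] * 15 for _ in range(15)]
depth = [[800.0] * 15 for _ in range(15)]
l = rng.randint(-50, 50)  # integer fluctuation within the stated range
rgb_aug, depth_aug = depth_transform(rgb, depth, l)
```

The RGB and depth arrays must always be transformed together so that they stay pixel-aligned; only the brightness change applies to the RGB stream alone.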
(2) Training parameters. All networks are trained on the GPU. Usually, the input images are packed into a single tensor (batch size × channels × h × w) and sent to the GPU. Images of different sizes cannot form a unified tensor, so in this study the batch size is set to 1, and each image is sent to the GPU as a separate tensor for training. The learning rate is set to 0.001, and Adam is used as the optimizer. The number of iterations is set to 10,000.
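The reason the output layer tolerates crops of different sizes, and the loss it is trained with, can be sketched as follows. The feature maps here are hypothetical nested lists (one 2D list per channel); this is a sketch of global average pooling and the mean-squared-error loss, not the authors' training code.

```python
def global_average_pool(feature_maps):
    """Reduce each channel (a 2D list of any height and width) to its mean."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
            for fm in feature_maps]

def mse_loss(targets, predictions):
    """Mean-squared error over R responses."""
    R = len(targets)
    return sum((t - y) ** 2 for t, y in zip(targets, predictions)) / R

# Two hypothetical crops with different spatial sizes but two channels each
small_crop = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 4.0]]]
large_crop = [[[1.0] * 5 for _ in range(3)], [[2.0] * 5 for _ in range(3)]]

pooled_small = global_average_pool(small_crop)
pooled_large = global_average_pool(large_crop)
```

Both pooled vectors have one value per channel regardless of the crop size, so the same regression layer can follow either input. Since the crops still cannot be stacked into one tensor, each one is processed on its own, which corresponds to the batch size of 1 described above.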

Model Evaluation
(1) AP and mAP The average precision (AP) is used to calculate the area of the PR curve within a certain category, and the mean average precision (mAP) is the average of the area of the PR curves of all categories. The larger the values of AP and mAP are, the better the comprehensive performance of the network in detecting weeds.
(2) IoU The intersection over union (IoU) is a standard used to define the accuracy of target object detection. IoU evaluates the performance of the model by calculating the overlap ratio between the predicted bounding box and the true bounding box. The higher the IoU value is, the greater the overlap between the bounding box of the detected weed and the original labelled box. The mIoU is the average IoU of all test results. The formula is as follows:

$$\mathrm{IoU} = \frac{S_{overlap}}{S_{union}}$$

where $S_{overlap}$ is the area of intersection of the predicted bounding box and the true bounding box, and $S_{union}$ is the area of the union of the two bounding boxes.
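As a small illustration of the IoU definition, the following sketch computes it for two axis-aligned boxes. The (x_min, y_min, x_max, y_max) box format is an assumption made for the example.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the intersection rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    s_overlap = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    s_union = area_a + area_b - s_overlap
    return s_overlap / s_union if s_union > 0 else 0.0

def mean_iou(pairs):
    """Average IoU over (predicted_box, true_box) pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)
```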
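To verify the performance of the algorithm, the root mean square error (RMSE) and $R^2$ are used as evaluation indicators:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}$$

where $N$ is the number of data samples, $y_i$ is the measurement at the $i$th sample, $\hat{y}_i$ is the model estimate at the $i$th sample, and $\bar{y}$ is the mean of the measurements.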
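The two regression indicators can be computed directly from paired measurements and estimates, as in this minimal sketch (function names are illustrative).

```python
import math

def rmse(measured, estimated):
    """Root mean square error between measurements y_i and estimates y_hat_i."""
    n = len(measured)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(measured, estimated)) / n)

def r_squared(measured, estimated):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(measured) / len(measured)
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, estimated))
    ss_tot = sum((y - mean_y) ** 2 for y in measured)
    return 1.0 - ss_res / ss_tot
```

An estimator that always returns the mean of the measurements has an R² of 0, and a perfect estimator has an RMSE of 0 and an R² of 1.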

Technical Route Results
The main idea of the technical route proposed in this paper is to use YOLO-V4 to locate the target weeds and then send the obtained weed regions, by category, to the corresponding two-stream dense feature fusion network to predict their aboveground fresh weight. Figure 10 shows the 3D visualization of the aboveground fresh weight detection results for the weeds (a visualization of the results of the two-stream dense feature fusion network on RGB images is given in Figure A3 in Appendix A). The mAP (IoU value of 0.5) of the model proposed in this paper is 75.34%, and the mIoU is 86.36%. When YOLO-V4 is combined with the fastest improved two-stream dense feature fusion network (AlexNet), the prediction speed is 17.8 fps. The average relative error of the fresh weight of the weeds in the test set is approximately 4%. This model can provide visual technical support for precision variable-target platforms.
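The technical route can be sketched as a simple dispatch step: each detection carries a species label, and that label selects the regression network applied to the crop. The detector output format and the stand-in regressors below are hypothetical placeholders, not the trained YOLO-V4 or fusion networks.

```python
def predict_fresh_weights(detections, regressors):
    """Route each detected weed crop to the regressor trained for its species.

    detections: list of (species, box, crop) tuples from the detector.
    regressors: dict mapping species name to a callable crop -> weight in g.
    """
    results = []
    for species, box, crop in detections:
        model = regressors[species]  # select the network by detected class
        results.append({"species": species, "box": box, "weight_g": model(crop)})
    return results

# Hypothetical stand-ins for the three species-specific networks
stub_regressors = {
    "Abutilon theophrasti Medicus": lambda crop: 0.10 * len(crop),
    "Solanum nigrum": lambda crop: 0.20 * len(crop),
    "Sonchus arvensis": lambda crop: 0.30 * len(crop),
}
stub_detections = [
    ("Solanum nigrum", (10, 10, 40, 50), [0] * 10),
    ("Sonchus arvensis", (60, 20, 90, 70), [0] * 20),
]
predictions = predict_fresh_weights(stub_detections, stub_regressors)
```

The only extra cost compared with using a single network for all weeds is the dictionary lookup, since every crop is still processed exactly once.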

Comparison of YOLO-V4 with Other Target Detection Algorithms
To find the most suitable convolutional neural network for weed detection, this study compared the YOLO-V4 model with the SSD [59], YOLO-V5x [60], M2DNet [61], and Faster R-CNN [53] networks. Target detection networks can be divided into two main categories: one-stage target detection networks and two-stage target detection networks. The reason for selecting these four networks for comparison is that YOLO-V4, YOLO-V5x, SSD, and M2DNet are typical representative one-stage networks of different types, and their performance is relatively advanced. The Faster R-CNN network, a typical two-stage network, also exhibits advanced performance. Therefore, this article compares the performance advantages of these four types of networks with regard to the problem of weed detection. Table 2 shows the mAP scores (mAP is obtained at an IoU value of 0.5), mIoU values, and average detection times of the models.

In the above results, the mAP score of YOLO-V4 is 0.7534, which is higher than the scores of the other four models. This indicates that the combined recall and precision performance of YOLO-V4 is better than that of the other four models. The mIoU value of YOLO-V4 is 0.8636, which is also higher than those of the other four models, indicating that YOLO-V4 localizes bounding boxes more accurately. The average detection time of YOLO-V4 is 0.033 s, which is faster than three of the other models; only YOLO-V5x is faster. In test set 1, the minimum pixel size that YOLO-V4 can detect is 14 × 16 for Sonchus arvensis, 8 × 10 for Abutilon theophrasti Medicus, and 7 × 11 for Solanum nigrum. YOLO-V4 is therefore effective for small-target weed detection.

Comparison of Regression Network Results Embedded with the Dense-NiN Module
In describing the model, we mentioned that the Dense-NiN module can be embedded in a typical convolutional neural network. In the embedded VGG19 and AlexNet networks, we add a deep feature fusion layer after each pooling layer to receive the output of the Dense-NiN-Block module. In Inception-V3 and ResNet101, we add the Dense-NiN-Block module before the network convergence layer. The structure of DenseNet201 has been described above. Test set 2 contains 300 samples for each weed species. This study compares the ResNet101, VGG19, Inception-V3, AlexNet, and DenseNet201 networks integrated with the Dense-NiN module to select the model with the best fit.
To compare the effects of weed species on the detection results, three weed species, Abutilon theophrasti Medicus, Solanum nigrum, and Sonchus arvensis, were used as separate training sets to train the convolutional neural network. At the same time, these three weed species were also merged into a single data set to train the model (abbreviated as all). Moreover, to compare RGB-D information and RGB information when using a convolutional network for fresh weight prediction, RGB and RGB-D were both used as inputs for network training. The two-stream dense feature fusion network architecture proposed in this paper used the RGB-D information for training. The RGB images were used directly with the default structures of the five networks; only the output module of the original network needed to be replaced with the output module proposed in this article to perform regression. The RMSEs of the trained models are shown in Figure 11, the R 2 values are shown in Table 3, and the average times (s) are shown in Table 4.

Figure 11. Comparison of weed fresh weight predictions with RMSE values.

The above results show that, in all networks, the accuracy obtained using RGB-D data as the input is higher than that obtained using RGB as the input. This indicates that RGB-D stereo data can indeed provide more information for weed fresh weight evaluation. However, the speed usually decreases when RGB-D data are used. This is because the two-stream dense feature fusion network using RGB-D data introduces a denser convolution structure and increases the number of weights, which causes the speed to drop. In the regression test for the fresh weights of the three weed species, the RMSE values of the two-stream dense feature fusion network (DenseNet201-rgbd) are 0.358 for Abutilon theophrasti Medicus, 0.416 for Solanum nigrum, and 0.424 for Sonchus arvensis. Notably, the RGB-D and RGB results are closer for Sonchus arvensis than for the other two weeds; a specific analysis is provided in Section 4.3.3. The RMSE value of the model trained on all weeds is 0.568, which is higher than those of the three species-specific models. In other words, the three aboveground fresh weight prediction models trained separately for each species have lower RMSE values than the model trained directly on all weeds. Therefore, after applying YOLO-V4, a network trained independently for each weed species can be adaptively selected, and its performance will be better than that of a network trained using all the weeds as the training set. The R 2 of the two-stream dense feature fusion network (DenseNet201-rgbd) is also the highest, with a value of 0.9917 for Abutilon theophrasti Medicus, 0.9921 for Solanum nigrum, and 0.9885 for Sonchus arvensis. This network has a good fitting ability. Selecting the corresponding model according to the weed type output by YOLO-V4 does not affect the speed of the model. Suppose, for example, that there are 10 tensors in the output stream of YOLO-V4. Whether a different model is used for each weed type or the single model trained on all weeds is used, 10 tensors must be processed. The only difference is whether the network is selected according to the weed type, and the resulting speed loss is almost negligible.
At the same time, the higher the accuracy of the detection model is, the slower its speed; if higher accuracy is desired, speed must be sacrificed to some extent. When the density of weeds in the environment is high, the accuracy of the model may be reduced, and a faster model can be selected. It is worth noting that the average detection time across the models is 0.0359 s for Solanum nigrum, 0.0378 s for Sonchus arvensis, and 0.0390 s for Abutilon theophrasti Medicus. We believe that this is due to the image sizes in test set 2. We calculated the average image size of the three weeds in test set 2: 104 × 108 for Solanum nigrum, 166 × 175 for Abutilon theophrasti Medicus, and 150 × 158 for Sonchus arvensis. The size of the weeds therefore also affects the speed of the network. Reducing the image size uniformly during the training of the two-stream dense feature fusion network, and reducing it by the same proportion during prediction, could help to improve the efficiency of the model.
On the other hand, we used a non-CNN technique to build a regression model with the canopy area of the weed as the independent variable and the aboveground fresh weight of the weed as the dependent variable. Using a polynomial fit, the minimum RMSE values obtained were 3.632 for Abutilon theophrasti Medicus, 3.246 for Solanum nigrum, and 2.033 for Sonchus arvensis. These experiments show that the CNN technique is indeed better than single-factor regression. In a real field environment, the ground is relatively uneven. For example, two identical weeds, one growing at a higher position and the other at a lower position, will have different RGB images even if the height of the camera is 800 mm. If aboveground fresh weight regression is performed using the canopy pixel area, the weed growing in the higher position has a larger canopy pixel area and the weed growing in the lower position has a smaller one. This leads to errors, and depth data can help to resolve such differences effectively.
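A single-factor baseline of this kind can be sketched as a least-squares polynomial fit of fresh weight against canopy area. The polynomial degree, the solver, and the sample values below are assumptions for illustration only; the paper does not state them, and the numbers are not measured data.

```python
import math

def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns c with y ~ c[0] + c[1] x + c[2] x^2 + ..."""
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum x^(i+j) and b[i] = sum y x^i
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / A[i][i]
    return coeffs

def polyval(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

# Hypothetical canopy areas (cm^2) and fresh weights (g), for illustration only
areas = [5.0, 10.0, 20.0, 35.0, 50.0]
weights = [1.2, 2.9, 7.5, 15.8, 26.0]
coeffs = polyfit(areas, weights, degree=2)
fit_rmse = math.sqrt(sum((w - polyval(coeffs, a)) ** 2
                         for a, w in zip(areas, weights)) / len(areas))
```

Because the canopy area is the only input, such a model cannot tell apart two weeds with equal projected area but different height or different distance from the camera, which is the limitation discussed above.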
In practical agricultural applications, the Chinese national standard (GB-T36007-2018) states that field weeding robots should operate at a speed of around 0.4 to 0.5 m per second. Our robots can operate effectively in real time with RGB-D data while complying with this standard. For our application, higher accuracy is more valuable than higher speed. In the future, robots will inevitably travel at higher speeds, so it may be worth giving up a certain level of accuracy and using RGB images instead. It is worth noting that YOLO-V5x is very fast and, although not as accurate as YOLO-V4, is smaller, making it easier to deploy to edge computing devices. We still need to evaluate the specific performance of YOLO-V4 and YOLO-V5x on edge computing devices such as the Jetson AGX Xavier in future work.
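As a rough worked example of the real-time claim, the distance the robot travels between two consecutive predictions follows from the travel speed and the frame rate. The sketch below uses the 17.8 fps figure of the fastest model and the 0.4 to 0.5 m/s range from the standard; the camera's field of view is not given here, so no conclusion about frame overlap is drawn.

```python
def travel_per_frame(speed_m_per_s, frames_per_s):
    """Distance in metres covered by the robot between two consecutive frames."""
    return speed_m_per_s / frames_per_s

slow = travel_per_frame(0.4, 17.8)  # roughly 0.022 m per frame
fast = travel_per_frame(0.5, 17.8)  # roughly 0.028 m per frame
```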

The Impact of Different Data Enhancement Methods
To verify the influence of the four data augmentation methods described above on the trained model, the control variable method was used: one data augmentation method was removed at a time, and the RMSE values were obtained. The results are shown in Table 5. According to the experimental results, random rotation and random flipping have limited impacts on the model, but excluding either method still reduces the detection accuracy. Removing random rotation increases the average RMSE of the model by 0.052, and removing random flipping increases it by up to 0.050. The device cover acts as a hood but still allows visible light to pass through, so brightness enhancement can help the model adapt to subtle changes in light. Removing the brightness enhancement increases the RMSE by 0.115 relative to the full set of enhancement methods. The depth transformation enhancement can help the model adapt to uneven ground and greatly improves the performance of the detection model: if this method is excluded, the RMSE of the detection model increases by 0.129.

The Two-Stream Dense Feature Fusion Network (DenseNet201) Is Affected by the Growth Period and Weed Species
To compare the responses of the RGB network and the RGB-D network (DenseNet201-rgb and DenseNet201-rgbd) to weeds in different growth periods, we sorted the weeds of each of the three species from small to large according to the weight distribution of the test set. Every fifty adjacent weeds are treated as one stage, which gives six stages (A, B, C, D, E, and F) for the analysis. Figure 12 shows the actual results for the three weed species.
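The stage-wise analysis can be sketched as follows: the test samples are ordered by measured fresh weight, split into consecutive groups of fifty, and an RMSE is computed per group. The function name and the synthetic example data are illustrative only.

```python
import math

def stage_rmse(actual, predicted, stage_size=50):
    """RMSE per growth stage, where stages are consecutive groups of
    stage_size samples after sorting by actual fresh weight."""
    pairs = sorted(zip(actual, predicted))  # order samples by actual weight
    stages = []
    for start in range(0, len(pairs), stage_size):
        chunk = pairs[start:start + stage_size]
        mse = sum((y - p) ** 2 for y, p in chunk) / len(chunk)
        stages.append(math.sqrt(mse))
    return stages

# Synthetic example: 300 samples whose error grows only in the last stage
actual = [float(i) for i in range(300)]
predicted = [a + (1.0 if a >= 250 else 0.0) for a in actual]
labels = ["A", "B", "C", "D", "E", "F"]
per_stage = dict(zip(labels, stage_rmse(actual, predicted)))
```

With 300 test samples per species, this yields the six stages A to F; running it once with the RGB predictions and once with the RGB-D predictions gives the per-stage comparison.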
Comparing the average RMSE value of the RGB data with the average RMSE value of the RGB-D data shows that in stages A and B, the RMSE value for Abutilon theophrasti Medicus increased by 0.113, that for Solanum nigrum decreased by 0.011, and that for Sonchus arvensis decreased by 0.162. The advantage of using RGB-D data is not obvious in these stages. In stages C and D, the RMSE increased by 0.209 for Abutilon theophrasti Medicus, by 0.334 for Solanum nigrum, and by 0.111 for Sonchus arvensis. The RMSE for Abutilon theophrasti Medicus and Solanum nigrum increased significantly, while the increase for Sonchus arvensis was relatively small. In stages E and F, the RMSE increased by 0.650 for Abutilon theophrasti Medicus, by 0.628 for Solanum nigrum, and by 0.249 for Sonchus arvensis. Compared with the first four stages, the RMSE increase for Abutilon theophrasti Medicus and Solanum nigrum was greater, while the increase for Sonchus arvensis was still relatively small. Overall, the RMSE values for Abutilon theophrasti Medicus and Solanum nigrum obtained using RGB images as input gradually increase, and the magnitude of the increase also grows. Although the RMSE value for Sonchus arvensis also exhibits an upward trend, the overall fluctuation is very small. Using RGB-D as the network input, the RMSE values for the weeds in the six stages all fluctuate slightly or even show a downward trend. The results show that in the later stages of weed growth, using RGB-D as the network input provides more stable and accurate results than using RGB as the network input.

In the early weed growth stages, the performances obtained using RGB and RGB-D as inputs are roughly the same. This shows that in the early stage, the regression model is more dependent on the overhead-view area of the plant for regression prediction. At this time, the weeds are very short, so the regression prediction results using RGB and RGB-D are nearly the same. In the subsequent growth stages, as the weeds gradually grow taller, the stems account for a certain percentage of the weight of the weeds, the height of the plants cannot be obtained from the RGB image, and the accuracy of predictions obtained using RGB images begins to decline. The scatter plots of the actual and predicted fresh weights of the weeds show that in the RGB prediction process, at the later stage of growth, the predicted fresh weight value is usually lower than the actual value; because height information is lacking, the fresh weight is underestimated. Therefore, the RGB-D model exhibits better robustness in the subsequent growth stages of weeds. However, across these six stages, the RMSE values of the results obtained using RGB-D and RGB images for Sonchus arvensis did not change substantially. Given the low height of this weed species, its aboveground fresh weight may depend more on its top-view area. In the early and late stages of growth, the difference between the RMSE values of the RGB and RGB-D predictions is not substantial, but RGB-D still provides a better fitting effect.

Model Analysis
(1) Dense connections extract deep features The Dense-NiN-Block module uses a dense connection structure. The dense structure allows each layer to access all preceding feature maps (including those of the transition layers). Our experiment investigates whether the trained network takes advantage of this opportunity. For each convolutional layer ℓ in a block, we calculate the average (absolute) weight assigned to its connections with layer s. Figure 13 shows the heat maps for all four Dense-NiN-Block modules. The average absolute weight serves as a surrogate for the dependence of a convolutional layer on its preceding layers. A red dot at position (ℓ, s) indicates that layer ℓ makes, on average, strong use of the feature maps produced s layers before.
The figure shows that all layers spread their weights over many inputs within the same block. The feature information from the weed depth data obtained in the early stages of the network is actually used by the deeper convolution filters within the same dense block. The weights of the transition layers also spread their weight across all layers within the preceding dense block, indicating information flow from the first to the last layers of the Dense module through few indirections. Therefore, the NiN module in this study effectively uses the Dense connection method to enhance the use of weed depth information.
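The quantity plotted in such a heat map can be sketched for one layer as follows. The example assumes 1 × 1 filters stored as a weight matrix and that the input channels are ordered by source layer; both the layout and the function name are assumptions for illustration.

```python
def connection_strengths(weights, group_sizes):
    """Average absolute weight that one layer assigns to each source layer.

    weights[o][c] is the weight from input channel c to output channel o.
    group_sizes[s] is the number of input channels contributed by source
    layer s, in concatenation order. Each entry is normalized by the number
    of weights in its group.
    """
    strengths = []
    start = 0
    for size in group_sizes:
        total = sum(abs(row[c]) for row in weights
                    for c in range(start, start + size))
        strengths.append(total / (len(weights) * size))
        start += size
    return strengths

# Hypothetical layer with two output channels fed by two source layers
example_weights = [[1.0, -1.0, 2.0, 2.0],
                   [3.0, -3.0, 0.0, 0.0]]
row_of_heat_map = connection_strengths(example_weights, [2, 2])
```

Stacking one such row per layer gives the full map for a block; large values far from the diagonal indicate that a deep layer still relies on feature maps produced early in the block.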
Figure 13. Visualization effect charts. The average absolute filter weights of the convolutional layers in a trained DenseNet201. The color of the pixel at (s, ℓ) encodes the average L1 norm (normalized by the number of input feature maps) of the weights connecting convolutional layer s to ℓ within a dense block. The three columns highlighted by black rectangles correspond to two transition layers and the classification layer. The first row encodes the weights connected to the input layer of the dense block.

(2) Model visualization analysis To explore which information made the greatest contribution to fresh weight prediction as well as the specific impact of depth data, we use Grad-CAM to visualize the network and compare the differences between the RGB-D network and RGB network models (DenseNet201-rgb and DenseNet201-rgbd). Figure 14 shows the visualization results. Areas with a high thermal value represent the greatest utilization of the feature map of the pixel area. As shown in Figure 14, these two networks have learned the channel pixels within the weed area in order to make fresh weight predictions. In the Grad-CAM map output by the two-stream dense feature fusion network, the heat value near the middle of the weed area is higher than that at the edges of the weed area. We believe that this phenomenon occurs because the central area of the plant, as the main growth point of the stem, has obviously different height characteristics than the other plant parts. This leads to a large difference between the depth data in this part and in other parts, and this difference can improve the weed fresh weight prediction function of the model; in contrast, the RGB network does not have such advantages. In addition, our model considers not only the information within the weed outline but also the periphery of the weed area (shown in the red circles in the figure). In the actual environment, the ground cannot be flat. Although the camera is set at a distance of 800 mm from the ground, it cannot actually be stabilized at that distance.
Due to the unevenness of the ground, the camera position fluctuates around 800 mm above the ground. This results in some weeds being detected in low-lying positions, while some weeds are perceived as being relatively tall. For example, for two weeds of the same weight, the RGB image of the weed at the higher position is larger because of the difference in ground position, which makes the RGB network prediction higher. From the perspective of depth data, the depth value of the weed at the lower position is higher, and the depth value of the weed at the higher position is lower. The depth information of the weed outline area does not directly reflect this difference, but the depth data outside the weed outline directly reflect the distance between the camera and the ground. The figure above shows that this distance information is also regarded as an important distinguishing feature by the network. At the same time, the information inside and outside the weed outline together constitutes the height information for the weed. The high thermal response outside the weed outline area indicates that the network has learned this indirect relationship. Therefore, the information outside the weed outline is also used effectively. The RGB network cannot resolve the imaging difference caused by the fluctuation of the distance between the camera and the ground. On the other hand, we explored the results of using only the RGB-D data within the weed contour lines. The RMSE values obtained with the network proposed in this paper were 0.648 for Abutilon theophrasti Medicus, 0.824 for Solanum nigrum, and 0.481 for Sonchus arvensis, all higher than those obtained with the full method used in this paper (0.358, 0.416, and 0.424, respectively). This is further evidence of the importance of the ground-to-camera distance information in the depth images (the area beyond the weed contour line).
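For reference, the Grad-CAM heat map used in this analysis can be sketched from the activations of a convolutional layer and the gradients of the network output with respect to them. The sketch assumes both are already available as nested lists (one 2D list per channel); obtaining them from a trained network is framework-specific and is not shown.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map: ReLU of the gradient-weighted sum of activation maps.

    activations[k] and gradients[k] are 2D lists for channel k. The channel
    weight alpha_k is the spatial mean of the gradient of the predicted
    output with respect to that channel.
    """
    h, w = len(activations[0]), len(activations[0][0])
    heat = [[0.0] * w for _ in range(h)]
    for a_k, g_k in zip(activations, gradients):
        alpha_k = sum(sum(row) for row in g_k) / (h * w)
        for r in range(h):
            for c in range(w):
                heat[r][c] += alpha_k * a_k[r][c]
    return [[max(0.0, v) for v in row] for row in heat]

# Hypothetical two-channel example
acts = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-2.0, -2.0], [-2.0, -2.0]]]
heat_map = grad_cam(acts, grads)
```

For a regression network such as the one used here, the gradients are taken with respect to the predicted fresh weight rather than a class score; regions with high values are those whose features most increase the prediction.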

The Relationship between IoU and Fresh Weight Prediction
In this study, manually trimmed images were used to build the data set for training the network. In use, however, the two-stream dense feature fusion network receives the output of the YOLO-V4 model, and the two inputs differ to some extent. This difference is captured by the IoU value, so we compare the RMSE under different IoU values to show how manual trimming and the YOLO-V4 output affect the accuracy of fresh weight prediction. The results are shown in Table 6. They show that the IoU value has a slight impact on the prediction result. When the IoU value is greater than 50%, the RMSE values for the three weed species obtained with the YOLO-V4 output as the input differ from those obtained with the manually trimmed input by 0.065, 0.034, and 0.060, respectively, which is a small difference. However, as the IoU value decreases, the RMSE gradually increases and the prediction accuracy decreases. These results demonstrate that the IoU value affects the accuracy of the two-stream dense feature fusion network. In this article, the IoU threshold for YOLO-V4 is set to 0.5. Appropriately increasing the IoU threshold can make the network's predictions more accurate.
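
The two quantities compared in this section can be computed as in the sketch below. It assumes axis-aligned boxes given as (x1, y1, x2, y2) and a list of (IoU, predicted weight, measured weight) records; the helper `rmse_above_iou` and this record layout are hypothetical and do not reproduce the grouping used in Table 6.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def rmse(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Root mean square error between predicted and measured weights."""
    if not predicted or len(predicted) != len(measured):
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - m) ** 2 for p, m in zip(predicted, measured))
    return math.sqrt(squared / len(predicted))


def rmse_above_iou(records: List[Tuple[float, float, float]], min_iou: float) -> float:
    """RMSE over the records whose detection IoU is at least min_iou.

    Each record is (iou, predicted_weight, measured_weight).
    """
    kept = [(p, m) for value, p, m in records if value >= min_iou]
    if not kept:
        raise ValueError("no detections at or above the IoU threshold")
    return rmse([p for p, _ in kept], [m for _, m in kept])
```

Evaluating `rmse_above_iou` at several thresholds gives a comparison of the same kind as the one reported here: a higher threshold keeps only the detections that match the manual trimming more closely.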

Predictive Effects for Shaded Weeds
In the early stages of corn cultivation, weeds are small and rarely cover each other. During this period, individual weeds are easy to distinguish. However, as the weeds continue to grow, the degree of overlap between them increases, and it becomes more difficult to distinguish them. YOLO-V4 can identify weeds that have a certain degree of overlap, but misidentifications can still occur. Instances of misrecognition can be classified into three situations:

• When two weeds cover each other, the network treats them as a single individual, as shown by the red bounding box.

• When two weeds cover each other, the network identifies only part of a weed rather than the whole weed, as shown by the purple bounding boxes.

• When two weeds shade each other, one weed is not detected at all, as shown by the black arrow in Figure 15a.

Figure 15a shows a situation in which, because of mutual covering, two weeds are identified as one. Sonchus arvensis relies heavily on its stems to reproduce, producing multiple aboveground parts on the same root that usually overlap considerably and are close together. Therefore, mistakes occur easily during detection. Figure 15b,c show the occlusion of Solanum nigrum and Abutilon theophrasti Medicus. Unlike Sonchus arvensis, Solanum nigrum and Abutilon theophrasti Medicus have distinct individual characteristics, do not share the same root system, and are usually farther apart. Even if there is occlusion, partial recognition can be achieved, but the abovementioned problems still exist. These problems affect the accuracy of the subsequent aboveground fresh weight prediction. The first type of error causes the aboveground fresh weight to be calculated for the merged weed patch as a whole, and the second type of error causes the predicted value to be too small. However, the purpose of this article is to provide visual support for precise adjustments to herbicide application. Except for the small number of errors of the third type, the detection errors observed in this study would have little effect on variable herbicide application. Therefore, the research in this article still has practical significance.
Figure 15. (a-c) show the different conditions in which weeds are shaded. The green boxes represent correctly identified weeds. The red boxes represent the recognition of several weeds as one weed. The purple boxes indicate that the network identified only part of the weeds. The black arrow indicates a weed that was not detected due to occlusion.

Conclusions
In this study, we propose a new concept for the real-time detection of the aboveground fresh weight of weeds to provide visual support for precision variable herbicide spraying, and we develop a new model that can detect weeds and predict their fresh weight in real time in the field. The algorithm combines deep learning technology with 3D data. The strategy is to use the YOLO-V4 target detection network to locate each weed region and then feed the RGB-D data for that region into a two-stream dense feature fusion network, which regresses the fresh weight of the weed. The average relative error of the model is approximately 4%, and the fastest detection speed is 17.8 fps. To construct a data set for training these two networks, we propose a data collection and labelling method. This method can quickly establish the relationship between the weed RGB-D data and the fresh weight data while avoiding interference with the actual operating environment. When predicting the fresh weight of weeds taller than a certain height, more accurate results are achieved using RGB-D information as the model input. The visualization results show that the two-stream dense feature fusion network better handles the imaging differences caused by the uneven land surface, which makes its predictions more accurate.
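
The two-stage strategy summarized above can be outlined as follows. This is a structural sketch only: `detect` and `regress` are hypothetical callables standing in for the trained YOLO-V4 detector and the two-stream regression network, and images are represented as nested lists so that the example is self-contained.

```python
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels, assumed layout
Image = List[List[Any]]          # rows of pixels (RGB tuples or depth values)


def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by box out of an image, clamped to its bounds."""
    x1, y1, x2, y2 = box
    x1, y1 = max(0, x1), max(0, y1)
    return [row[x1:x2] for row in image[y1:y2]]


def predict_fresh_weights(
    rgb: Image,
    depth: Image,
    detect: Callable[[Image], List[Box]],
    regress: Callable[[Image, Image], float],
) -> List[Tuple[Box, float]]:
    """Locate weeds in the RGB image, then regress a weight per weed.

    detect:  stand-in for the detector; returns one box per weed.
    regress: stand-in for the two-stream network; takes the RGB crop and
             the aligned depth crop and returns a fresh weight.
    """
    results = []
    for box in detect(rgb):
        weight = regress(crop(rgb, box), crop(depth, box))
        results.append((box, weight))
    return results
```

Passing the detector and the regressor as arguments keeps the two stages independent, so either one could be replaced or evaluated separately, for example with manually trimmed boxes in place of the detector output.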
In this paper, we have made only a preliminary exploration of fresh weight models for three weed species; the range of species is still limited, and the next step of our work is to add more weed types. Future work will focus on determining the type and fresh weight of weeds in order to determine the appropriate amount of herbicides to apply in real time, optimizing weeding strategies to reduce the use of herbicides, and applying the model to a variable herbicide-application robot. The approaches used to develop this model can also be extended to the prediction of the fresh weight of crop plants, which could provide support for crop breeding and genetic improvement as well as soil health.

Data Availability Statement: The data used in this study were self-collected, and the dataset is still being improved. Thus, the dataset is unavailable at present.

…

Fresh weight dataset: Sa1200, Sn1200, Atm1200

Figure A1. Enlarged supplement to Figure 1D.

Figure A2. Training set2 and test set2 distributions (legend: Test Data, Training_Data).