Research on Rapeseed Seedling Counting Based on an Improved Density Estimation Method

Abstract: The identification of seedling numbers is directly related to the acquisition of seedling information, such as survival rate and emergence rate, and indirectly affects detection efficiency and yield evaluation. Manual counting methods are time-consuming and laborious, and their accuracy is low in complex backgrounds or high-density environments, while traditional target detection methods and their improved variants struggle to achieve good results. Therefore, this paper adopts the density estimation method and improves a crowd density counting network to obtain a rapeseed seedling counting network named BCNet. BCNet uses spatial attention and channel attention modules to enhance and concatenate feature information, improving the expressiveness of the entire feature map. In addition, BCNet uses a 1 × 1 convolutional layer for additional feature extraction and introduces the torch.abs function at the network output. In this study, ablation experiments and seedling prediction were conducted. The results indicate that BCNet exhibits the smallest counting error compared with CSRNet and the Bayesian algorithm, with the MAE and MSE reaching 3.40 and 4.99, respectively, the highest counting accuracy. The ablation experiments and seedling prediction showed that, compared with the other density maps, the density response points corresponding to the characteristics of the seedling regions were more prominent, and the number predicted by the BCNet algorithm was closer to the actual number, verifying the feasibility of the improved method. This work can provide a reference for the identification and counting of rapeseed seedlings.


Introduction
Rape is a typical crop in China [1], and its high-quality production is closely related to the development of the agricultural economy. The seedling stage is an important period in the growth of rape, so seedling number identification is directly linked to obtaining essential information about seedlings, such as the sowing survival rate and emergence rate. This information indirectly impacts detection efficiency and yield assessment [2]. At present, seedling counting mainly relies on manual observation. However, due to technical limitations of the seeding machine or improper manual operation, among other factors, uneven seed density can occur, making manual observation difficult. This method is time-consuming and consumes substantial manpower and material resources [3]. Therefore, it is necessary to adopt appropriate methods to replace manual observation. With the continuous progress of computer vision and the rapid development of agriculture, the relationship between agricultural production and computer vision is becoming increasingly close [4]. In the density estimation method [5], counting is based on learning a linear mapping between target features and the corresponding density map, thereby integrating spatial information into the learning process [6]. Meanwhile, counting accuracy is greatly improved by the powerful feature expression capability of convolutional neural networks [7].
At present, density estimation methods have been applied in parts of the agricultural field with improved performance. Qi Yang et al. [8] proposed an effective cotton counting method based on feature fusion. Comparison tests with the target counting methods MCNN, CSRNet, TasselNet, and MobileCount showed that the MAE and RMSE of the algorithm were 63.46 and 81.33, respectively; compared with the comparison methods, the average MAE and RMSE decreased by 48.8% and 45.3%, respectively. Huang Ziyun et al. [9] utilized a density class classification method to count cotton bolls in the field; they combined the classification information with the features to generate a high-quality density map, which effectively improved the accuracy of counting cotton bolls in the field. Bao Wenxia et al. [10] first equalized and segmented field wheat spike images, then trained a field wheat spike density map estimation model through transfer learning to estimate the number of wheat spikes in the field during the grain-filling period. Lu et al. [11] determined the total number of cornhusks in an image by dividing the image into blocks, calculating local counts from the block density maps, and then merging and normalizing them. However, the density estimation method has not been applied to rapeseed seedlings; the common detection approach for rapeseed seedlings is to use deep learning detection methods. The actual environments in the research mentioned above have low complexity, so the crops can be differentiated, whereas the background of rapeseed seedlings is complex. Additionally, there are high-density regions caused by improper sowing operations, and the resulting occlusion can, to a certain extent, limit the counting ability of the density estimation method.
To sum up, in high-density environments, neither the traditional detection method nor its improved variants yielded satisfactory results in this study. Therefore, the density estimation method was employed, proving more effective for identifying rapeseed seedlings. Moreover, to address the challenge of counting rapeseed seedlings under severe occlusion in high-density regions, spatial attention and channel attention modules were incorporated to enhance and combine feature information from the spatial and channel dimensions, respectively, improving the overall expression capability of the feature map. To extract more detailed features from the concatenated attention features, a 1 × 1 convolution layer was also utilized for additional feature extraction. Finally, to more effectively constrain the distribution of the model parameters, the torch.abs function was applied at the output layer of the network. The network was continuously trained and optimized to develop the rapeseed seedling counting model BCNet. The results showed that the counting error of BCNet was the smallest compared with CSRNet and the Bayesian algorithm, with the MAE and MSE reaching 3.40 and 4.99, respectively, indicating the highest counting accuracy. The density response points corresponding to the features in the seedling regions of the improved density map are more prominent, and the counts predicted by the BCNet algorithm are closer to the actual counts.

Research Process
In this paper, a method based on density estimation is proposed to count rapeseed seedlings. During training, the rapeseed seedling images in the training set are fed into the network for feature extraction to generate a predicted density map. At the same time, the manually labeled seedling-stage annotation data are used as supervision signals, and regression estimation is performed against the expected value of the predicted density map to calculate the loss. As training proceeds, the loss value gradually converges. During testing, the seedling-stage images in the test set are input into the trained model to generate predicted density maps, and the pixel values in each density map are summed to obtain the predicted count. Figure 1 shows the main flow of the algorithm.
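The test-time counting step described above (summing the pixel values of the predicted density map) can be sketched as follows. This is an illustrative NumPy example, not the paper's code; the Gaussian-blob toy map is an assumption for demonstration.

```python
import numpy as np

def count_from_density(density_map):
    """Estimate the seedling count by integrating the predicted density map.

    Each labeled plant contributes a total mass of about 1 to the map,
    so the pixel sum approximates the count.
    """
    return float(np.asarray(density_map, dtype=np.float64).sum())

def gaussian_blob(h, w, cy, cx, sigma=2.0):
    """One simulated plant's contribution: a Gaussian normalized to sum to 1."""
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

# Toy density map containing two simulated plants.
density = gaussian_blob(32, 32, 8, 8) + gaussian_blob(32, 32, 20, 24)
print(count_from_density(density))  # ~2.0
```

Because each annotated plant integrates to roughly 1 in the supervision target, no detection or thresholding step is needed at test time.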

Data Collection and Screening
The rapeseed seedling images were collected from an experimental rapeseed sowing field in Henshan County, Maanshan City, Anhui Province. The sowing method was broadcast seeding, the sowing date was 20 October 2022, and the images were collected on 6 December 2022 using a DJI Royal 2 UAV, as shown in Figure 2. A total of 600 images of the rapeseed seedling stage were retained after careful screening to eliminate blurry and invalid images.

Data Set Production
The rapeseed seedling dataset was labeled in the crowd-counting standard dataset format using the labeling tool Stroller-spotter, which identifies and labels the locations of rapeseed plants. After labeling, each rapeseed seedling image generated a corresponding .mat label file containing the total number of point labels and the coordinates of each point label, representing the actual location of each rapeseed plant. The dataset was divided into training and test sets at a ratio of 3:1, with 450 training images and 150 test images. A schematic diagram of the labeling process is shown in Figure 3.

Information on Data Sets
Table 1 shows information about the rapeseed seedling dataset and the rape plants in all images. The minimum number of samples indicates the smallest number of individual rapeseed plants in any image, and the maximum number of samples indicates the largest number of individual rape plants in any image. The overall distribution of data samples is diverse.

Methods Based on Density Estimation
The CSRNet network [12] is primarily utilized for counting in crowded scenes; it can count accurately and generate high-quality density maps in highly congested scenes. The CSRNet model is mainly divided into a front-end and a back-end network. VGG16 is utilized as the front-end network, and the size of its output is 1/8th of the original input image. Adding further convolutional and pooling layers would shrink the output even more, making it difficult to generate usable density maps. Therefore, in this study, a dilated convolutional neural network was utilized as the back-end network, which expands the receptive field while preserving the resolution to produce high-quality density maps.

VGG16
VGGNet [13] is a deep convolutional neural network built by the Visual Geometry Group (VGG) at the University of Oxford for the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2014 [14]. VGGNet is still a popular model for extracting image features. The VGG16 model stacks small 3 × 3 convolutional kernels [15] multiple times together with 2 × 2 max-pooling layers. The network structure of VGG16 is shown in Figure 4. It consists of 16 weight layers [16]. The overall network is divided into five segments, each comprising multiple 3 × 3 convolutional layers connected in series, with a max-pooling layer following each segment. Finally, there are three fully connected layers and a softmax layer.
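The front-end network used later in this paper keeps only the first three of these pooling stages, which is why its output is 1/8th of the input size. A minimal sketch of that size calculation, assuming stride-2 2 × 2 pooling and input dimensions divisible by the pooling factor:

```python
def frontend_output_size(h, w, n_pool=3):
    """Spatial size after n_pool stride-2 max-pooling layers.

    With n_pool=3 (the VGG16 front end used in this paper), the output
    is 1/8th of the input in each dimension.
    """
    for _ in range(n_pool):
        h, w = h // 2, w // 2
    return h, w

print(frontend_output_size(512, 512))  # (64, 64), i.e. 1/8 of the input
```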

Dilated Convolution
Traditional convolution is limited by the size of the convolution kernel; for instance, a 3 × 3 convolution kernel can only perceive a 3 × 3 input region. To capture a wider range of contextual information, the density estimation model needs a larger receptive field that covers the global information of the object.
Dilated convolution is a method for sampling data on feature maps [17] that can expand the receptive field of the convolution kernel to better capture long-distance dependencies in the image [18]. In contrast, traditional convolution expands the receptive field by reducing the size of the feature map through pooling operations [19], which leads to a loss of spatial information. Rapeseed seedling images contain some small targets, and conventional convolution operations can cause the structural information of these small targets to be lost, affecting the accuracy of rapeseed seedling detection. Therefore, this paper utilizes dilated convolution in the back-end network to produce high-quality density maps. Dilated convolution works by increasing the dilation rate parameter, spreading the convolution kernel to a specific scale and padding the intermediate positions with zeros, thus preserving the structural information of small targets. The effective kernel size of a dilated convolution is calculated as

H = h + (h − 1)(r − 1), W = w + (w − 1)(r − 1), (1)

where r denotes the dilation rate coefficient of the dilated convolution, H and W denote the height and width of the dilated convolution kernel, and h and w denote the height and width of the original convolution kernel. Taking a 3 × 3 convolution kernel as an example, dilated convolution kernels with dilation rate coefficients (r) of 1, 2, and 4 are illustrated in Figure 5.
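The effective kernel size described above (k + (k − 1)(r − 1) for a k × k kernel with dilation rate r), and the zero padding needed to keep the feature map resolution unchanged, can be checked with a small script. This is a sketch; the padding rule assumes a stride-1 convolution.

```python
def dilated_kernel_size(k, r):
    """Effective size of a k x k kernel with dilation rate r: k + (k - 1) * (r - 1)."""
    return k + (k - 1) * (r - 1)

def same_padding(k, r):
    """Zero padding that keeps the spatial size for a stride-1 dilated convolution."""
    return (dilated_kernel_size(k, r) - 1) // 2

for r in (1, 2, 4):  # the dilation rates illustrated in Figure 5
    print(f"r={r}: effective kernel {dilated_kernel_size(3, r)}, padding {same_padding(3, r)}")
```

For r = 1, 2, and 4, a 3 × 3 kernel grows to an effective 3 × 3, 5 × 5, and 9 × 9 receptive field, respectively, without adding parameters.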

The SCAM Attention Mechanism
The density distribution of rapeseed plants shows certain regularities resulting from changes in perspective in seedling-stage scenes. To address this, the SCAM (Spatial Context Attention Module) was utilized to represent large-scale contextual information and capture changes in density. The SCAM attention mechanism combines the SAM spatial and the CAM channel attention mechanisms, as shown in Figure 6. To distinguish foreground and background in the density map more effectively, the feature maps of all channels are weighted and summed, and the original channels are updated to enhance the effectiveness of foreground feature extraction. This design helps reduce the impact of viewpoint changes on density estimation and enhances the model's capacity to adapt to various density distributions.

(1) The Spatial Attention Module (SAM). Figure 7 illustrates the architecture of SAM. The base feature F of size C × H × W output by the backbone passes through three different 1 × 1 convolution layers, and three feature maps, S1, S2, and S3, are obtained through reshape or transpose operations. To generate the spatial attention map, matrix multiplication and a softmax operation are applied to S1 and S2 [20], yielding a spatial attention map Sa of size HW × HW:

Sa(n, m) = exp(S1_m · S2_n) / Σ_{m=1}^{HW} exp(S1_m · S2_n), (2)

where Sa(n, m) represents the influence of the m-th position on the n-th position; the more similar the features of two positions are, the stronger the correlation between them. After obtaining the spatial attention map Sa, matrix multiplication is applied between Sa and S3, the output is reshaped to C × H × W, and the result is scaled by a learnable parameter factor before the final summation with F. The output of SAM is defined in Equation (3):

S_final = β (S3 · Sa) + F, (3)

where β is the learnable parameter; in practice, a convolutional layer with a kernel size of 1 × 1 is utilized to learn β. The final output feature map S_final is thus a weighted sum of the attention map and the original local feature map, containing global contextual features and self-attention information.
(2) The Channel Attention Module (CAM). The detailed structure of CAM is shown in Figure 8. The CAM module uses only one 1 × 1 convolutional layer to process the feature maps obtained from the backbone network, and its second matrix operation multiplies Ca (C × C) with C3 (C × HW). The main operation is similar to that of the spatial attention module. Specifically, the channel attention map Ca of size C × C is defined as shown in Equation (4):

Ca(i, j) = exp(F_i · F_j) / Σ_{i=1}^{C} exp(F_i · F_j), (4)

where Ca(i, j) denotes the effect of the i-th channel on the j-th channel, and F_i is the i-th channel of the feature map reshaped to a vector of length HW. The output C_final of size C × H × W is calculated as

C_final = μ (Ca · C3) + F, (5)

where μ is a learnable parameter; in practical applications, a convolutional layer with a kernel size of 1 × 1 is utilized to learn μ, in the same way as β in SAM.
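The spatial attention computation in SAM can be sketched in NumPy as follows. The three 1 × 1 convolutions are stood in for by random projection matrices and β is a fixed constant, so this is an illustration of the matrix operations only, not the paper's implementation.

```python
import numpy as np

def spatial_attention(F, beta=0.1, seed=0):
    """SAM sketch: F is a (C, H, W) feature map; returns a (C, H, W) output.

    The random W1, W2, W3 stand in for the learned 1x1 conv projections
    that produce S1, S2, S3; beta stands in for the learnable scale.
    """
    C, H, W = F.shape
    rng = np.random.default_rng(seed)
    W1, W2, W3 = (0.1 * rng.standard_normal((C, C)) for _ in range(3))
    X = F.reshape(C, H * W)                      # flatten spatial positions
    S1, S2, S3 = W1 @ X, W2 @ X, W3 @ X          # each (C, HW)
    logits = S1.T @ S2                           # (HW, HW) position affinities
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    Sa = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)  # softmax
    out = (S3 @ Sa).reshape(C, H, W)             # attention-weighted aggregation
    return beta * out + F                        # residual sum with the input
```

CAM follows the same pattern with the affinity matrix computed across channels (C × C) instead of across spatial positions (HW × HW).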

Loss Function
In this study, the model was trained using a Bayesian loss function that takes the point labels in the rapeseed seedling sample data as supervision signals. To adapt to the supervision signal format, an expectation operation is performed on the estimated probability density map: the discrete expectation of the probability density map is matched to the point labels to design the loss function and perform regression estimation of the expectation. This helps to better align the supervision signal and enhance the performance of the model. The detailed process is shown in Figure 9. In the Bayesian loss function, a two-dimensional Gaussian distribution is utilized to approximate the likelihood distribution of the target, modeling the location distribution of each rape plant in the image. Let x_m be the two-dimensional spatial position of a pixel in the density map, z_n the spatial position of the n-th labeled point, and y_n the corresponding label of the n-th labeled point. The likelihood of observing position x_m given the n-th labeled point in a point-labeled image can be expressed by Equation (6):

p(x_m | y_n) = N(x_m; z_n, σ² 1_{2×2}), (6)

where x_m denotes the two-dimensional spatial location of the m-th pixel, m ∈ {1, …, M}, and M is the total number of density-map pixels output by the neural network model. Meanwhile, z_n denotes the spatial location of the n-th labeled point, n ∈ {1, …, N}, and N is the total number of rapeseed plants in the point-labeled image. N(x_m; z_n, σ² 1_{2×2}) denotes a two-dimensional Gaussian distribution evaluated at position x_m with mean z_n and covariance matrix σ² 1_{2×2}, where 1_{2×2} is the second-order identity matrix.
The probability of a rapeseed seedling is closely related to its distance from the center marking point: the closer to the marker, the greater the probability, and the further away, the lower the probability. Equation (6) gives the likelihood for a given location, from which the posterior probability that each pixel of the density map belongs to the n-th seedling can be calculated as shown in Equation (7):

p(y_n | x_m) = p(x_m | y_n) / Σ_{k=1}^{N} p(x_m | y_k). (7)

Based on Equation (7), the expected count E[c_n] for the n-th rapeseed seedling is calculated as

E[c_n] = Σ_{m=1}^{M} p(y_n | x_m) D(x_m), (8)

where D(x_m) is the probability density of the occurrence of rapeseed seedlings at position x_m predicted by the network model.
Equations (6)-(8) give the expected count of the rape seedling y_n on the density map; its ideal expected count value is 1, so the Bayesian loss function l is defined as

l = Σ_{n=1}^{N} |1 − E[c_n]|, (9)

where E[c_n] is the expected count of the n-th rapeseed seedling. However, experiments showed that, while Equation (9) could accurately determine the boundaries of rape seedlings, it could not handle background pixels far from all of the marked seedling positions: such pixels are very likely background, yet the computation still assigns them a high posterior probability for some seedling. Therefore, this study treated the background as a specific target, introduced a dynamic background dummy point, and devised a new loss function l⁺ in the form of Equation (10):

l⁺ = Σ_{n=1}^{N} |1 − E[c_n]| + |0 − E[c_0]|, (10)

where E[c_0] is the expected count of the background, whose ideal value is 0. The prediction process only requires inputting a rape seedling image into the network model; the predicted number of rape seedlings is then obtained by summing the output density estimation map D(x_m), as in Equation (11):

Count = Σ_{m=1}^{M} D(x_m). (11)
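Under the definitions above, the Bayesian loss without the background term can be sketched in NumPy. The isotropic Gaussian likelihood and per-seedling expected counts follow the description above, while the array layout, the value of sigma, and the absolute-difference distance are assumptions of this sketch, not the paper's code.

```python
import numpy as np

def bayesian_count_loss(density, points, sigma=8.0):
    """Sketch of the Bayesian counting loss (no background term).

    density: (H, W) predicted density map D(x_m).
    points:  (N, 2) annotated (row, col) seedling positions z_n.
    """
    H, W = density.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (M, 2)
    d2 = ((pix[None, :, :] - points[:, None, :]) ** 2).sum(-1)      # (N, M)
    lik = np.exp(-d2 / (2.0 * sigma ** 2))          # Gaussian likelihood, cf. Eq. (6)
    post = lik / (lik.sum(axis=0, keepdims=True) + 1e-12)  # posterior, cf. Eq. (7)
    expected = post @ density.ravel()               # expected count per plant, cf. Eq. (8)
    return float(np.abs(1.0 - expected).sum())      # loss, cf. Eq. (9)
```

Note that the supervision acts on expected counts per annotated plant rather than on a pre-rendered "ground-truth" density map, which is the key difference from pixel-wise losses.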

Overall Network Structure
In this paper, the network structure is improved based on the crowd density estimation method, and the improved structure is shown in Figure 10. The overall network mainly includes five components: a front-end network, an expansion network, an attention network, a regression network, and an absolute-value output. The front-end network consists of the first four convolutional layers of VGG16 and the first three pooling layers, which extract the basic features of the input image. The expansion network applies dilated convolution to the base features output by the front-end network to obtain more image information and extract more features of the rapeseed seedlings. The attention network enhances the feature information of rapeseed seedlings in both the channel and spatial dimensions, strengthening the more useful features and reducing density estimation errors. Finally, a 1 × 1 convolutional regression network performs more detailed feature extraction on the enhanced features, and the abs absolute-value output module constrains the distribution of the model parameters to generate the final predicted density maps.

Experimental Environment and Parameter Settings
The experimental environment was a deep learning framework built on Windows 10 with Python 3.9, PyTorch [21], and CUDA [22]; the graphics card was a GTX 1660 Ti. The number of training iterations in each epoch was 1000, the training batch size was set to 1, the learning rate was 1 × 10⁻⁶, and the optimizer was SGD.

Evaluation Indicators
In studies on counting with density estimation methods, the Mean Absolute Error (MAE) and Mean Square Error (MSE) are usually utilized to reflect the accuracy of counting methods and are commonly employed to measure counting performance [23]. MAE is a common regression loss that reflects the distance between the estimated and true values; MSE is the most commonly used regression metric for evaluating the performance of density map estimation [24]. They are defined as follows:

MAE = (1/N) Σ_{i=1}^{N} |Z_i − Ẑ_i|,

MSE = sqrt( (1/N) Σ_{i=1}^{N} (Z_i − Ẑ_i)² ),

where N is the number of test samples, Z_i is the real number of rape seedlings in the i-th image, and Ẑ_i is the number of rape seedlings predicted by the model for the i-th image. The smaller the MAE and MSE, the smaller the counting error and the higher the counting reliability.
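The two metrics can be computed as follows. This is a minimal sketch; note that, following the common convention in density-map counting work, the "MSE" here is the root of the mean squared error.

```python
def mae(y_true, y_pred):
    """Mean absolute counting error over the test set."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

def mse(y_true, y_pred):
    """Root of the mean squared error, as counting papers conventionally define 'MSE'."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5

# Toy example with three test images (counts are illustrative only).
true = [100, 120, 95]
pred = [98, 125, 95]
print(mae(true, pred), mse(true, pred))
```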

Model Training Analysis
To verify the effectiveness of the improved algorithm model, the changes in the training parameters of the improved algorithm during the training process are provided. Figure 11 shows the convergence of related quantities such as the training loss (Loss), mean squared error (MSE), and mean absolute error (MAE). As the number of iterations increases, the model's counting error and loss show a convergence tendency, indicating that the enhanced algorithm model converges effectively.

Ablation Experiments
To verify the effectiveness of the attention module SCAM, the absolute-value output module abs, and the 1 × 1 convolutional regression network module, four sets of comparative ablation experiments were conducted on the rapeseed seedling dataset against a common baseline.
Experiment 1 is denoted in Table 2 as A1: VGG16+Dilated+Bayesian Loss. VGG16 was utilized as the front-end network, dilated convolution served as the expansion network for feature extraction, and supervised training was performed using the probability density map loss function Bayesian Loss. This result was used as the benchmark for verifying the effectiveness of the different modules.
Experiment 2 is denoted in Table 2 as A2: VGG16+Dilated+Bayesian Loss+SCAM. The attention module SCAM was added on the basis of Experiment 1, so that, after the image passed through the front-end network and the expansion network, the global feature information was enhanced by the attention module in the channel and spatial dimensions, respectively, strengthening the more useful features to reduce error estimation due to occlusion. Supervised training was again performed with the Bayesian Loss function.
Experiment 3 is denoted in Table 2 as A3: VGG16+Dilated+Bayesian Loss+SCAM+abs. The abs absolute-value output module was added on the basis of Experiment 2, so that, after the network features were enhanced, the distribution of the model parameters was constrained by the abs output module, yielding better generalization performance and reducing the counting error.
Experiment 4 is denoted in Table 2 as A4: VGG16+Dilated+Bayesian Loss+SCAM+Conv11. A 1 × 1 convolutional regression network module was added to Experiment 2 for further feature extraction at each pixel of the feature-enhanced model.
Experiment 5 is denoted in Table 2 as A5: VGG16+Dilated+Bayesian Loss+SCAM+abs+Conv11. All three proposed modules were added to Experiment 1 to verify the effect of the final improved model. The counting results of the step-by-step experiments are shown in Table 2. Comparing Experiment 1 and Experiment 2 shows that, after using the attention mechanism to enhance the image features in both the channel and spatial dimensions, the MAE of the model's counts decreased by 0.44, and the mean squared error also decreased.
Comparing Experiment 2 and Experiment 3 shows that constraining and optimizing the parameters of the feature-enhanced model with the abs absolute-value output module does reduce the counting error.
Comparing Experiment 2 and Experiment 4 shows that applying a 1 × 1 convolutional regression network to the feature maps after the Concat splicing of the spatial and channel attention modules enables more accurate per-pixel counting and a clear improvement in counting error: this change alone reduces the mean absolute error by 0.16, with a corresponding decrease in the MSE.
Comparing Experiment 5 with the others shows that combining the 1 × 1 convolutional regression network with the abs absolute-value output module also suppresses some of the redundant information left after feature enhancement, reducing the model's counting error to 3.39.
In addition, the results of the step-by-step experiments described above are visualized in Figure 12. Figure 12a is the original image, and Figure 12b-f correspond to Experiments 1-5, respectively. Brighter regions in the rapeseed density maps indicate a higher plant density, and it is evident from the figures that Figure 12f shows more prominent density response points in the seedling regions than the other four density maps.

Comparison with Other Algorithms
To further verify the effectiveness of the proposed algorithm, this paper also selected two other density estimation algorithms, CSRNet and Bayesian, for experimental comparison on rape seedlings; the algorithm based on improved density estimation is named BCNet. The comparison results are shown in Table 3. From Table 3, it can be seen that, compared with the other two algorithms, BCNet has the lowest counting error, with an MAE and MSE of 3.40 and 4.99, respectively, indicating the highest counting accuracy.
To better compare the detection results across the different algorithms, some visualized detection results are given in Figure 13, together with the actual counts, to compare the error between the predicted and actual counts. In the visualization results of the three density estimation algorithms, the brighter a region in the density map, the higher the density of rapeseed seedlings. Because the target regions of rapeseed seedlings have similar colors and irregular shapes, density estimation algorithms may assign a probability value of less than 1 to the density response point of a single occluded plant in a dense region, which can make the total predicted count lower than the actual number. Comparing the disparity between the predicted and actual counts shows that the BCNet algorithm's predictions reflect the actual counts more accurately.

Figure 1. Flow chart of the oilseed rape plant counting algorithm with the improved density estimation method. (a) Training flowchart. (b) Test flowchart.

Figure 2. Data collection. (a) Data collection sites. (b) Acquisition area scenarios. (c) Seedling data collection. (d) Zoomed-in detail.

Figure 6. SCAM structure schematic. SCAM performs feature information enhancement in the channel and spatial dimensions through SAM and CAM, respectively, and generates reconstructed feature maps by Concat stitching of the enhanced features. The SAM and CAM modules are described in detail in the text.

Figure 8. The detailed structure of the channel attention module (CAM).

Figure 9. Detailed procedure of the Bayesian loss function.

Figure 10. Improved network structure of the density estimation algorithm.

Figure 14. Comparison of the real number of rape plants in the test set with the predicted number.

Conclusions
1. The density estimation method was improved with spatial attention and channel attention modules for feature information enhancement and splicing, which improved the representation of the entire feature map, together with a 1 × 1 convolutional layer for further feature extraction and the torch.abs function at the output of the network. The improved model is named BCNet.
2. Ablation experiments and result visualization were performed; the density response points corresponding to the features of the seedling regions were more prominent in the improved density map than in the other four density maps. Compared with the CSRNet and Bayesian algorithms, BCNet has the lowest counting error, with the MAE and MSE reaching 3.40 and 4.99, respectively. BCNet exhibits the highest counting accuracy, as the predicted count closely aligns with the actual count.
3. Under complex backgrounds or high-density conditions, the BCNet algorithm was employed to predict the count of rapeseed seedlings, and the curves of the actual and predicted numbers of rapeseed seedlings were very close to each other, verifying the feasibility of the method in this paper. It can provide a reference for seedling identification and counting methods for rapeseed, and technical support for achieving precise seedling interplanting and replenishment.

Table 1. Information about the rapeseed seedling dataset.

Table 2. Partial data sample display.

Table 3. Comparison of the counting results of different algorithms.