Improved Mask R-CNN Combined with Otsu Preprocessing for Rice Panicle Detection and Segmentation

Rice yield is closely related to the number and proportional area of rice panicles. Currently, rice panicle information is acquired by manual observation, which is inefficient and subjective. To solve this problem, we propose an improved Mask R-CNN combined with Otsu preprocessing for rice panicle detection and segmentation. This method first constructs a dataset of rice images taken in a large field environment, expands the dataset using data augmentation, and then uses LabelMe to label the rice panicles. The optimized Mask R-CNN is used as the rice detection and segmentation model. Actual rice panicle images are preprocessed by the Otsu algorithm and input into the model, which yields accurate rice panicle detection and segmentation results, with the structural similarity and perceptual hash value used as the measurement criteria. The results show that the proposed method has the highest detection and segmentation accuracy for rice panicles among the compared algorithms. When the number and relative proportional area of the rice panicles are further calculated, the average error of the number of rice panicles is 16.73% with a minimum error of 5.39%, and the error of the relative proportional area of rice panicles does not exceed 5%, with a minimum error of 1.97% and an average error of 3.90%. The improved Mask R-CNN combined with Otsu preprocessing for rice panicle detection and segmentation proposed in this paper can operate well in a large field environment, making it highly suitable for rice growth monitoring and yield estimation.


Introduction
Rice is one of the most important food crops worldwide, as more than half of the world's population depends on rice as its main food source. China is the second largest rice-growing country, accounting for more than 30% of the total global rice production, and rice is the second most important crop in China [1]. Rice yield is closely related to the number and proportional area of rice panicles. Traditional monitoring methods mainly rely on manual observation, which is tedious, inefficient and subjective, and cannot readily meet the requirements of real-time, rapid and nondestructive monitoring. With improvements in agricultural informatization and the development of computer vision, traditional production management methods are gradually shifting to an artificial intelligence basis, and the detection and assessment of rice pests and diseases, as well as fertility, can be achieved with the help of images taken by field cameras in real time, allowing timely prevention and management [2][3][4][5][6][7]. Accurate rice panicle segmentation is a key step in achieving intelligent rice monitoring and yield assessment. However, it is very difficult to segment rice panicles due to the complex natural environment in rice fields, for example, occlusion between panicles and leaves, colors that blend easily, and uneven changes in light.
Researchers have applied image detection technology to detecting and segmenting field crops such as rice panicles and wheat ears, effectively improving detection efficiency and reducing costs. By leveraging color and texture features at the pixel level, Zhou, C. selected red, green and blue images acquired by manned ground vehicles according to the light intensity for superpixel segmentation, and used the relevant features to count the number of wheat ears, but the accuracy varied greatly with the number of features and illumination intensity [8]. Jose, A. proposed an automatic counting algorithm for measuring the density of ears of wheat using RGB images acquired by handheld cameras; however, this method was only suitable for counting unripe wheat [9]. Lu fused a simple linear iterative clustering (SLIC) method to generate superpixels and segment ears of corn, but the hardware facilities that this method relies on could not meet the requirements of real-time segmentation [10]. Xiong, X. proposed a method based on SLIC superpixel region generation and convolutional neural network (CNN) classification to segment ears of wheat [11]. However, this method was time-consuming and cumbersome, and the segmentation results were easily affected by many factors. Fan, M. collected images of wheat ears in a field environment with a camera and accurately extracted the outlines of the ears based on their color and texture features, finally obtaining the number of ears in the image; however, illumination had a great influence on color and texture features in the real environment [12]. Li, H. used the LAB color model and the Otsu thresholding segmentation (Otsu) method to extract rice seedling information and combined it with skeleton extraction to detect the number of machine-planted seedlings [13]. Cao, Y. used digital images of experimental rice fields taken by unmanned aerial vehicles (UAV), the number of rice spikes in the ground plot sample and other measured data, and applied the feature parameters extracted by the best subset selection algorithm to construct a rice spike segmentation model and obtain the number of rice spikes [14].
Nevertheless, it was difficult to prepare the data set, the color features were unstable, and the segmentation effect was closely related to the shooting conditions. Li, Q. generated spike candidate regions using Laws' texture energy, classified each candidate region as spike or background, and counted the resulting regions to obtain the number of spikes. Interference from noise was reduced by applying area and height thresholds; however, the method was only used for individual plants in a laboratory setting [15]. Olsen, P. proposed a computer vision-based method for detecting and counting rice spikes in UAV-acquired images [16]. In summary, the above methods have a variety of limitations: color and texture alterations caused by uneven changes in light and noise; fuzzy boundaries caused by similarities between the panicle and leaf colors; and serious panicle adhesion in field environments, which leads to low detection accuracy or poor segmentation results.
Developments in computer vision have led to the widespread adoption of object detection, semantic segmentation and instance segmentation in agriculture [17][18][19][20][21][22][23][24][25][26][27][28]. Recent studies have applied CNNs to recognizing, counting and segmenting the spikes of grain crops in the field. Zhang, L. constructed a wheat spike recognition model based on the characteristics of winter wheat images collected at the flowering stage in a large field environment, and trained the model for winter wheat spike counting using the gradient descent method [29]. Madec, S. estimated the spike density of wheat based on the Faster R-CNN algorithm with 91% counting accuracy [30]. Yang, M. constructed a deep learning-based semantic segmentation network for UAV images to estimate the number of rice crops in large-area rice fields and experimentally found that, combined with excess green factor retraining, this method yielded better results [31]. Duan, L. applied a rice image dataset and data augmentation techniques to train three models for rice crop segmentation offline and found that a SegNet-based fully convolutional network effectively improved accuracy and efficiency in rice spike segmentation [32]. However, this method required a large amount of work for dataset edging and manual annotation in Photoshop. Kong, H. proposed a Mask R-CNN-based feature extraction and 3D recognition method for rice spike computed tomography (CT) images, extracting 3D spike grain features based on the Euclidean distance to calculate the maturing rate and determine rice spike fullness or dryness; however, this method was not suitable for practical environments and the equipment was difficult to operate [33]. Theoretically, methods based on CNNs are capable of detection and segmentation, but various factors in real-time field environments can interfere with these processes and have not been fully considered in the literature, and rice panicle detection and segmentation have only been studied separately.
Typical object detection algorithms such as Faster R-CNN and the YOLO (you only look once) series mainly obtain target location information by analyzing rectangular boxes [18,20]. Semantic segmentation can be used to distinguish individual pixels in the image but cannot distinguish between different individuals of the same target [22,23]. In contrast, instance segmentation can be used to determine both the location information and semantic information of the target. However, the common instance segmentation algorithm SOLOv2 requires excessive training, and the boundary information from the PolarMask algorithm can be unclear [26,27]. Mask R-CNN, another common instance segmentation algorithm, can be used to detect and segment rice panicles [34]. Compared with other algorithms, it accepts input images of varying scale and has a faster detection speed. Using a constructed data set to directly train the model for detection, we can obtain information on the quantity, location and area occupancy of rice panicles, all without concern for false detections. However, the rice panicle distribution can be relatively tight, and the colors of the panicles and leaves vary unevenly, which leads to missed detections.
Therefore, this paper proposes an improved Mask R-CNN combined with Otsu preprocessing for rice panicle detection and segmentation [35]. The data set is first expanded through data augmentation and annotated with LabelMe, and then the base network training is optimized by combining the KL divergence and the Soft-DIoU-NMS algorithm [36,37]. The real-time field rice images are preprocessed using the Otsu algorithm and then fed into the trained model for detection and segmentation. This detection and segmentation method is easy to operate and has high accuracy, making it highly suitable for monitoring rice growth and yield estimation. The following section introduces the dataset and methods in detail, the third section presents the results and analysis of the experimental data, the fourth section compares the rice panicle image detection and segmentation methods, and the last section is the conclusion.

Dataset and Preprocessing
The dataset was obtained from the experimental rice fields at the Anhui Agricultural Meteorological Center Hefei Branch experimental base, Hefei, China (117°03′26″ E, 31°57′20″ N) [7]. Images of the whole process from transplanting to harvesting of four varieties of one-season rice (Dangyujing 10, Xuanjing Nuo 1, Chuangliangyou 699 and Liangyou 631) were collected in 2019 and 2020. They were planted in six different sowing periods in 12 plots of equal area and proportion (each plot size was 12 m × 5 m). As shown in Figures 1 and A1, network cameras were set up at the diagonal points of the rice field. Camera 1 took images of the whole growth period of the rice in areas 1-6, and camera 2 took images of the whole growth period of the rice in areas 7-12; the blurred images taken in areas 6 and 7 are not available. The Hikvision iDS-2DF8825IX-A(T5) camera (made in Hangzhou, Zhejiang, China) was used to acquire a total of 300 images for creating an image dataset of the milk-ripening stage of rice. The camera video output supports 3840 × 2160 @25 fps, 2100 lines, 37× optical zoom, a maximum of 300 preset positions, and 18 cruise paths, and the camera was mounted on a riser 2.5 m above the ground [7]. Ten shooting points were set in each sowing period block of each experimental field, and the images were automatically uploaded to the FTP server after timed fixed-point collection, so as to avoid, as far as possible, substantial changes in light intensity caused by direct sunlight. Due to the large size of the original images used for segmentation, they were randomly cropped into areas of 573 × 880 pixels; each image was cropped only once to avoid training overfitting. To improve the algorithm robustness, the dataset was expanded using random cropping, mosaicking, and scaling to simulate images obtained at different angles and distances and in different environments. The expanded dataset consisted of a total of 1000 images and was divided into a training set, test set, and validation set at a ratio of 8:1:1.
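The 8:1:1 division described above can be illustrated with a short sketch. The function name `split_dataset`, the fixed random seed, and the use of integer indices in place of image files are assumptions made for illustration; the paper does not state how the split was drawn.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into train/test/validation parts by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # hypothetical seed, for reproducibility only
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_test = len(items) * ratios[1] // total
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]  # the remainder goes to validation
    return train, test, val


# 1000 stand-in image indices, matching the size of the expanded dataset.
train, test, val = split_dataset(range(1000))
print(len(train), len(test), len(val))  # 800 100 100
```

With 1000 images this gives 800 training, 100 test, and 100 validation images.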

Method
The principle of the original Mask R-CNN algorithm is shown in Figure 2. Mask R-CNN is an algorithm for the detection and segmentation of targets in complex backgrounds. Feature pyramid networks (FPN) and anchor technologies are used to optimize the detection of targets at different scales, and fully convolutional networks (FCN) are combined with them to achieve accurate segmentation of the target [35].
Here, we propose a method combining an improved Mask R-CNN with Otsu preprocessing, called Panicle-Mask, to further improve detection accuracy. The flowchart is shown in Figure 3. The original Mask R-CNN algorithm is first improved and optimized for the dataset by adjusting the anchor and detection box accuracy (steps labelled B, C, and D); the rice images are then preprocessed by the Otsu algorithm (step labelled A) and input into the trained model for detection and segmentation.

Otsu Preprocessing
Otsu is a traditional adaptive thresholding algorithm for grayscale image segmentation [35]. In this case, rice panicles can be segmented based on the grayscale pixel distribution in an image. The rice panicle in a rice image shows a clear contrast with the background. First, the excess green index (ExG) is calculated based on the dataset [38]. Then, the collected image is converted to grayscale using the ExG, and the Otsu algorithm is used to calculate the threshold to transform the grayscale image into a binary image that allows segmentation without leakage. For the collected rice color images, the values of the rice panicle and background in the R, G, and B color components have different characteristics. In creating the ExG grayscale images, the original image is first separated into three independent primary color planes, and different color feature combinations are selected. Then, each pixel in the image is transformed to enhance the contrast between the target and the background. The ExG is obtained by the system after automatically exhausting the color components and adding human supervised judgement. Equation (1) is the RGB linear combination calculation formula, and the best combination, that is, the ExG, is calculated from the dataset. This combination is robust to light and color changes and can more completely extract the features of the part of the image corresponding to the ears of rice. Compared to the original images, the ExG-based extraction of green plant images is better; shadows, withered grass, and soil elements can be substantially suppressed, and areas representing the plant are more prominent. For crop or weed recognition, the most commonly used graying method is the ExG method. Equation (2) represents a common ExG expression. For segmentation based on the Otsu algorithm, the ExG is calculated based on the dataset and then used to perform feature extraction to segment the rice panicles. Equation (3) shows the ExG expression in this paper, where R and G are the red and green color components of the image, respectively.

$$T(i,j) = rR(i,j) + gG(i,j) + bB(i,j) \tag{1}$$

where $T(i,j)$ is the resultant feature after the linear operation, $R(i,j)$, $G(i,j)$, and $B(i,j)$ are the grayscale values of the red, green and blue color components, respectively, of the image at $(i,j)$, $r$, $g$, and $b$ are the linear coefficients of the color components $R(i,j)$, $G(i,j)$, and $B(i,j)$, and $(i,j)$ indexes the two-dimensional array of each color component.
In Equation (4), the Otsu algorithm formula, $I$ is the gray value of an image pixel, and threshold is the threshold that maximizes the variance between all pixel classes in the gray map. When the pixel gray value in the image is greater than the threshold, the pixel corresponds to a rice panicle; otherwise, it belongs to the background area.
According to the underlying principle of the Otsu algorithm, to obtain a better segmentation effect, the variance between the classes needs to be maximized while the within-class variances of the target and background are minimized. Thus, the variance is a measure of the uniformity of the grayscale distribution; the larger the value is, the greater the difference between the target and the background after segmentation at a given threshold, and the more favorable the feature extraction [35]. When part of the target is misclassified as background or part of the background is misclassified as target, the difference between the two parts of the image decreases; thus, achieving the largest variance between classes minimizes the probability of misclassification. This method can segment out the ears of rice and will not mistake them for background. However, a field environment is complex; the color of the panicles and leaves is greatly affected by light, the segmented rice panicles can be fragmented and incomplete, and some leaves or stalks may be falsely detected. Therefore, this method uses the improved Mask R-CNN in combination with Otsu preprocessing for detecting and segmenting rice panicles.
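The graying and thresholding steps can be sketched as follows. This is a minimal illustration and not the authors' implementation: the default coefficients r = -1, g = 2, b = -1 are the widely used excess green form 2G - R - B and may differ from the coefficients the paper derives from its dataset, and the function names `exg_gray` and `otsu_threshold`, the rescaling to 0..255, and the synthetic two-tone image are all assumptions.

```python
import numpy as np


def exg_gray(rgb, r=-1.0, g=2.0, b=-1.0):
    """Linear combination T = r*R + g*G + b*B (Equation (1)), rescaled to 0..255."""
    rgb = rgb.astype(np.float64)
    t = r * rgb[..., 0] + g * rgb[..., 1] + b * rgb[..., 2]
    t -= t.min()
    if t.max() > 0:
        t = t / t.max() * 255.0
    return t.astype(np.uint8)


def otsu_threshold(gray):
    """Return the gray level that maximizes the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)
    w0 = np.cumsum(prob)            # weight of the class with values <= t
    mu = np.cumsum(prob * levels)   # cumulative mean up to t
    w1 = 1.0 - w0                   # weight of the class with values > t
    valid = (w0 > 0) & (w1 > 0)     # skip thresholds that leave a class empty
    sigma_b = np.zeros(256)
    sigma_b[valid] = (mu[-1] * w0[valid] - mu[valid]) ** 2 / (w0[valid] * w1[valid])
    return int(np.argmax(sigma_b))


# Synthetic image: left half and right half have different colors.
rgb = np.zeros((4, 8, 3), dtype=np.uint8)
rgb[:, :4] = (50, 160, 50)
rgb[:, 4:] = (200, 200, 50)

gray = exg_gray(rgb)
threshold = otsu_threshold(gray)
mask = gray > threshold  # pixels above the threshold form the foreground class
print(threshold, int(mask.sum()))
```

Which side of the threshold corresponds to panicles depends on the coefficients chosen for the linear combination, so the foreground here should be read only as "the brighter class after graying".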

Adjustment of the RPN Anchor Box
The region proposal network (RPN) lightweight neural network was first proposed in the Faster R-CNN algorithm, and the Mask R-CNN follows its basic structure and adds segmentation [24]. The RPN generates all possible target candidate regions through fixed window sliding on feature images extracted from the feature extraction network, and then maps these candidate regions to the original map through the region of interest alignment (ROI align) operation. Finally, it performs category classification, box regression, and mask generation on these candidate regions [25]. In the Mask R-CNN algorithm, the RPN generates a total of five different scales of anchor boxes, which are the five initial anchor boxes with set areas and proportions. By further fine-tuning their position and size, the best detection frame containing the target can be selected. Whether the setting of the initial anchor box is appropriate or not affects the prediction box accuracy. In this study, the anchor box size was appropriately increased and decreased from the original size under the same conditions for the experiments, and the optimal initial anchor box size was obtained by comparing the average accuracy (mean average precision, mAP).

Bounding Box Adjustment
In the traditional Mask R-CNN algorithm, the candidate box is represented by a four-dimensional vector (x, y, w, h). By learning the deformation ratio between the real and prediction boxes, the final prediction boxes are obtained by step-by-step fine-tuning. During the training process, the outputs are all a series of probability distributions converging to a Gaussian distribution. The Mask R-CNN defines a multitask loss function consisting of three components, as shown in Equation (5), which gives the specific calculation of the classification error and detection error. In Equation (5), $L_{cls}$ and $L_{reg}$ are the classification error and detection error, respectively, and $L_{mask}$ is the semantic segmentation error. $N_{cls}$ and $N_{reg}$ are the normalization parameters, $i$ is the index of an anchor in a mini-batch, $p_i$ is the predicted value for an object, and $p_i^*$ is the true value for an object. $t_i$ and $t_i^*$ represent the coordinates of the prediction box and the real box, respectively.
Appl. Sci. 2022, 12, 11701
However, the bounding box may not be clear when the target is partially obscured, leading to inaccurate labelling and a blurred bounding box, which directly affects the detection error. To solve the problem of blurred bounding boxes, this paper uses the KL divergence loss function instead of the traditional loss function. The KL loss function still consists of three parts; the classification error and semantic segmentation error remain unchanged and the detection error is $L_{reg}'$, which is calculated using Equation (6). The KL divergence, called information gain or information divergence in statistical model inference, measures the difference between two probability distributions [39]. According to the KL divergence theorem, the boundary prediction box and the real box are modeled as a Gaussian distribution and a Dirac delta function, respectively, and the KL loss function is the KL distance between the distribution of the prediction box and that of the real box [36]. Therefore, it is only necessary to calculate the similarity between the Gaussian distribution of the prediction box and the probability distribution of the real box. The smaller the KL divergence of the two distributions is, the closer the two probability distributions are; that is, the closer the prediction box is to the real box.
In Equation (6), $x_g$ is the estimated coordinate, $x_e$ is the coordinate of the predicted box, and $\alpha = \log(\sigma^2)$, where $\sigma$ is the standard deviation, which is used to measure the difference between the predicted box and the real box. As $\sigma \to 0$, the predicted box approaches the real box, and the prediction is more accurate.
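Equation (6) itself is not reproduced in this text, so the following is only a sketch of the per-coordinate form commonly published for KL-divergence bounding box regression, exp(-α)/2 · (x_g - x_e)² + α/2 with α = log(σ²). Whether it matches the paper's Equation (6) term for term is an assumption, and the function name `kl_regression_loss` is hypothetical. The squared difference is symmetric in $x_g$ and $x_e$, so the sketch does not depend on which of the two is taken as the reference coordinate.

```python
import math


def kl_regression_loss(x_g, x_e, alpha):
    """Assumed quadratic KL-style loss for one box coordinate.

    alpha = log(sigma^2) is the predicted log-variance of the coordinate.
    """
    return math.exp(-alpha) / 2.0 * (x_g - x_e) ** 2 + alpha / 2.0


# For a fixed coordinate error d, the loss is smallest when sigma^2 = d^2,
# so the predicted variance is pushed to track the actual localization error.
d = 0.5
best_alpha = math.log(d ** 2)
for alpha in (best_alpha - 1.0, best_alpha, best_alpha + 1.0):
    print(round(alpha, 3), round(kl_regression_loss(0.0, d, alpha), 4))
```

Under this form, a confident prediction (small σ) is penalized heavily when the coordinate is wrong, while an uncertain prediction (large σ) pays the α/2 term even when the coordinate is right.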

Prediction Box Deletion and Selection
Mask R-CNN uses the non-maximum suppression (NMS) algorithm to rank the predictions of a certain category by confidence level, and filters them by calculating the intersection-over-union (IoU) to obtain the object. The best position is obtained by calculating the IoU; however, the detection box with the second highest confidence in this algorithm may be mistakenly deleted due to its high overlap with the detection box with the highest confidence. The improved Soft DIoU-NMS can effectively solve this problem [37].
The Soft DIoU-NMS algorithm first sorts the detection boxes according to the confidence level and selects the highest scoring detection box as the benchmark, and the scores of the remaining detection boxes are recalculated using a linear decay. The detection box with the highest confidence level is retained, and the box with the next highest confidence level is then used as the benchmark. In this way, the confidence level remains unchanged after processing. Finally, the desired effect is achieved by comprehensive culling. The improved algorithm is shown in Equation (7). Compared with the original NMS algorithm, the complexity of the updated algorithm is practically unchanged, and the implementation is equally simple.
In Equation (7), $S_i$ is the confidence score of the current category, $R_{DIoU}$ is the penalty term of the DIoU loss function, $B_i$ denotes all the compared detection boxes in the current category, and $\mu$ denotes the detection box with the highest confidence among all the prediction boxes, generally taking $\varepsilon = 0.5$.
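Because Equation (7) is not reproduced in this text, the sketch below shows one plausible reading of the procedure described above, not the authors' exact rule. It assumes that a box keeps its score when IoU minus the DIoU penalty is below ε and is otherwise multiplied by (1 - (IoU - penalty)), and that boxes are finally culled with a small score threshold. The function names, the `score_thresh` parameter, and the example boxes are hypothetical.

```python
import numpy as np


def iou_and_diou_penalty(box, boxes):
    """IoU and DIoU penalty between one box and an array of boxes (x1, y1, x2, y2)."""
    ix1 = np.maximum(box[0], boxes[:, 0])
    iy1 = np.maximum(box[1], boxes[:, 1])
    ix2 = np.minimum(box[2], boxes[:, 2])
    iy2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    iou = inter / (area + areas - inter)
    # Squared distance between the box centres.
    dx = (box[0] + box[2]) / 2 - (boxes[:, 0] + boxes[:, 2]) / 2
    dy = (box[1] + box[3]) / 2 - (boxes[:, 1] + boxes[:, 3]) / 2
    # Squared diagonal of the smallest box enclosing both boxes.
    ew = np.maximum(box[2], boxes[:, 2]) - np.minimum(box[0], boxes[:, 0])
    eh = np.maximum(box[3], boxes[:, 3]) - np.minimum(box[1], boxes[:, 1])
    return iou, (dx ** 2 + dy ** 2) / (ew ** 2 + eh ** 2)


def soft_diou_nms(boxes, scores, eps=0.5, score_thresh=0.001):
    """Return the kept box indices (in selection order) and the decayed scores."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    idxs = np.arange(len(scores))
    keep = []
    while idxs.size > 0:
        top = np.argmax(scores[idxs])
        best = idxs[top]               # current benchmark box
        keep.append(int(best))
        idxs = np.delete(idxs, top)
        if idxs.size == 0:
            break
        iou, penalty = iou_and_diou_penalty(boxes[best], boxes[idxs])
        overlap = iou - penalty
        # Linear decay only for boxes that overlap the benchmark strongly.
        scores[idxs] *= np.where(overlap >= eps, 1.0 - overlap, 1.0)
        idxs = idxs[scores[idxs] > score_thresh]
    return keep, scores


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
keep, new_scores = soft_diou_nms(boxes, [0.9, 0.8, 0.7])
print(keep, np.round(new_scores, 3))
```

In this example the second box overlaps the first heavily, so its score is reduced rather than the box being deleted outright, which is the behaviour that distinguishes the soft variant from hard NMS.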

Evaluation Indicators

Precision, Recall, F1-Score, and IoU
Precision and recall are often used as model evaluation indicators in target detection. Precision, also known as accuracy, reflects the proportion of correct classifications among the number of positive samples classified by the model. Recall, known as the completeness rate, reflects the proportion of correctly predicted positive samples out of the number of actual positive samples. The area under the precision-recall (P-R) curve represents the correctness of the model (AP), and the AP values of all categories are averaged to obtain the average correctness (mAP), which is calculated with Equation (8). The F1-score is the weighted harmonic average of the precision and recall. The IoU is the overlap rate between the candidate bound generated in target detection and the ground truth bound, that is, the ratio of their intersection to their union, and is calculated as in Equation (9).

$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

$$F1\text{-score} = \frac{(1 + a^2) \times P \times R}{a^2 (P + R)}, \quad a > 1$$

where TP is the number of correctly predicted positive samples, FP is the number of negative samples incorrectly predicted as positive samples, and FN is the number of positive samples incorrectly predicted as negative samples. In Equation (9), a is the number of target categories, A is the area of the detected frame, and B is the area of the real frame.
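As a worked illustration of these indicators, the sketch below computes precision, recall, the standard F1-score (the harmonic mean 2PR/(P + R)), and the IoU of two axis-aligned boxes. The function names and the sample counts are made up for the example, and boxes are assumed to be given as (x1, y1, x2, y2).

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and harmonic-mean F1 from detection counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


# Hypothetical counts: 8 correct detections, 2 false alarms, 2 missed panicles.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(round(p, 3), round(r, 3), round(f1, 3))

# Two 2x2 boxes offset by one unit share an area of 1 out of a union of 7.
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
print(round(iou, 4))
```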

SSIM
The structural similarity (SSIM) is an indicator used to measure the similarity of images, consisting of a comparison of brightness l, contrast c, and structure s [40,41]. As the image size must be the same during the calculation, the image needs to be converted to grayscale first, and then the SSIM value of the corresponding sub-image is obtained by moving along the image pixel-by-pixel with a window of a certain size. Finally, the similarity of the two images is obtained by averaging the values of the windows. The SSIM is a number between 0 and 1, and the calculation is shown in Equation (10). The larger the value is, the smaller the difference between the two images.

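$$SSIM(x, y) = [l(x, y)]^{\alpha} \cdot [c(x, y)]^{\beta} \cdot [s(x, y)]^{\gamma} \tag{10}$$

where $l(x, y)$ is used as an estimate of luminance, $c(x, y)$ is used as an estimate of contrast, and $s(x, y)$ is used as a measure of structural similarity, calculated as in Equation (11). $\alpha$, $\beta$, and $\gamma$ are weights.

$$l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \quad c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \quad s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3} \tag{11}$$

where $\mu_x$ and $\mu_y$ are the mean pixel values of the two images, $\sigma_x$ and $\sigma_y$ are the standard deviations of the pixel values of the images, $\sigma_{xy}$ is the covariance of x and y, and $C_1$, $C_2$, and $C_3$ are constants. Finally, to simplify the calculation by letting $\alpha = \beta = \gamma = 1$ and $C_3 = C_2/2$, we obtain the final Equation (12).

$$SSIM(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \tag{12}$$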
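The simplified form in Equation (12) can be evaluated directly on one pair of equally sized grayscale windows, as in the sketch below. It computes a single global value rather than the sliding-window average described above, and the constants C1 = (0.01 · 255)² and C2 = (0.03 · 255)² are the conventional choices for 8-bit images, assumed here because the text does not state them. The function name `global_ssim` is hypothetical.

```python
import numpy as np


def global_ssim(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """Simplified SSIM over one window, with alpha = beta = gamma = 1 and C3 = C2 / 2."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * data_range) ** 2  # assumed conventional constants
    c2 = (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


rng = np.random.default_rng(0)
window = rng.integers(0, 256, size=(8, 8))
same = global_ssim(window, window)            # identical windows
inverted = global_ssim(window, 255 - window)  # structure is reversed
print(round(same, 6), inverted < same)
```

An image compared with itself gives a value of 1, while the inverted copy scores lower because its covariance with the original is negative.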

pHash
The hash algorithm (Hash) specifically includes the mean hash, difference hash, and perceptual hash (pHash), and is commonly used to determine the similarity of two images irrespective of the height, width, brightness or color [42]. The pHash algorithm reduces the image frequency through a discrete cosine transform (DCT) and forms the hash sequence by encoding a unique hash fingerprint. The similarity is then calculated by sequentially comparing the values in the sequence. Compared with the other two algorithms, this algorithm has better robustness and higher accuracy. The higher the value of pHash is, the less similar the two images are; if the value is less than 5, the two images are very similar. If the input image is treated as a two-dimensional signal, the low-frequency part contains most of the image information with small brightness variation, and the high-frequency part contains the image details with large brightness variation. The calculation is shown in Equation (13).
In Equation (13), E(u) and F(u,v) are the coefficients of the one-dimensional and two-dimensional transformations, respectively, N is the number of points in the original signal, f(i) is the initial signal of the input, c is a coefficient of the matrix orthogonal transformation, and u and v are the number of points in each dimension of the input signal.
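A minimal sketch of a DCT-based perceptual hash is given below. It follows the common pHash recipe (2D DCT, keep the low-frequency 8 × 8 block, threshold at the median, compare by Hamming distance), which is an assumption about the variant used in the paper. The input is assumed to be a square grayscale array that has already been resized, for example to 32 × 32, and the function names are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m


def phash(gray, hash_size=8):
    """Return a boolean fingerprint of hash_size * hash_size bits."""
    g = np.asarray(gray, dtype=np.float64)
    d = dct_matrix(g.shape[0])
    coeffs = d @ g @ d.T                     # 2D DCT of the image
    low = coeffs[:hash_size, :hash_size]     # low-frequency block
    median = np.median(low.ravel()[1:])      # ignore the DC term when thresholding
    return (low > median).ravel()


def hamming(h1, h2):
    """Number of differing bits between two fingerprints."""
    return int(np.count_nonzero(h1 != h2))


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32))
print(hamming(phash(image), phash(image)))  # 0
```

The Hamming distance plays the role of the pHash value discussed above: identical images give 0, and larger distances indicate less similar images.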

Experimental Process
The experiments in this study were run with the Keras deep learning framework on a computer with an Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30 GHz, 12 GB memory, and an NVIDIA GeForce GTX 1080 Ti GPU, running the Windows 10 Professional operating system. Python was used to train the rice panicle detection model. First, LabelMe was used to label the dataset, applying a different label to each rice panicle. Then, the boundary of each rice panicle was obtained from the generated json-format file to form a mask image. The Mask R-CNN and improved Mask R-CNN algorithms were trained separately. Under the same conditions, five sets of different fixed anchor box sizes were used for training. After the experiment, the parameters of the improved algorithm were set as follows: learning_rate = 1e-5, RPN_ANCHOR_SCALES = (16,32,64,128,256), epochs = 100, and steps_per_epoch = 200. The improved model converged after 20,000 training iterations (100 epochs × 200 steps per epoch), and the corresponding loss and P-R curves are shown in Figures 4 and 5, respectively. As seen in Figure 4, the loss function converged, and the model finished training.

Experimental Results
Table 1 shows the detection results of the model with different anchor box sizes. The accuracy of the original Mask R-CNN model is 68.58% using the original anchor box sizes of {32, 64, 128, 256, 512}. Under the same conditions, the model was trained with different anchor box sizes, producing the RPN-panicle1, RPN-panicle2, Mask R-CNN, RPN-panicle3, and RPN-panicle models. The optimal model was RPN-panicle2, which showed an accuracy improvement of 11.67% over the original model with an optimal RPN anchor box size of {16, 32, 64, 128, 256}. The detection box accuracy was then further improved by modifying the loss function and the NMS algorithm, and the final improved model, named Panicle-Mask, achieved a mAP of 89.10% after training. The IoU of the detection box was 84.42% after training. The IoU value obtained by training the most basic Mask R-CNN model was 59.73%, while the IoU value obtained by using the method in this paper was 87.42%.
Figure 6 shows the detection results for four randomly selected images. Otsu, Mask R-CNN, and Panicle-Mask denote, respectively, the segmentation method based on Otsu thresholding, detection and segmentation based on the Mask R-CNN, and detection and segmentation based on the improved Mask R-CNN combined with the Otsu thresholding algorithm. As shown in Figure 6, all three methods could segment the rice panicles. The latter two methods labelled different detected rice panicles with different colors; each rectangular box is the smallest box enclosing a detected rice panicle, and the number on the label indicates the detection confidence. Unlike the Otsu algorithm, the Mask R-CNN-based algorithms both detected and segmented the rice panicles, and their number and relative positions were obtained from the number and coordinates of the rectangular boxes. The background interfered greatly with the Otsu algorithm; when the color difference between the rice panicles and the background was relatively large, a better segmentation effect was achieved. In contrast, when the image contained similar colors, the segmentation was incomplete, and there were many false detections. Because Panicle-Mask uses the Otsu algorithm to preprocess the image, the influence of the background on detection is reduced, and the model was improved and optimized according to the characteristics of the dataset. Therefore, Panicle-Mask has higher accuracy.
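The box IoU and the NMS step discussed above can be illustrated with a minimal Python sketch. It assumes boxes in (x1, y1, x2, y2) pixel coordinates and uses the Gaussian form of soft NMS; the function names, the `sigma` value and the score threshold are illustrative assumptions and not the settings used to train Panicle-Mask.

```python
import math

def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian soft NMS: decay the scores of overlapping boxes instead of deleting them.

    Returns (index, decayed_score) pairs in the order the boxes were selected.
    """
    scores = list(scores)
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        kept.append((best, scores[best]))
        remaining.remove(best)
        for i in remaining:
            overlap = box_iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
        # Drop boxes whose decayed score fell below the threshold.
        remaining = [i for i in remaining if scores[i] >= score_thresh]
    return kept

# Hypothetical boxes: two identical detections and one far away.
boxes = [(0, 0, 2, 2), (0, 0, 2, 2), (10, 10, 12, 12)]
kept = soft_nms(boxes, [0.9, 0.8, 0.7])
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7
print([i for i, _ in kept])
```

In this sketch the duplicate box is not discarded outright; its score is only decayed, so a distant box with a lower initial score can overtake it. This is the behaviour that distinguishes soft NMS from hard NMS when panicles overlap.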
Figure 7 shows examples of the binary images obtained with the three methods after segmentation. The similarity of the images can be further measured according to the gray image, and the proportion of rice panicles in the image can be calculated from the binary image. As seen in Figure 7, in the binarized images obtained by the Otsu threshold segmentation algorithm, the rice panicles were correctly segmented, but a large amount of background was also segmented as rice panicles, and the segmented regions contained gaps. The results detected directly using the original Mask R-CNN algorithm showed many missed detections, while detection using the combination of the Otsu thresholding segmentation algorithm and the improved Mask R-CNN algorithm compensated for the shortcomings of each individual method. Further analysis of the examples in Figure 6 reveals that the results obtained by the Otsu algorithm were all on the high side and those obtained by the original Mask R-CNN algorithm were all on the low side. The results obtained by the combined and improved Panicle-Mask algorithm were the most similar to the real results.
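As a rough illustration of how a binary image and the panicle proportion can be derived, the following Python sketch implements Otsu thresholding with NumPy and then computes the foreground fraction of the resulting mask. It is a sketch under assumptions: the input is any single-channel 8-bit image (the paper applies Otsu after computing the ExG index, which is not reproduced here), pixels above the threshold are treated as foreground, and the function names are hypothetical.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the 8-bit threshold that maximizes the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # weight of the class with values <= t
    mu = np.cumsum(prob * np.arange(256))    # cumulative mean up to t
    numer = (mu[-1] * omega - mu) ** 2
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)
    valid = denom > 0                        # skip thresholds with an empty class
    sigma_b[valid] = numer[valid] / denom[valid]
    return int(np.argmax(sigma_b))

def foreground_proportion(mask):
    """Fraction of nonzero pixels in a binary mask."""
    return float(np.count_nonzero(mask)) / mask.size

# Synthetic 8-bit image: dark left half (50) and bright right half (200).
gray = np.full((4, 8), 50, dtype=np.uint8)
gray[:, 4:] = 200
t = otsu_threshold(gray)
mask = gray > t
print(t, foreground_proportion(mask))
```

For a real segmentation result, `mask` would be the binary panicle image, and `foreground_proportion(mask)` would give the proportional area of the panicles in that image.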

Error Analysis
Twenty randomly selected rice images at the milk ripening stage were detected and segmented using the three methods. The results were converted into grayscale images, and the SSIM and pHash values were calculated against reference grayscale images to comprehensively measure the accuracy of the detection results. The reference grayscale images were obtained by converting the rice panicle images produced by manual labeling. Some example images are shown in Figure 8. In some of the images, such as those in Figure 8a-c, the contrast between the rice panicles and the background is obvious, while in others, such as those in Figure 8d-f, the contrast is poor.
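To make the SSIM criterion concrete, the sketch below computes a single-window (global) SSIM between a result image and a reference image. This is a simplification stated as an assumption: standard SSIM implementations average the index over local sliding windows, and the paper does not state which variant was used. The constants follow the usual choice of K1 = 0.01 and K2 = 0.03; the function name is hypothetical.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """SSIM computed once over the whole image instead of over local windows."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    luminance_contrast = (2 * mx * my + c1) * (2 * cov + c2)
    normalizer = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return luminance_contrast / normalizer

# Hypothetical reference mask and two candidate results.
reference = np.zeros((8, 8), dtype=np.uint8)
reference[2:6, 2:6] = 255
same = reference.copy()
inverted = 255 - reference
print(global_ssim(reference, same), global_ssim(reference, inverted))
```

An identical result scores 1, and results that disagree with the reference score lower, which matches how the SSIM percentages in this section are read.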

Table 2 shows the SSIM results obtained using the different methods. Among the three methods, direct detection using the Mask R-CNN algorithm yielded the second-highest SSIM values, with a maximum of 92.31%. The results obtained using Otsu preprocessing combined with the improved model were the highest, with minimum values above 88% and a maximum of 95.86%. For the Otsu algorithm alone, the segmentation results were better for rice images 9, 12, and 13, where the color difference between the rice panicles and the background was larger, and worse for rice images 3, 5, 19, and 20, where the color difference was smaller. When images 3 and 20 were first preprocessed using the Otsu algorithm to remove interference from the background region and then detected and segmented using the improved Mask R-CNN model, the SSIM values improved by approximately 10%.
Table 3 shows the pHash results obtained using the different methods; the pattern of the data was approximately the same as that in Table 2. As seen in Table 3, the accuracy of the results obtained by directly using the Mask R-CNN algorithm for detection was low, with a maximum pHash value of 12, which may be related to the different growth characteristics of rice with different panicle shapes, sizes, and colors. The accuracy of the results obtained by using the Otsu algorithm showed overall improvements; specifically, the segmentation results for images 12 and 13 were more similar to the real results. The segmentation accuracy was further improved by combining the Otsu algorithm and Mask R-CNN in the Panicle-Mask algorithm, and the perceptual hash value of all the detected rice images was no more than 5. The perceptual hash value of images 12 and 13 was reduced to 2, which indicates that the results obtained by this method are very similar to the real results.
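The pHash values above behave like a distance: smaller values mean more similar images. A minimal Python sketch of one common perceptual hash construction is given below, assuming a 64-bit DCT-based hash compared by Hamming distance. The downsampling by block averaging, the median threshold and all function names are assumptions for illustration; the paper does not specify its implementation.

```python
import numpy as np

def dct_matrix(n):
    """Unnormalized DCT-II basis: rows are frequencies, columns are samples."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    return np.cos(np.pi * (2 * i + 1) * k / (2 * n))

def phash_bits(gray, size=32, hash_size=8):
    """64-bit perceptual hash of a 2-D grayscale array (at least 32 x 32 pixels)."""
    g = np.asarray(gray, dtype=np.float64)
    h, w = g.shape
    h, w = h - h % size, w - w % size              # crop to a multiple of `size`
    small = g[:h, :w].reshape(size, h // size, size, w // size).mean(axis=(1, 3))
    d = dct_matrix(size)
    coeffs = d @ small @ d.T                       # 2-D DCT of the downsampled image
    low = coeffs[:hash_size, :hash_size]           # keep the low-frequency block
    return (low > np.median(low)).ravel()

def hamming_distance(bits_a, bits_b):
    """Number of hash bits that differ."""
    return int(np.count_nonzero(bits_a != bits_b))

# Hypothetical grayscale images: a reference and a locally modified copy.
rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(64, 96))
result = reference.copy()
result[:16, :16] = 0
distance = hamming_distance(phash_bits(reference), phash_bits(result))
print(distance)
```

With this construction the distance ranges from 0 (identical hashes) to 64, which is compatible with the reported values of 2, 5 and 12, although the exact scale used in the paper is not stated.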

Detection and Segmentation Results
From the above analysis, it is clear that detection using the Otsu algorithm combined with the Mask R-CNN algorithm produces images with the highest similarity to the reference images. This method first removes most of the background interference and then uses the trained model for detection and segmentation, which yields both the number and the specific locations of the rice panicles and also allows their proportional area to be calculated. The proportional area of the rice panicles was little affected by closely distributed panicles or panicles that were stuck together, whereas these conditions caused a large error in the panicle count.
Figures 9 and 10 compare the reference number and proportional area of the rice panicles with those obtained using the proposed method for 40 randomly selected rice images at milk maturity. The horizontal axis represents the image index, and the vertical axis represents the number of panicles in Figure 9 and the proportional area of the panicles in Figure 10. The reference value refers to the number of rice panicles and the proportion of rice panicle area obtained by manual calculation and program statistics after the corresponding image was manually segmented, and the test value refers to the corresponding result obtained using the method in this paper. The first 20 images were from the validation set, i.e., the 2019 and 2020 rice images, and the last 20 images were randomly selected from the 2018 milk-ripening rice images to further verify the generalizability of this method. If the panicle was completely blocked by leaves, it was considered to be different from the same rice panicle. Analysis of Figures 9 and 10 shows that the results obtained by the method in this paper were all lower than the reference values, and the error for the number of panicles was higher than that for the proportional area, with average errors of 16.73% and 3.90%, respectively. The corresponding minimum errors were 2.02% and 1.97%, respectively.
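The average and minimum errors quoted above can be reproduced from per-image test and reference values with a few lines of Python. This sketch assumes the error is the per-image relative error, |test - reference| / reference, expressed in percent; the paper does not spell out the formula, and the numbers in the example are hypothetical, not the values behind Figures 9 and 10.

```python
def relative_errors(test_values, reference_values):
    """Per-image relative error in percent: |test - reference| / reference * 100."""
    return [abs(t - r) / r * 100.0 for t, r in zip(test_values, reference_values)]

def summarize(errors):
    """Return (average error, minimum error) over all images."""
    return sum(errors) / len(errors), min(errors)

# Hypothetical panicle counts for two images: detected vs. manually counted.
count_errors = relative_errors([8, 18], [10, 20])
average_error, minimum_error = summarize(count_errors)
print(average_error, minimum_error)
```

The same two functions apply unchanged to the proportional area, using the area fraction of each image in place of the count.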

Comparison with Rice Panicle Image Detecting and Segmenting Methods
The proposed method was compared with existing rice panicle detection and segmentation methods, as shown in Table 4. The table lists the specific algorithms used and their evaluation indexes, and summarizes the advantages and disadvantages of each method. In addition to the evaluation indicators already listed above, the root mean square error (RMSE), also called the standard error, measures how close the calculated value is to the true value; the smaller the value, the higher the accuracy of the model. Accuracy (acc) refers to the proportion of the correct quantity to the total quantity.
As can be seen in Table 4, the method proposed in this paper is clearly better than the method proposed by Xiong, X. and slightly worse than that proposed by Duan, L. [11,32]. However, the rice images in reference [32] were taken from the top with a Nikon D40 digital camera, and the number of rice panicles in each image was small. In this paper, rice images were taken from the side with field cameras, and each image contained many rice panicles, which makes the task more complex but more practical. Compared with the method proposed by Cao, Y. [14], the RMSE value of the number of rice panicles is slightly lower; however, the method in reference [14] places higher requirements on the flight height of the UAV when collecting images and is not suitable for the small range of a rice field environment. The method proposed by Kong, H. [33] was used for counting the grains per panicle in a laboratory environment. Although the result in this paper was worse than that of reference [33], with an accuracy of 83.27% for the number of rice panicles, the proposed method is more practical for a real field environment. Overall, Table 4 shows that the method proposed in this paper performs well in an actual field environment.
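The two indicators defined above can be sketched in Python as follows. The RMSE follows its standard definition. The accuracy function is an assumption: it treats acc for panicle counting as 100% minus the mean relative counting error, which is consistent with the reported 83.27% accuracy and 16.73% average error but is not stated explicitly in the paper. The function names and example values are hypothetical.

```python
import math

def rmse(predicted, reference):
    """Root mean square error between predicted and reference values."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))

def count_accuracy(predicted, reference):
    """Assumed definition: 100% minus the mean relative counting error."""
    errors = [abs(p - r) / r * 100.0 for p, r in zip(predicted, reference)]
    return 100.0 - sum(errors) / len(errors)

# Hypothetical panicle counts for three images.
predicted = [8, 18, 27]
reference = [10, 20, 30]
print(rmse(predicted, reference), count_accuracy(predicted, reference))
```

Because RMSE is expressed in panicles per image while acc is a percentage, the two values in Table 4 are not directly comparable across methods that report only one of them.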

Conclusions
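Based on the basic Mask R-CNN instance segmentation algorithm, this research proposes an improved Mask R-CNN combined with Otsu preprocessing for detecting and segmenting rice panicles and compares the results of the three methods. Compared with the images produced by the component methods, the binary image obtained using the proposed method after detection, transformation and calculation is closer to the real image, and information on the number and positions of rice panicles can be obtained with less interference from the background. After the rice image is segmented, the number of rice panicles and their relative proportional area can be calculated, and the rice panicle quality can be judged by color, texture and other features. Therefore, this method is of great value for monitoring rice growth and estimating yield. Compared with prior methods, the present method has several advantages:
(1) The classical Mask R-CNN model has been improved and optimized by combining the KL divergence and the soft NMS algorithm with the best RPN anchor box size, which makes the model more accurate and efficient in detecting rice panicles and mitigates the long training time, low detection accuracy and fuzzy boundaries of the original algorithm;
(2) Before the actual images are input into the detection model, the ExG is calculated based on the dataset features. Then, the traditional Otsu threshold segmentation method is used for preprocessing, which reduces the influence of background interference and improves the model detection accuracy to a certain extent;
(3) This method can operate well in a field environment and is of great value for monitoring rice growth and estimating yield.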
The detection and segmentation accuracy of this method needs to be further improved; the next steps will involve improving both the dataset and the model. In terms of the dataset, we plan to increase its diversity with an automatic data augmentation algorithm to better match the statistics of actual field environments. For model optimization, the network structure will be changed to automatically search for the best module for training to improve the accuracy of model detection and segmentation.

Figure 6. Examples of detection results using different methods.

Figure 9. Comparison of the results of the number of rice panicles.

Figure 10. Comparison of the results of the proportional area of rice panicles.

Figure A1. Capturing real-time images of a rice field with a web camera.

Table 1. Detection results of the models for different anchor box sizes.

Table 2. SSIM values for the different methods.

Table 3. pHash values for the different methods.

Table 4. Comparison between Panicle-Mask and other methods for rice panicle detection and segmentation.