Article

Oat Ears Detection and Counting Model in Natural Environment Based on Improved Faster R-CNN

1 College of Agricultural Engineering, Shanxi Agricultural University, Jinzhong 030801, China
2 Dryland Farm Machinery Key Technology and Equipment Key Laboratory of Shanxi Province, Jinzhong 030801, China
* Author to whom correspondence should be addressed.
Agronomy 2025, 15(3), 536; https://doi.org/10.3390/agronomy15030536
Submission received: 4 February 2025 / Revised: 18 February 2025 / Accepted: 21 February 2025 / Published: 23 February 2025
(This article belongs to the Section Precision and Digital Agriculture)

Abstract

To enable oat ears to be identified quickly and accurately in the natural environment, this paper proposes an oat ear detection and counting model based on an improved Faster R-CNN. In the backbone network, the commonly used single convolutional neural network is replaced by a parallel convolutional neural network for oat ear feature extraction, and a feature pyramid network (FPN) is incorporated to alleviate the missed detection of small targets and the multi-scale problem of oat ears. The anchor box configuration is then optimized according to the size and distribution of the labeled boxes in the dataset, which improves the efficiency with which the model detects oat ears. Finally, progressive non-maximum suppression (Progressive-NMS) replaces standard non-maximum suppression (NMS) to optimize the screening of prediction boxes. Across the designed experiments, the optimized model effectively detects oat ears in the natural environment and counts oat ears per unit area. Compared with the traditional Faster R-CNN detection model, the mean average precision (mAP) of the improved model is increased by 13.01%, which can provide a reference for oat yield prediction and intelligent field operations.

1. Introduction

Oat is a globally cultivated crop and an important branch of the wheat crops. According to 2022 data, the global harvested area of oats reached 9923 thousand hectares, up 4.5% year-on-year, and total global oat production was 24,998 thousand metric tons, up 11.1% year-on-year. Currently, about 73% of global oats are used for livestock feed and forage, 12% are processed into nutritious food, and the remaining 15% are used as seed or in industrial production [1]. Oat yield prediction provides an important reference for subsequent agricultural production management decisions, and the number of oat ears is one of the key parameters characterizing yield. Therefore, the rapid and accurate detection and counting of oat ears is crucial for yield prediction.
At present, the identification of wheat crops and yield estimation based on the identification results depend mainly on deep learning, realized by applying target detection algorithms. In research on key wheat crops such as wheat and barley, the two mainstream target detection algorithms, Faster R-CNN and YOLO, provide effective technical support for detection and counting, and related research has made significant progress. Faster R-CNN offers higher detection accuracy, especially for small targets, and can flexibly adopt different backbone networks for feature extraction to meet different detection requirements; its disadvantages are a relatively slow detection speed and high computational complexity. Ling Haibo et al. [2] introduced a weighted box fusion algorithm to replace the original NMS method to address the missed detections of the traditional Faster R-CNN in wheat ear detection. The improved strategy constructs the final fused box from the confidences of all prediction boxes generated by the region proposal network, effectively reducing the omission rate for wheat ears. Xu Bowen et al. [3] introduced the BiFPN weighted fusion unit into the ResNet50 network, making full use of its dual-path advantages and effectively reducing the information loss of deep and shallow features. At the same time, more features are fused through skip connections, and weight coefficients are assigned according to each feature's contribution to the model. Compared with the traditional Faster R-CNN using Vgg16 as the feature extraction network, detection accuracy is significantly improved. Li Lei et al. [4] explored image-based detection of spike number per unit area on RGB images. After comparing field manual counting, image-processing-based counting, and Faster R-CNN counting, they found that Faster R-CNN can efficiently count spikes per unit area, providing an accurate and fast phenotypic identification method. Wu et al. [5] collected a wheat grain image dataset covering three varieties, six background environments, and two settings each of height, angle, and grain number. Based on these data, they constructed a detection and enumeration model using the Faster R-CNN algorithm. The model adapts to a variety of background conditions, image sizes, grain sizes, shooting angles and heights, and grain crowding degrees, showing broad applicability. With its simple structure and high computational efficiency, YOLO is suitable for embedded devices and real-time applications; although its accuracy is relatively low, its excellent detection speed makes it strongly competitive in real-time scenarios. Huang Shuo et al. [6] proposed CBAM-YOLOv5, a network structure combining the convolutional block attention module (CBAM) with YOLOv5, to address the strong subjectivity, low efficiency, and lack of systematic deployment of manual wheat ear counting. The network improves the accuracy and efficiency of measuring wheat ear number per unit area by adaptively refining the feature map. Lu Qiaokuan et al. [7] proposed an anchor-free wheat ear detection method based on the YOLO framework.
This method uses CSPDarkNet53 as the feature extraction network, with a feature pyramid structure in the middle layers to design the feature processing module, increasing the receptive field and extracting multi-scale image information to obtain feature maps that fuse high- and low-level semantic information. At the back end, the FoveaBox anchor-free detector is used, finally achieving accurate wheat ear detection. Lu Ziao et al. [8] used the EfficientViT backbone to replace the feature extraction layers of YOLOv7-tiny to enhance image feature extraction. In the feature fusion layers, the CARAFE upsampling module replaces the original upsampling module, further optimizing feature fusion. At the same time, an efficient multi-scale attention mechanism based on cross-space learning is introduced into the feature fusion layers and the output layer, effectively improving detection performance. Jing Furong et al. [9] proposed Wheat-YOLO Net, a lightweight hybrid network for wheat ear recognition based on the YOLOv8 architecture. By retaining only the detection heads for small and large targets, the parameter scale of the network is effectively reduced without affecting detection performance. In addition, they integrated the CBAM attention module into the backbone to enhance feature extraction and used Wise-IoU as the loss function to optimize sample balance. These improvements significantly reduce the model's parameter count while achieving better detection results. Chen et al. [10] constructed a dense wheat head dataset and, based on it, proposed CB-YOLO, an efficient and fast real-time model. The model combines spatial and channel attention mechanisms to integrate channel and spatial features and performs well on dense and occluded wheat ear detection tasks. Beyond these two mainstream detection models, other detection algorithms have also been applied to wheat crop detection and yield estimation. Wang et al. [11] sought to improve the accuracy of winter wheat yield estimation and to mitigate the underestimation of high yields and overestimation of low yields in existing models. Combining the local feature extraction ability of CNNs with the global information extraction ability of the self-attention-based Transformer, they constructed a CNN-Transformer deep learning model for winter wheat yield estimation. The results show that the model effectively predicts winter wheat yield and captures the critical period of winter wheat growth. To address interference and occlusion of wheat ears in complex field environments, Yu Junwei et al. [12] proposed an improved Oriented R-CNN rotated-box detection and counting method to achieve accurate localization and counting. This method converts the traditional horizontal-box detection model into a rotated-box model, effectively solving the missed detections caused by wheat ear occlusion. Combining UAV technology, Bao et al. [13] proposed an automatic wheat ear counting method for UAV wheat images based on FE-P2Pnet. This method increases the difference between wheat ear targets and the background by enhancing the brightness and contrast of UAV images, effectively reducing interference from complex background elements such as leaves and stalks. At the same time, the point-annotation-based network P2Pnet is introduced as the baseline to cope with the challenges posed by the dense distribution of wheat ears.
Compared with wheat, barley, and other wheat crops, the application of target detection technology to oat yield prediction is particularly insufficient; the few retrievable studies focus mainly on oat growth trends. Existing results on using target detection algorithms for oat crop identification and yield prediction are very limited, and complete, systematic references are difficult to find. At the same time, the development of intelligent agricultural machinery and equipment for oats based on such research is less mature than for other wheat crops. Drawing on experience with popular research objects such as wheat and barley, applying a target detection algorithm to oats requires fully considering the differences between oats and other wheat crops in appearance, growth environment, and technical application, and selecting an appropriate algorithm for optimization. These differences are embodied in the following aspects: (1) the oat ear is loose and small and easily occluded by branches and leaves, leading to missed detections; (2) the oat stem is slender and the branches are dense, so in windy conditions the ears of different oat plants readily overlap, leading to false detections when counting oat ears; (3) oats are strongly affected by light orientation and water conditions, which may lead to inconsistent ear sizes within the same variety, producing multi-scale problems for oat ears [14]; and (4) high-quality oat ear datasets are lacking.
This paper addresses these differences between oats and other wheat crops from two aspects, the dataset and the detection and counting model, so as to fill the gap in oat ear detection and counting research. Firstly, this paper presents a new oat ear dataset to address the lack of such datasets. Then, we chose Faster R-CNN, which performs better on small target detection, as the basic oat ear detection model and optimized it for the main factors identified above. We hope that the results of this manuscript can provide a valuable reference for oat yield prediction and for the development of oat intelligent agricultural machinery and equipment based on target detection technology.

2. Materials and Methods

2.1. Oat Ear Image Acquisition and Data Annotation

The crop used in this paper was ‘Pinyan 4’ oat. The variety was provided by the Shenfeng Seed Research and Development Cultivation Base of Shanxi Agricultural University (Shanxi, China), which has accumulated 5 years of experience in oat seed development and cultivation. ‘Pinyan 4’ is a new oat variety selected from naked oats by sexual hybridization combined with the pedigree method. After maturity, the morphological characteristics of this variety are consistent with those of conventional naked oats, so the choice of variety does not bias the test results. Sowing was performed mechanically with the following parameters: row spacing 20 cm, sowing depth 4 cm, and sowing rate 2.5 g per square meter.
During the local oat ripening period from May to June 2024, data collection was carried out by five professionals familiar with oat growth characteristics. At the mature stage, oat growth is vigorous, biomass accumulates rapidly, and the color of the oat ears gradually changes from green to light yellow or golden yellow; this is an important period for establishing oat yield [15]. Therefore, this paper selects oat ears at this stage as the object of image acquisition.
Oat ear images were acquired in three periods each day, starting at 6:00, 13:00, and 17:00, with each period lasting two hours. The professionals collected oat ear images under different weather conditions (sunny, cloudy, and windy) and different backgrounds (sparse, dense, and occluded), improving the diversity of the images. The shooting distance was 20~50 cm; the sample composition and description are shown in Figure 1. After manual screening and cropping of the collected images, 2000 oat ear images with a resolution of 720 × 720 pixels were obtained. These images were divided into a training set, validation set, and test set at a ratio of 8:1:1.
In this paper, the Image Labeler tool of MATLAB (2024a) was used to annotate the region of each oat ear in the original images, generating a (.mat) annotation file in table format. The label information of a single oat ear box is (M, x_i, y_i, w_i, h_i), where x_i and y_i are the coordinates of the upper-left corner of the labeled box, and w_i and h_i are its width along the X-axis and its height along the Y-axis, respectively. M is the number of labeled boxes and thus also the number of oat ears in the picture. The labeling of a single oat ear is shown in Figure 2.
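To make the annotation convention concrete, the following minimal Python sketch (a hypothetical helper, not the authors' MATLAB code) converts one image's labels from the (x_i, y_i, w_i, h_i) top-left/width/height format described above into corner format and recovers the ear count M:

```python
import numpy as np

# Hypothetical helper (not the authors' MATLAB code): convert one image's labels
# from the (x_i, y_i, w_i, h_i) top-left/width/height convention into corner
# format [x1, y1, x2, y2]; M, the oat ear count, is just the number of rows.
def labels_to_corners(boxes_xywh):
    x, y, w, h = boxes_xywh.T
    corners = np.stack([x, y, x + w, y + h], axis=1)
    return corners, len(corners)

boxes = np.array([[120.0, 80.0, 60.0, 140.0],
                  [300.0, 50.0, 55.0, 150.0]])
corners, M = labels_to_corners(boxes)
print(M)        # 2 labeled boxes -> 2 oat ears in the image
print(corners)  # [[120.  80. 180. 220.] [300.  50. 355. 200.]]
```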

2.2. Faster R-CNN Algorithm and Its Composition Framework

Faster R-CNN is a two-stage target detection algorithm obtained by further optimizing Fast R-CNN and is one of the typical representatives of target detection algorithms [16]. The feature extraction network, region of interest pooling (ROI pooling), region proposal network (RPN), and classification and regression head are the core modules of the Faster R-CNN algorithm [17]. Figure 3 shows the architecture of Faster R-CNN and the basic principle of its operation.
Faster R-CNN combines the two-stage detection method with the RPN and achieves high-precision detection through an efficient region proposal generation mechanism and accurate classification and localization [18]. Compared with single-stage detection networks, Faster R-CNN performs better on multi-scale problems, especially in detecting targets of different sizes, which is crucial for diverse practical scenarios. Faster R-CNN also performs well on a wide range of standard datasets and has strong transferability, allowing it to be simply fine-tuned to new tasks or datasets [19]. This versatility and robustness make it a popular model for research and practical applications.
Although Faster R-CNN achieves significant performance improvements, it still has some limitations and shortcomings: (1) Faster R-CNN usually performs worse on very small targets than on conventional ones, because the features of very small targets may be lost in the deep network, reducing detection accuracy; (2) although the RPN effectively reduces the number of region proposals, it still generates irrelevant proposals, which affects detection speed and final accuracy; (3) ROI pooling introduces a region mismatch problem through its two quantization steps [20]; (4) the performance of Faster R-CNN depends to some extent on the sizes and aspect ratios of its anchor boxes, and improper anchor box settings degrade detection performance; and (5) when the RPN generates proposals, NMS is used to eliminate overlapping prediction boxes, which is unfriendly to occluded targets and can cause the proposals of two mutually occluding targets to be incorrectly filtered out one by one.
To address the information loss that may occur during convolution in Faster R-CNN and the limitations of mechanisms such as the RPN, ROI pooling, and NMS, researchers have carried out much targeted work to improve its detection performance. Miao Ru et al. [21] introduced the Swin Transformer to replace the original ResNet50 to enhance feature extraction for remote sensing images. Hu Zhaohua et al. [22] introduced an asymmetric modulation mechanism to enhance the network's ability to detect targets of different sizes distributed in remote sensing images. Zhang Y et al. [23] combined mixed data augmentation with the Soft-NMS algorithm as the prediction box screening method, effectively improving the recognition accuracy of overlapping strawberries.

2.3. Optimization of the Faster R-CNN Algorithm

In this study, Faster R-CNN was selected as the basic model for oat ear detection and counting, and the feature extraction network, anchor box configuration, ROI pooling, and prediction box screening method were optimized according to the characteristics of oat ears.

2.3.1. Parallel Convolutional Neural Network

The classical feature extraction networks commonly used in Faster R-CNN include AlexNet, Vgg16, GoogLeNet, ResNet50, and so on. A single feature extraction network has been widely adopted for its excellent performance and has achieved significant results in several application areas [24]. However, with the increasing complexity of network models, their timeliness in applications is challenged. To shorten training and testing time while preserving the efficiency of deep neural networks, researchers have begun designing parallel deep learning systems on parallel computing hardware of different architectures, using suitable software interfaces and optimizing their performance; this has become a research hotspot at home and abroad [25]. Deng Xueyang et al. [26] designed a lightweight parallel convolutional neural network and applied it to the detection of large-scale abnormal ship communication data; using the softmax classification function to characterize the abnormal data, they successfully realized its detection. To achieve rapid detection and diagnosis of bearing faults, He Feixiang et al. [27] proposed an intelligent diagnosis method based on a two-dimensional parallel residual network with a channel attention mechanism. By combining the channel attention mechanism with a conventional convolutional neural network, different channels are adaptively weighted, greatly improving the network's perception of faulty bearings. Zhao Xin et al. [28] applied a medical image registration model based on parallel CNN and Transformer processing to medical image analysis, preserving image topology before and after registration while reducing model parameters. This paper uses ResNet50 as the main convolutional path and designs an auxiliary convolutional path to exchange oat ear detection information with it. The structure of the ResNet50 and auxiliary parallel convolutional neural network is shown in Figure 4 and sketched below.
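The following PyTorch sketch illustrates the parallel idea: a ResNet50 main path paired with an auxiliary path of small 3 × 3 (partly dilated) convolutions, fused stage by stage. The channel widths, the fusion-by-addition scheme, and the AuxBlock layout are simplifying assumptions for illustration, not the exact design of Figure 4:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class AuxBlock(nn.Module):
    """Auxiliary-path (AConv) block built from small 3x3 convolutions; the
    stride-2 variants use dilation 2 in place of a pooling layer."""
    def __init__(self, c_in, c_out, stride=1, dilation=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=stride,
                      padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class ParallelBackbone(nn.Module):
    """ResNet50 main path (RConv) plus an auxiliary path (AConv), fused stage-wise."""
    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)   # Part 01
        self.main_stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        chans = [256, 512, 1024, 2048]
        self.aux_stages = nn.ModuleList([
            AuxBlock(64, chans[0]),                                    # Part 02: 56x56
            AuxBlock(chans[0], chans[1], stride=2, dilation=2),        # Part 03: 28x28
            AuxBlock(chans[1], chans[2], stride=2, dilation=2),        # Part 04: 14x14
            AuxBlock(chans[2], chans[3], stride=2, dilation=2),        # Part 05: 7x7
        ])

    def forward(self, x):
        x = self.stem(x)                     # 224x224 input -> 64 channels @ 56x56
        m, a, feats = x, x, []
        for main, aux in zip(self.main_stages, self.aux_stages):
            m, a = main(m), aux(a)
            m = m + a                        # exchange information between the paths
            feats.append(m)
        return feats                         # C2..C5 feature maps for the FPN

feats = ParallelBackbone()(torch.randn(1, 3, 224, 224))
print([tuple(f.shape) for f in feats])
# [(1, 256, 56, 56), (1, 512, 28, 28), (1, 1024, 14, 14), (1, 2048, 7, 7)]
```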
The ResNet network effectively mitigates gradient vanishing and model degradation by introducing residual learning and skip connections, improving convergence speed and reducing the risk of overfitting [29]. In Figure 4, RConv serves as the main convolutional route, preserving the basic architecture of ResNet50. Part 00 denotes the input stage for the 224 × 224-pixel image. Part 01 of ResNet50 contains only the RConv1 stage, which uses a 7 × 7 convolution kernel with a stride of 2 to reduce the feature map to 112 × 112; a 3 × 3 max-pooling layer then halves the feature map again to 56 × 56. Parts 02 to 05 of ResNet50 contain four large blocks, each internally composed of several corresponding small blocks, presented in Figure 4 as RConvx-x.
The other convolutional route reduces the number of parameters by using consecutive small 3 × 3 convolution kernels instead of larger ones. As shown in Figure 4, AConv, as a branch, mainly participates in the downsampling process in which the feature map size is reduced from 56 × 56 to 7 × 7. To realize a one-to-one correspondence with the small blocks of ResNet50, this branch is designed with three small blocks in Part 02, four in Part 03, six in Part 04, and three in Part 05. At the beginning of each large block from Part 03 to Part 05, a 3 × 3 dilated convolution layer with a dilation rate of 2 is added in place of the regular pooling layer, expanding the receptive field while downsampling. The receptive field of a convolutional layer is calculated as shown in Formula (1):
$$r_n = r_{n-1} + (k_n - 1)\prod_{i=1}^{n-1} s_i$$
In the formula, r_n is the receptive field of the n-th layer, r_{n-1} is the receptive field of the previous layer, k_n is the convolution kernel size of the n-th layer, and s_i is the stride of the i-th layer.
For dilated convolutions, the actual size of the convolution kernel is calculated as shown in Formula (2):
$$k' = k + (k - 1)(d - 1)$$
In the formula, k′ is the effective convolution kernel size after dilation, k is the original kernel size, and d is the dilation rate.
According to the above formulas, dilated convolution expands the receptive field more effectively than ordinary convolution kernels used for downsampling. A pooling layer can also expand the receptive field, but at the cost of spatial resolution. In contrast, dilated convolution expands the receptive field without losing resolution and keeps the relative spatial positions of pixels unchanged.
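The two formulas can be checked numerically. The short Python sketch below (an illustrative two-layer stack, not the network's actual configuration) applies Formula (2) to obtain the effective kernel size and Formula (1) to accumulate the receptive field, confirming that a dilated downsampling layer sees a larger region than an ordinary one:

```python
# Numeric check of Formulas (1) and (2): receptive-field growth with ordinary
# vs. dilated 3x3 convolutions (illustrative layer stack only).
def effective_kernel(k, d):
    return k + (k - 1) * (d - 1)                  # Formula (2)

def receptive_field(layers):
    """layers: (kernel, stride, dilation) tuples, ordered from the input."""
    r, jump = 1, 1                                # jump = product of earlier strides
    for k, s, d in layers:
        r += (effective_kernel(k, d) - 1) * jump  # Formula (1)
        jump *= s
    return r

plain   = [(3, 2, 1), (3, 1, 1)]                  # ordinary stride-2 downsampling
dilated = [(3, 2, 2), (3, 1, 1)]                  # dilated (d = 2) downsampling
print(receptive_field(plain), receptive_field(dilated))  # 7 9 -> dilation sees more
```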
To improve the detection accuracy of the improved model for oat ears of different sizes, the FPN is introduced into the ResNet50 and auxiliary parallel convolutional neural network, as shown in Figure 5. The structure consists of a bottom-up pathway, a top-down pathway, and lateral connections between them. The bottom-up pathway is the ResNet50 and auxiliary parallel backbone described above. The top-down pathway starts from the highest-level feature map, performs 2× upsampling using simple nearest-neighbor interpolation, and fuses the result layer by layer with the corresponding feature map below.
By fusing the shallow, middle, and deep features of oat ear images, information exchange among the different oat ear convolution layers is realized. In practical applications, this mechanism is important for scenes where target sizes differ because of changes in shooting distance or the shape characteristics of the targets themselves. FPN has been successfully integrated into numerous target detection models, such as YOLO and Faster R-CNN, significantly enhancing their detection accuracy and robustness [30].
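A minimal PyTorch sketch of this top-down fusion is given below; the 256-channel lateral width is a common default and an assumption here, as the exact widths of Figure 5 are not stated:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal FPN top-down sketch: 1x1 lateral convs unify channel widths, deeper
# maps are upsampled 2x by nearest-neighbor interpolation and fused by addition.
class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                     # feats: C2..C5, shallow to deep
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], scale_factor=2, mode="nearest")
        return [s(l) for s, l in zip(self.smooth, laterals)]  # P2..P5

sizes = [(1, 256, 56, 56), (1, 512, 28, 28), (1, 1024, 14, 14), (1, 2048, 7, 7)]
outs = SimpleFPN()([torch.randn(s) for s in sizes])
print([tuple(o.shape) for o in outs])  # all 256-channel, at 56/28/14/7 resolution
```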

2.3.2. Optimization of Anchor Box in Size and Quantity

The anchor box mechanism is the core component of the RPN; it defines rectangular regions, and appropriate anchor box settings reduce the computational burden of the network and improve detection efficiency [31]. In the default configuration of Faster R-CNN, nine anchor boxes are generated from three area scales (128 × 128, 256 × 256, 512 × 512) and three aspect ratios (1:1, 1:2, 2:1) [32]. These original area scales and aspect ratios were designed for detecting the 20 object classes of the PASCAL VOC dataset, with good results [33].
The anchor box and the labeled box complement each other: by matching against the labeled boxes, computing the loss, and optimizing, the anchor boxes continuously learn to fit the actual targets more accurately. To visually observe the distribution of labeled boxes in the dataset, we plot all of them as the red dots in Figure 6; the horizontal axis indicates the area of the labeled boxes, and the vertical axis their width-to-height ratio. The nine default anchor boxes of Faster R-CNN cover a certain range, shown as the green box in Figure 6. However, many points with smaller areas fall outside this coverage, and within it there is an obvious mismatch between the sizes and shapes of some anchor boxes and the actual labeled boxes. These problems reduce the localization precision and recall of the detection algorithm and waste computational resources. Therefore, the default anchor box configuration is not suitable for the dataset used in this study.
The number of anchor boxes is a key training hyperparameter. One measure of anchor box quality is the mean intersection over union (IoU); an IoU of 0.5 means the anchor boxes overlap well with the labeled boxes in the training data [34]. The IoU increases with the number of anchor boxes, but too many anchor boxes increase the computational cost of the network and may lead to overfitting, greatly affecting the running speed of the model [35]. To balance the number of anchor boxes against IoU on our dataset, we plot their relationship in Figure 7. With two anchor boxes, the IoU is 0.68, above the default quality threshold of 0.5. As the number of anchor boxes increases from 2 to 5, the IoU rises significantly; beyond 5, the increase begins to level off.
Based on these two observations, this paper optimizes the size and number of anchor boxes. The original 3 area scales were adjusted to 5, with sizes of 100 × 80, 160 × 120, 256 × 100, 300 × 130, and 256 × 160, and 15 anchor boxes were finally generated using the ratios 1:1, 1:2, and 2:1. The coverage of the optimized anchor boxes is shown as the blue box in Figure 8. Compared with the default configuration, the total area covered by the optimized anchor boxes is significantly smaller, while the number of covered red dots (representing labeled boxes) increases markedly, especially for smaller labeled boxes, so coverage is much improved. In actual detection, the increased number of anchor boxes effectively reduces the omission of small oat ears but also produces a large number of redundant prediction boxes around individual oat ears, as shown in Figure 9.
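The mean-IoU criterion behind Figures 6 and 7 can be sketched as follows: each labeled box is matched to its best anchor by shape IoU (widths and heights compared with centers aligned). The anchor sizes below are the five optimized scales from the paper; the label array is dummy data for illustration only:

```python
import numpy as np

# Sketch of the anchor-quality measure: best shape-IoU match per labeled box.
anchors = np.array([[100, 80], [160, 120], [256, 100], [300, 130], [256, 160]])

def shape_iou(wh_a, wh_b):
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

labels = np.array([[110, 90], [90, 150], [260, 110]])   # dummy (w, h) pairs
mean_iou = np.mean([max(shape_iou(l, a) for a in anchors) for l in labels])
print(f"mean best-match IoU: {mean_iou:.2f}")
# The paper reports a mean IoU of 0.68 with two anchors, rising as more are added.
```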

2.3.3. ROI Align Regional Feature Aggregation

The conventional Faster R-CNN uses ROI pooling, whose workflow has two stages: first, the boundary of the prediction box is quantized to integer coordinates; second, the quantized region is evenly divided into k × k units, and each unit boundary is quantized again. These two quantizations cause region mismatch and information loss after pooling; that is, the position of the pooled region deviates from the originally regressed position, and this deviation affects the accuracy of subsequent operations [36].
To avoid these problems, ROI Align is used in this paper. ROI Align uses bilinear interpolation instead of the two quantization steps, making the whole feature aggregation continuous, so that image values at floating-point pixel coordinates can be obtained. When processing a specific image, ROI Align does not simply round and aggregate the coordinate points on the boundary of the prediction region, as shown in Figure 10. Its operation can be summarized in three steps: first, each prediction region is traversed without quantizing its floating-point boundary; second, the prediction region is divided into k × k units whose boundaries are also not quantized; finally, four fixed sampling positions are identified inside each unit, their values are computed by bilinear interpolation, and max pooling is then applied to extract features.
In the figure, the small grid formed by the dashed lines corresponds to the feature map, while the grid formed by the solid lines represents a variable-size ROI. With four sample points per unit, the points are the centers of the four equal sub-squares of each unit. Because the sample point coordinates are floating-point values, bilinear interpolation is needed to determine their pixel values. This method effectively avoids the quantization mismatch of ROI pooling and enhances the extraction of small features.
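In practice this operation is available as a library call; the sketch below uses torchvision's roi_align, where sampling_ratio=2 places a 2 × 2 grid of bilinearly interpolated sample points in each output bin, matching the four-sample-point scheme described above (the spatial_scale and ROI coordinates are illustrative):

```python
import torch
from torchvision.ops import roi_align

# Library-based sketch of ROI Align; no coordinate quantization occurs.
feature_map = torch.randn(1, 256, 56, 56)
# One ROI as (batch_index, x1, y1, x2, y2) in input-image coordinates:
rois = torch.tensor([[0.0, 48.3, 52.7, 190.1, 205.6]])
pooled = roi_align(feature_map, rois, output_size=(7, 7),
                   spatial_scale=1.0 / 4,   # feature map is 1/4 the input size here
                   sampling_ratio=2, aligned=True)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])
```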

2.3.4. Progressive-NMS

The NMS algorithm is an important processing step in many target detection algorithms. Its core idea is to search for local maxima and suppress non-maxima. During detection, a large number of unwanted prediction boxes may be generated for the same object, and each prediction box is assigned a confidence score according to how well it matches the target [37]. First, all prediction boxes are sorted by confidence from high to low; then, for each high-scoring box, its overlap with the retained boxes is checked. If the overlap exceeds a threshold, the box is suppressed; otherwise it is retained as a final output. The IoU region is the red region in Figure 11, and its calculation is given in Equation (3).
$$\text{ratio} = \frac{\operatorname{area}(A \cap B)}{\operatorname{area}(A \cup B)}$$
The calculation of NMS is shown in Equation (4), where s_i is the score of the i-th prediction box, M is the highest-scoring prediction box, b_i is the i-th prediction box, N_t is the IoU threshold for prediction boxes, and iou(M, b_i) is the IoU of M and b_i.
Despite its effectiveness in target detection, NMS still has shortcomings. As a greedy algorithm, it tends to focus on the currently selected region and may thus miss the globally optimal solution. The anchor box optimization improves the detection efficiency of the model, enabling small oat ears that would otherwise be missed to be identified accurately, but it generates a large number of redundant prediction boxes around the oat ears. When an oat ear is detected with other oat ears nearby, NMS has much more difficulty: it must accurately filter out the true prediction boxes for both the current target and its neighbors from among numerous redundant boxes.
$$s_i = \begin{cases} s_i, & \mathrm{iou}(M, b_i) < N_t \\ 0, & \mathrm{iou}(M, b_i) \ge N_t \end{cases}$$
This paper proposes Progressive-NMS, an improved prediction box screening method based on NMS, to reduce redundant prediction boxes. The method divides prediction box screening into two steps. First, following the standard NMS algorithm, the ratio of the intersection area to the union area of bbox A and bbox B is computed, and a threshold is set to preliminarily suppress prediction boxes globally and reduce their number. Then, the ratio of the intersection area of bbox A and bbox B to the smaller of the two box areas is computed, as shown in Formula (5), and a second threshold is set to suppress boxes more finely within each local region, yielding the final prediction boxes.
$$\text{ratio} = \frac{\operatorname{area}(A \cap B)}{\min(\operatorname{area}(A), \operatorname{area}(B))}$$
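A compact Python sketch of this two-step screening is given below. The thresholds t1 and t2 are illustrative placeholders, as their exact values are not listed in this section; step 1 suppresses globally by IoU (Equations (3) and (4)), and step 2 suppresses locally by the intersection-over-minimum-area ratio of Formula (5):

```python
import numpy as np

def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def inter_area(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def suppress(boxes, scores, ratio_fn, thresh):
    """Greedy suppression: keep the best box, drop neighbors above the threshold."""
    order, keep = np.argsort(scores)[::-1], []
    while len(order):
        i, order = order[0], order[1:]
        keep.append(i)
        order = np.array([j for j in order
                          if ratio_fn(boxes[i], boxes[j]) < thresh], dtype=int)
    return keep

def progressive_nms(boxes, scores, t1=0.5, t2=0.7):
    iou = lambda a, b: inter_area(a, b) / (box_area(a) + box_area(b) - inter_area(a, b))
    iom = lambda a, b: inter_area(a, b) / min(box_area(a), box_area(b))  # Formula (5)
    keep1 = suppress(boxes, scores, iou, t1)    # step 1: global suppression (IoU)
    b1, s1 = boxes[keep1], scores[keep1]
    keep2 = suppress(b1, s1, iom, t2)           # step 2: finer local suppression
    return b1[keep2], s1[keep2]

boxes = np.array([[10, 10, 60, 120], [14, 12, 64, 118], [15, 15, 40, 80]], float)
scores = np.array([0.95, 0.90, 0.60])
# Box 2 is removed by IoU in step 1; box 3, nested inside box 1, survives step 1
# (low IoU) but is removed by the intersection-over-min-area test in step 2.
print(progressive_nms(boxes, scores))
```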

2.4. Test Environments

The operating system used in this paper is Windows 10 (64-bit); the CPU is a 13th-generation Intel(R) Core(TM) i7-13700H; the RAM capacity is 32 GB; the GPU is an NVIDIA GeForce RTX 4070 Laptop GPU; the design and runtime environment of the oat ear detection model is MATLAB 2024a; the CUDA version is 12.2. Other parameter settings are shown in Table 1.

2.5. Model Evaluation Methodology

In this paper, frame rate and mean average precision are selected as the evaluation metrics of the detection model. The frame rate is the number of images processed per second; the higher the frame rate, the more image information is processed per unit time [38]. The mean average precision measures the average performance of a model across multiple categories and is the mean of the average precisions of the detection system [39]; the formulas are derived step by step as follows:
$$P_r = \frac{T_p}{T_p + F_p}$$
$$R_e = \frac{T_p}{T_p + F_n}$$
$$M = \frac{\sum_{i=1}^{C} A_p}{C} = \frac{\sum_{i=1}^{C} \sum_{j=1}^{n} P(j)\,\Delta R(j)}{C}$$
where P_r is the precision of the detection algorithm, R_e is its recall, A_p is its average precision, M is the mean of the average precisions, T_p is the number of true positives (correctly detected targets), F_p is the number of false positives (detections with no matching target), F_n is the number of false negatives (missed targets), and C is the total number of categories.
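A toy numerical check of these formulas (dummy counts; with a single category, M equals A_p) might look like this:

```python
import numpy as np

def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    """A_p as the sum of P(j) * delta-R(j) over recall-sorted PR points."""
    order = np.argsort(recalls)
    p, r = np.asarray(precisions)[order], np.asarray(recalls)[order]
    return p[0] * r[0] + float(np.sum(p[1:] * np.diff(r)))

print(precision_recall(tp=80, fp=10, fn=20))                # (0.888..., 0.8)
print(average_precision([1.0, 0.9, 0.8], [0.3, 0.6, 0.8]))  # 0.73
```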

3. Results and Discussion

3.1. Comparison of Detection Performance of Various Feature Extraction Networks Under Different Anchor Boxes

Based on the same oat ear dataset and the model evaluation method above, the Faster R-CNN algorithm with different feature extraction networks, anchor box configurations, and prediction box screening methods was compared and analyzed. The results are shown in Table 2.
According to the data in Table 2, under the conventional anchor sizes (128 × 128, 256 × 256, 512 × 512), the Faster R-CNN algorithm with AlexNet as the feature extraction network performs worst in mAP, at only 27.13%. With ResNet50 as the feature extraction network, Faster R-CNN reaches the highest mAP of 77.79%, which is 50.66% higher than the lowest value. Vgg16 is second only to ResNet50 in mAP. In detection speed, Faster R-CNN with GoogLeNet is the slowest at 9.33 fps, while Faster R-CNN with AlexNet is the fastest at 20.58 fps, 11.25 fps faster than the slowest; ResNet50's detection speed is second only to GoogLeNet's. With the ResNet50 and auxiliary parallel convolutional neural network as the feature extraction network, mAP increases by 2.28% over the single ResNet50 model, the best performer in mAP. In detection speed, however, the ResNet50 and auxiliary parallel convolutional neural network is 7.89 fps slower than the fastest network, AlexNet.
After the anchor boxes are optimized (100 × 80, 160 × 120, 256 × 100, 300 × 130, 256 × 160), the mAP of the ResNet50 and auxiliary parallel convolutional neural network increases by 2.89%. After further introducing Progressive-NMS, the mAP increases by another 1.45%, reaching a final value of 84.41%. This is 4.34% higher than the initial ResNet50 and auxiliary parallel convolutional neural network, and 6.62% higher than ResNet50, the best of the four common feature extraction networks. However, the detection speed of the Faster R-CNN algorithm with the ResNet50 and auxiliary parallel convolutional neural network continues to decline, from 12.69 fps to 12.14 fps, which is 3.17 fps slower than using ResNet50 alone.
The parallel convolutional neural network, the anchor box optimization, and the NMS improvement effectively increase the accuracy of the oat ear detection model. However, the detection speed keeps declining, probably because the number of parameters of each optimized component increases, adding to the model's computational burden. Finding a balance between detection speed and accuracy will be the focus of our follow-up experiments.
To show the detection performance of the improved oat ear detection model intuitively, we selected the two of the four common feature extraction networks in Table 2 with the higher mAP values, Vgg16 and ResNet50, and compared them with the ResNet50 and auxiliary parallel convolutional neural network. We plot the Precision-Recall (PR) curves of these three models together in Figure 12 to facilitate comparison. The PR curve evaluates classification performance by showing precision at different recall rates; the larger the area under the PR curve, the better the model [40]. It can be seen that the improved Faster R-CNN is superior to the traditional Faster R-CNN using the Vgg16 or ResNet50 network alone in oat ear detection.

3.2. Comparison of the Detection Effects of Faster R-CNN on Oat Ears in the Natural Environment Before and After Improvement

Under natural environmental conditions, factors such as light intensity, wind, and the actual growth status of oats and surrounding weeds may interfere with the oat ear detection model. To verify the effectiveness of the proposed improvements for oat ear recognition in the natural environment, this study further evaluates the three best-performing oat ear detection models before and after improvement, as shown in Figure 13.
As Figure 13 shows, when oat ears overlap, models 1 and 2 are strongly disturbed and have difficulty distinguishing whether the identified ears belong to the same plant. The original image on the left shows that overlapping oat ears differ in shape from single ears and are larger. After anchor box optimization, model 3 detects oat ears of different sizes better. Strong wind or uneven sowing can cause the ears of different oat plants to overlap.
When many weeds of similar color surround the oats, all three detection models suffer strong interference. Specifically, when the colors of weeds and oat ears are similar, the contrast between background and target is reduced, causing false and missed detections. Although weeding is a necessary part of crop management, strong interference sources such as weeds mean that the quality of the dataset and the detection performance of the model still need further improvement in subsequent experiments to reduce such interference.
Under different light intensities, model 1 detects oat ears worse than models 2 and 3. Under strong light, model 1 makes more false detections, mainly misjudging single oat ears as overlapping ones. Comparison with the original image on the left shows that under strong light, the color of mature oat ears is similar to that of the nearby branches and leaves, which strongly interferes with the detection model. Under weak light, model 1 misses more detections, mainly of yellowed oat ears; these ears are usually lower on the plant or surrounded by dense branches and leaves, and their color together with the dark surroundings makes them hard for the detection model to identify. In contrast, the feature extraction networks of models 2 and 3 capture features better when dealing with complex targets composed of oat ears; this advantage is reflected in the comparison between Table 2 and Figure 12.
In summary, model 1 detects oat ears poorly, model 3 detects them well, and model 2 lies in between. The detection performance of the different models still depends on their ability to extract the shape and color features of oat ears in the face of ear overlap, complex surroundings, and changing light intensity; this is also why the optimized model 3 achieves better results. For the common problem that weed occlusion prevents effective identification, besides continuing to optimize the detection model, the quality of the dataset must also be further improved.
Combined with the Progressive-NMS algorithm, the model's detection results improve significantly over standard NMS. Progressive-NMS filters the prediction boxes incrementally and avoids the one-size-fits-all effect of a single threshold. Figure 14 shows the results of the two NMS variants on the same scene at their respective best thresholds. Traditional NMS, using a single threshold, can remove redundant prediction boxes in one pass but also produces more missed detections. Progressive-NMS processes the redundant boxes gradually, effectively reducing missed detections, although a few redundant boxes remain. Overall, the detection performance of Progressive-NMS is better than that of traditional NMS, which also verifies the feasibility of the Progressive-NMS proposed in this paper.
Based on the mAP, detection speed, PR curves, and actual detection results analyzed above, the improved Faster R-CNN model performs better. However, compared with the traditional model, it requires a longer processing time, and there is still room for improvement in large areas of weeds with colors similar to oat ears.

3.3. Oat Ears Counting Test

In this paper, a counting test for oat ears was designed using the three models with the best detection performance in the natural environment, which also ranked top three in mAP. Given the current lack of research on oat ear target detection, two models from the literature that were originally used for wheat ear detection were successfully reproduced and added to the counting experiment for comparison (the reproduced models were trained with Python 3.8.5 and PyTorch 1.8.1; the remaining hardware configuration was the same as in this paper). Based on the row spacing of the oats and their actual growth status, and to avoid including surrounding oat plants, we set 0.5 m × 0.5 m as the unit area and used boxes of this size to define several test areas. To ensure the objectivity of manual counting, the five professionals mentioned above counted manually, and the mean value was taken. The counting results are shown in Table 3.
In the oat ear counting experiment, the detection model using Vgg16 as the feature extraction network is slightly better in practice than the one using ResNet50, although it trails ResNet50 on some performance indicators. Compared with the traditional Faster R-CNN with Vgg16 or ResNet50 as the feature extraction network, the improved oat ear detection model performs better in the counting test, with accuracy 13.60% and 15.20% higher than the original models, respectively. Therefore, the optimization scheme of the oat ear detection model in this manuscript is reasonable and feasible.
Notably, among the oat ear detection models based on Faster R-CNN, the model in this manuscript counts best: its counting accuracy is 7.20% higher than that of the Faster R-CNN model previously used for wheat ear detection. Comparing that wheat ear Faster R-CNN model with the traditional Faster R-CNN detection model, it also shows a significant performance improvement, the key being that it likewise optimizes the RPN of Faster R-CNN. This comparison shows that when using Faster R-CNN for target detection, the original configuration is not necessarily suitable for the task at hand; its components, especially the key ones, need to be optimized, which is why many detection models achieve better accuracy this way.
Compared with the improved single-stage detector YOLOv5 originally used for wheat ear detection, the accuracy of the improved oat ear detection model in this manuscript is 0.80% lower. With CBAM, the improved YOLOv5 has higher detection accuracy and detects all oat ears better in some test areas. The attention mechanism enables the model to better weigh the importance of different pieces of local information in the image, simplifying the model and improving efficiency. For a two-stage detection model, after optimizing its own components for the actual dataset or task environment, it is worth trying to add external modules that can further improve detection performance. This is a direction in which we will continue to optimize the oat ear detection model in subsequent experiments.

4. Conclusions

In this paper, targeted improvements to Faster R-CNN effectively realize the accurate detection and counting of oat ears in the natural environment. In detection performance, the improved oat ear detection model significantly surpasses the traditional Faster R-CNN models with Vgg16 and ResNet50 as feature extraction networks in mAP, with improvements of 13.01% and 6.62%, respectively. In the actual oat ear counting task, its accuracy is also 13.60% and 15.20% higher than the two traditional models, respectively. However, the complexity of the improved model increases, raising computational cost and slowing inference. In subsequent experiments, we will improve the oat ear detection model along two core dimensions: first, optimization of the detection model itself; second, expansion and adaptation to future application scenarios.
Regarding model improvement, we will consider the following aspects: (1) expanding the dataset: improve the original dataset by adding more oat ear images and introducing different oat varieties to enhance the universality of the detection model; (2) introducing an attention mechanism: on the basis of model stability, adding an appropriate attention mechanism will be considered, since attention mechanisms have been shown in many studies to improve detection performance; (3) optimizing computation and detection speed: although the improved model is more accurate than other conventional models, it is slower, so we will consider streamlining the components of the ResNet50 and auxiliary parallel convolutional neural network or other parts of Faster R-CNN to reduce computation and improve speed; (4) assessing the influence of hardware performance on the detection algorithm: the computer configuration used in the experiment references the fairly conventional configurations mounted on agricultural machinery and equipment, so we need to determine whether hardware performance plays a key role in the detection speed of the algorithm.
For future applications, we mainly anticipate the following: (1) research and development of an automatic oat harvester: for wheat, barley, and other crops, intelligent agricultural machinery relying on target detection, especially automatic harvesters, has been widely developed; treating the oat ear detection output as a quantity whose change drives control, an automatic oat harvester that independently adjusts its operating parameters could be developed; (2) development of a freely installable app: the oat ear detection model could be developed into a freely installable app enabling remote collection of oat ear field information; (3) selection of superior oat ears: in the post-harvest industrial chain, superior oat ears could be selected on the basis of detection so as to process products of corresponding grades.

Author Contributions

Conceptualization, C.T. and X.Z.; data curation, D.Z.; formal analysis, J.W. and X.Z.; investigation, D.Z. and Y.L.; methodology, C.T. and J.W.; project administration, J.W. and D.Z.; software, J.W.; supervision, Y.L.; validation, X.Z.; visualization, Y.L.; writing—original draft, C.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Central Guidance for Local Science and Technology Development Fund Project, "Transformation and Demonstration Application of Wheat and Sorghum Crop Harvesting Machinery Achievements" (YDZJSX20231C009); the Shanxi Agricultural University Academic Recovery Project, "Research and Development of Grain Crop Mechanization Technology and Key Equipment" (2023XSHF2); and the Shanxi Basic Research Project, "Research on Buckwheat Threshing Separation Mechanism and Intelligent Low-Loss Threshing Device" (202403021211050).

Data Availability Statement

The data resources of this experiment are available from the authors. We are willing to share the oat ear image set used in the experiment.

Acknowledgments

We thank the three projects that provided financial support for this study and express our sincere respect to all who participated in the experiments and in the discussion of the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, Q. Analysis of the Current Situation of Global and Chinese Oat Industry in 2022. 2024. Available online: https://www.huaon.com/channel/trend/892251.html (accessed on 22 October 2024).
  2. Ling, H.B.; Zhou, X.C.; Zhang, J.Y.; He, F.G. The detection method of wheat ear in natural environment based on improved Faster R-CNN. J. Chifeng Univ. (Nat. Sci. Ed.) 2021, 37, 17–21. (In Chinese)
  3. Xu, B.; Tong, M. Wheat ear detection and recognition based on improved Faster R-CNN. J. Xiangtan Univ. 2022, 44, 48–59. (In Chinese)
  4. Li, L.; Hassan, M.A.; Yang, S.; Jing, F.; Yang, M.; Rasheed, A.; Wang, Y.; Xia, X.; He, Z.; Xiao, Y. Development of image-based wheat spike counter through a Faster R-CNN algorithm and application for genetic studies. Crop J. 2022, 10, 1303–1311.
  5. Wu, W.; Yang, T.L.; Li, R.; Chen, C.; Liu, T.; Zhou, K.; Guo, W.S. Detection and enumeration of wheat grains based on a deep learning method under various scenarios and scales. J. Integr. Agric. 2020, 19, 1998–2008.
  6. Huang, S.; Zhou, Y.; Wang, Q.; Zhang, H.; Qiu, C.; Kang, K.; Luo, B. Improved YOLOv5 for measuring spike number per unit area of wheat in field. Acta Agric. Eng. 2022, 38, 235–242.
  7. Lu, Q.K.; Zhang, J.L. Method of wheat ear detection based on anchorless YOLO detection network. J. Shandong Agric. Univ. (Nat. Sci. Ed.) 2022, 53, 796–802.
  8. Lu, Z.; Zhang, J.; Han, B.; Li, Y. Wheat spike detection method based on improved YOLO v7-tiny. Jiangsu Agric. Sci. 2024, 1002–1302. (In Chinese)
  9. Jing, F.; Wang, C.; Li, J.; Yang, C.; Liu, H.; Chen, Y. A Dual Detection Head YOLO Model with Its Application in Wheat Ear Recognition. Int. J. Cogn. Inform. Nat. Intell. 2024, 18, 1–17.
  10. Chen, W.; Lusi, A.; Gao, Q.; Bian, S.; Li, B.; Guo, J.; Zhang, D.; Hu, W.; Yang, C.; Huang, F. CB-YOLO: Dense Object Detection of YOLO for Crowded Wheat Head Identification and Localization. J. Circ. Syst. Comput. 2024, prepublish.
  11. Wang, P.X.; Du, J.L.; Zhang, Y.; Liu, J.M.; Li, H.M.; Wang, C.M. Estimation of winter wheat yield based on remote sensing multi-parameters and CNN-Transformer. J. Agric. Mach. 2024, 55, 173–182. (In Chinese)
  12. Yu, J.W.; Chen, W.W.; Guo, Y.S.; Mu, Y.S.; Fan, C. Detection and counting model of rotating frame wheat ear based on improved Oriented R-CNN. Acta Agric. Eng. Sci. 2024, 40, 248–257.
  13. Bao, W.; Su, B.; Hu, G.; Huang, C.; Liang, D. Wheat ear counting method of UAV wheat image based on FE-P2Pnet. J. Agric. Mach. 2024, 55, 155–164+289. (In Chinese)
  14. Zhou, Z.L.; Liu, J.M.; Cao, D.; Liu, B.L.; Wang, D.X.; Zhang, H.G. Comparison of grass yield, agronomic traits, and forage quality among different oat varieties. Crop J. 2024, 1, 132–140. (In Chinese)
  15. Jing, F.; Liu, Y.M.; Ren, S.L.; Bian, F.; Chen, F.; Zhang, C.J. Effects of variety and planting density on yield, quality and disease of oat forage. Grassland J. 2023, 31, 3174–3184. (In Chinese)
  16. Sharma, A.; Kumar, V.; Longchamps, L. Comparative performance of YOLOv8, YOLOv9, YOLOv10, YOLOv11 and Faster R-CNN models for detection of multiple weed species. Smart Agric. Technol. 2024, 9, 100648.
  17. Li, Z.; Li, Y.; Yang, Y.; Guo, R.; Yang, J.; Yue, J.; Wang, Y. A high-precision detection method of hydroponic lettuce seedlings status based on improved Faster R-CNN. Comput. Electron. Agric. 2021, 182, 106054.
  18. Li, J.H.; Lin, L.J.; Tian, K. Improved Faster R-CNN for field detection of bitter melon leaf diseases. J. Agric. Eng. 2020, 36, 179–185. (In Chinese)
  19. Jiang, S.; Cao, Y.F.; Liu, Z.Y.; Zhao, S.; Zhang, Z.Y.; Wang, W.X. Identification of tea leaf diseases based on improved Faster R-CNN. J. Huazhong Agric. Univ. 2024, 43, 41–50. (In Chinese)
  20. Yang, Q.; Ma, S.; Guo, D.; Wang, P.; Lin, M.; Hu, Y. A Small Object Detection Method for Oil Leakage Defects in Substations Based on Improved Faster-R-CNN. Sensors 2023, 23, 7390.
  21. Miao, R.; Yi, L.; Ke, Z.; Na, Z.Y.; Ran, C.R.; Gen, M. Research on an Improved Faster R-CNN Multi-Object Detection Model for Remote Sensing Images. Comput. Eng. 2022, 2, 1–14. (In Chinese)
  22. Hu, Z.; Wang, C. Improved Faster R-CNN for Small Object Detection in Remote Sensing Images. Comput. Eng. Sci. 2024, 46, 1063–1071. (In Chinese)
  23. Zhang, Y.; Zhang, L.; Yu, H.; Guo, Z.; Zhang, R.; Zhou, X. Research on the Strawberry Recognition Algorithm Based on Deep Learning. Appl. Sci. 2023, 13, 11298.
  24. Zhou, C.; Ye, L.; Peng, H.; Liu, Z.; Wang, J.; Ramírez-De-Arellano, A. A Parallel Convolutional Network Based on Spiking Neural Systems. Int. J. Neural Syst. 2024, 34, 2450022.
  25. Zhu, H.M.; Li, P.; Jiao, L.C.; Yang, S.Y.; Hou, B. A Review of Parallel Research on Deep Neural Networks. J. Comput. Sci. 2018, 41, 1861–1881. (In Chinese)
  26. Deng, X.; Deng, D.; Su, W. Research on ship communication anomaly data detection based on parallel deep convolutional neural network. Ship Sci. Technol. 2023, 45, 119–122. (In Chinese)
  27. He, F.; Kou, Z.; Chi, Z. Research on Bearing Fault Diagnosis Based on Two-dimensional Parallel Residual Network. S. Agric. Mach. 2024, 55, 151–153. (In Chinese)
  28. Zhao, X.; Li, X.; Xu, J.; Liu, B.; Bi, X. A Medical Image Registration Model Based on Convolutional Neural Network and Transformer Parallel. Comput. Appl. 2022, 44, 3915–3921. (In Chinese)
  29. Zhu, H.; Li, X.; Meng, Y.; Yang, H.; Xu, Z.; Li, Z. Tea sprout detection based on Faster R-CNN network. J. Agric. Mach. 2022, 53, 217–224. (In Chinese)
  30. Hou, J.; Yang, C.; He, Y.; Hou, B. Detecting diseases in apple tree leaves using FPN–ISResNet–Faster R-CNN. Eur. J. Remote Sens. 2023, 56, 2279–7254.
  31. Zhang, P.C.; Hu, Y.J.; Zhang, Y.; Zhang, C.H.; Chen, Z.; Chen, X. Research on Identification of Peach Tree Yellow Leaf Disease in Complex Background Based on Improved Faster R-CNN. Chin. J. Agric. Mach. Chem. 2024, 45, 219–225+251. (In Chinese)
  32. Beigeng, Z.; Rui, S. Enhancing two-stage object detection models via data-driven anchor box optimization in UAV-based maritime SAR. Sci. Rep. 2024, 14, 4765.
  33. Zheng, J.; Zhao, S.; Xu, Z.; Zhang, L.; Liu, J. Anchor boxes adaptive optimization algorithm for maritime object detection in video surveillance. Front. Mar. Sci. 2023, 10, 1290931.
  34. Liu, Y.; He, Y.; Wu, X.; Wang, W.; Zhang, L.; Lv, H. Potato germination and surface damage detection method based on improved Faster R-CNN. J. Agric. Mach. 2024, 55, 371–378. (In Chinese)
  35. Wen, J.; Wang, Y.; Li, J.; Zhang, Y. Faster R-CNN object detection algorithm with multi-head self-attention mechanism. Mod. Electron. Technol. 2024, 47, 8–16. (In Chinese)
  36. Li, J.; Yang, Y. HM-YOLOv5: A fast and accurate network for defect detection of hot-pressed light guide plates. Eng. Appl. Artif. Intell. 2023, 117, 105529.
  37. Li, H.; Jia, H.; Luo, B.; Bao, T. Research on chip surface defect detection method based on improved Faster R-CNN. Laser J. 2022, 45, 48. (In Chinese)
  38. Shuqin, H.; Fule, H.; Liuming, L.; Feng, Q.; Yanzhou, L. Research on weed detection algorithm in sugarcane field based on Faster R-CNN. Chin. J. Agric. Mach. Chem. 2024, 45, 208–215. (In Chinese)
  39. Mao, R.; Zhang, Y.; Wang, Z.; Gao, S.; Zhu, T.; Wang, M.; Hu, X. Identification of wheat stripe rust and yellow dwarf disease using improved Faster R-CNN. J. Agric. Eng. 2022, 38, 176–185. (In Chinese)
  40. Du, Y.Y.; Yang, J.H.; Li, H.; Mao, Y.; Jiang, Y. A Few Samples Object Detection Algorithm Based on Improved Faster R-CNN. Electro Opt. Control 2023, 30, 44–51. (In Chinese)
Figure 1. Sample composition and description.
Figure 2. Oat ear labeling.
Figure 3. The composition and operation principle of the Faster R-CNN algorithm.
Figure 4. ResNet50 and the auxiliary parallel convolutional neural network (RConv1, RConv2-x, and RConv5-x represent the sub-network structure of ResNet50. AConv2-x and AConv5-x represent the sub-network structure of ResNet50's auxiliary convolution. DConv3-1, DConv4-1, and DConv5-1 represent the sub-network structure of the dilated convolution. The ellipsis (…) represents the rest of the convolutional neural network that is not displayed in the figure due to the limitation of the image size. ⊕ represents the element-by-element addition of the preceding feature map and the following feature map).
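As a minimal illustration of the ⊕ operation in Figure 4, the sketch below adds the output of a dilated branch to a plain convolution branch element by element. The layer sizes are illustrative assumptions, not the paper's RConv/AConv/DConv configuration, and PyTorch is assumed as the framework.

```python
import torch
import torch.nn as nn

# Toy parallel branches; the real main and auxiliary sub-networks differ.
main_branch = nn.Conv2d(64, 64, 3, padding=1)
# Dilated convolution with padding chosen so the spatial size is preserved.
aux_branch = nn.Conv2d(64, 64, 3, padding=2, dilation=2)

x = torch.randn(1, 64, 56, 56)
fused = main_branch(x) + aux_branch(x)  # element-by-element addition (⊕)
print(fused.shape)  # torch.Size([1, 64, 56, 56])
```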
Figure 5. Schematic diagram of the fused FPN network (1 × 1 represents the size of the convolution kernel. Upsampling represents the upsampling operation. ⊕ represents the element-by-element addition of the preceding feature map and the following feature map. Concat represents the concatenation operation. P2 to P5 represent the feature maps obtained after feature fusion of conv2-x, conv3-x, conv4-x, and conv5-x in the feature extraction network).
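A minimal PyTorch sketch of the top-down fusion described in the Figure 5 caption (1 × 1 lateral convolutions, upsampling, and element-wise addition). The channel counts are assumptions based on typical ResNet50 stage outputs, not the paper's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNFuse(nn.Module):
    """Top-down FPN pathway: 1x1 lateral convs, upsample, element-wise add."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)

    def forward(self, c2, c3, c4, c5):
        p5 = self.lateral[3](c5)
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        return p2, p3, p4, p5
```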
Figure 6. Distribution of labeled boxes (The red dots represent the labeled boxes in the dataset. The arrows represent the size range of the anchor box under the original configuration).
Figure 7. Number of anchor boxes vs. IoU (The blue squares represent the mean IoU value obtained with different numbers of anchor boxes).
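Curves like Figure 7 are commonly produced by clustering the labeled-box sizes for several values of k and recording the mean best IoU. The sketch below uses k-means with an IoU-based assignment on synthetic (width, height) data; it is an assumption about the general technique, not the paper's exact procedure.

```python
import numpy as np

def wh_iou(box, centers):
    """IoU between one (w, h) box and k (w, h) centers, all anchored at the origin."""
    inter = np.minimum(box[0], centers[:, 0]) * np.minimum(box[1], centers[:, 1])
    union = box[0] * box[1] + centers[:, 0] * centers[:, 1] - inter
    return inter / union

def mean_iou_for_k(boxes, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each labeled box to the center with the highest IoU.
        assign = np.array([np.argmax(wh_iou(b, centers)) for b in boxes])
        for j in range(k):
            if np.any(assign == j):
                centers[j] = boxes[assign == j].mean(axis=0)
    return np.mean([wh_iou(b, centers).max() for b in boxes])

# Synthetic (w, h) sizes; the real labeled-box sizes from the dataset would be used.
boxes = np.abs(np.random.default_rng(1).normal([180, 120], [60, 30], (500, 2)))
for k in range(1, 8):
    print(k, round(mean_iou_for_k(boxes, k), 3))  # mean IoU rises, then plateaus
```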
Figure 8. Coverage effect of anchor boxes of different sizes (The red dots represent the labeled boxes in the dataset. The arrows represent the size range of the anchor box under the optimized configuration).
Figure 9. Comparison of anchor box before and after optimization.
Figure 10. Principle of ROI Align (The black dots represent specific pixels).
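ROI Align samples the feature map at sub-pixel positions (the black dots in Figure 10) by bilinear interpolation instead of quantizing the ROI coordinates as ROI Pooling does. A minimal sketch using torchvision's stock roi_align operator; the coordinates and sizes are illustrative assumptions.

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 50, 50)  # a feature map from the backbone
# One ROI in (batch_idx, x1, y1, x2, y2) format, in feature-map coordinates;
# the fractional coordinates are kept, not rounded.
rois = torch.tensor([[0., 4.3, 7.1, 20.6, 18.9]])
pooled = roi_align(feat, rois, output_size=(7, 7),
                   spatial_scale=1.0, sampling_ratio=2)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])
```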
Figure 11. IoU region definition (The blue box and the green box represent two adjacent prediction boxes, and the red box represents the overlapping area of the two boxes).
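For reference, the IoU used here is the ratio of the overlapping area (the red box in Figure 11) to the union of the two prediction boxes; a minimal worked example:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # the overlap region
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```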
Figure 12. Comparison of PR curves.
Figure 13. The detection effect of different models on oat ears in the natural environment (Detection model 1 uses Vgg16 and NMS. Detection model 2 uses ResNet50 and NMS. Detection model 3 uses parallel convolution and Progressive-NMS. The red boxes mark false detections, and the yellow boxes mark missed detections).
Figure 14. Comparison of different NMS.
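Standard NMS, the baseline in Figure 14, discards any prediction box whose IoU with a higher-scoring box exceeds a fixed threshold. A minimal sketch with torchvision's stock nms operator is below; the Progressive-NMS variant follows the paper's own progressive screening procedure and is not reproduced here.

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 60., 60.],
                      [12., 12., 62., 62.],     # heavy overlap with box 0
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.9, 0.8, 0.7])
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]): the overlapping lower-score box is suppressed
```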
Table 1. Setting the rest of the parameter values.

| Parameter Name | Optimizer | MaxEpochs | InitialLearnRate | Activation Function | MiniBatchSize |
|---|---|---|---|---|---|
| Parameter value | Sgdm | 10 | 1 × 10⁻⁵ | Relu | 1 |

Sgdm: Stochastic Gradient Descent with Momentum. Relu: Rectified Linear Unit.
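The parameter names in Table 1 resemble MATLAB's trainingOptions arguments; for readers working in PyTorch, a roughly equivalent setup is sketched below. The momentum value and the placeholder model are assumptions, since Table 1 does not report them.

```python
import torch
import torch.nn as nn

# Placeholder model; the paper's parallel-convolution Faster R-CNN is not
# reproduced here.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())

# Sgdm = stochastic gradient descent with momentum; 0.9 is an assumed value.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)

max_epochs = 10      # MaxEpochs in Table 1
mini_batch_size = 1  # MiniBatchSize in Table 1 (one image per iteration)
```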
Table 2. Performance comparison of various feature extraction networks.

| Backbone Network | Anchor Box | Precision/% | Recall/% | mAP/% | Detection Speed/(frame/s) |
|---|---|---|---|---|---|
| Alexnet | 128 × 128, 256 × 256, 512 × 512 | 16.85 | 29.47 | 27.13 | 20.58 |
| Vgg16 | 128 × 128, 256 × 256, 512 × 512 | 60.63 | 79.57 | 71.40 | 10.83 |
| GoogLeNet | 128 × 128, 256 × 256, 512 × 512 | 37.21 | 45.96 | 41.81 | 9.33 |
| ResNet50 | 128 × 128, 256 × 256, 512 × 512 | 56.09 | 82.29 | 77.79 | 15.31 |
| Parallel convolution | 128 × 128, 256 × 256, 512 × 512 | 66.60 | 82.45 | 80.07 | 12.69 |
| Parallel convolution | 100 × 80, 160 × 120, 256 × 100, 300 × 130, 256 × 160 | 65.20 | 84.04 | 82.96 | 12.60 |
| Parallel convolution (Progressive-NMS) | 100 × 80, 160 × 120, 256 × 100, 300 × 130, 256 × 160 | 73.30 | 86.54 | 84.41 | 12.14 |
Table 3. Oat ears counting results of different models.

| Experimental Area Number | Manual Count | Faster R-CNN+Vgg16 | Faster R-CNN+ResNet50 | Improved Model | Ling Haibo [2] | Huang Shuo [6] |
|---|---|---|---|---|---|---|
| 1 | 10 | 7 / 70.00 | 7 / 70.00 | 9 / 90.00 | 7 / 70.00 | 9 / 90.00 |
| 2 | 13 | 9 / 69.23 | 8 / 61.54 | 11 / 86.61 | 11 / 86.61 | 11 / 86.61 |
| 3 | 25 | 19 / 76.00 | 17 / 68.00 | 23 / 92.00 | 20 / 80.00 | 22 / 88.00 |
| 4 | 8 | 4 / 50.00 | 5 / 62.50 | 6 / 75.00 | 6 / 75.00 | 8 / 100 |
| 5 | 9 | 7 / 77.78 | 7 / 77.78 | 7 / 77.78 | 7 / 77.78 | 8 / 88.89 |
| 6 | 7 | 4 / 57.14 | 6 / 85.71 | 7 / 100 | 7 / 100 | 7 / 100 |
| 7 | 13 | 10 / 76.92 | 11 / 84.62 | 12 / 92.31 | 10 / 76.92 | 10 / 76.92 |
| 8 | 17 | 15 / 88.24 | 14 / 82.35 | 15 / 88.24 | 15 / 88.24 | 16 / 94.12 |
| 9 | 12 | 10 / 83.33 | 8 / 66.67 | 12 / 100 | 10 / 83.33 | 12 / 100 |
| 10 | 11 | 10 / 90.91 | 9 / 90.00 | 10 / 90.91 | 10 / 90.91 | 10 / 90.91 |
| Total | 125 | 95 / 76.00 | 93 / 74.40 | 112 / 89.60 | 103 / 82.40 | 113 / 90.40 |

Each model column lists the detected count / counting accuracy (%).
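The Total row in Table 3 follows accuracy = detected count / manual count × 100; a quick arithmetic check:

```python
manual_total = 125
totals = {"Faster R-CNN+Vgg16": 95, "Faster R-CNN+ResNet50": 93,
          "Improved model": 112, "Ling Haibo [2]": 103, "Huang Shuo [6]": 113}
for model, count in totals.items():
    # Prints 76.00, 74.40, 89.60, 82.40, and 90.40 (matching the Total row).
    print(f"{model}: {count}/{manual_total} = {count / manual_total * 100:.2f}%")
```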
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
