Convolutional Neural Networks in Computer Vision for Grain Crop Phenotyping: A Review

Abstract: Computer vision (CV) combined with a deep convolutional neural network (CNN) has emerged as a reliable analytical method to effectively characterize and quantify high-throughput phenotyping of different grain crops, including rice, wheat, corn, and soybean. In addition to the ability to rapidly obtain information on plant organs and abiotic stresses, and the ability to segment crops from weeds, such techniques have been used to detect pests and plant diseases and to identify grain varieties. The development of corresponding imaging systems to assess the phenotypic parameters, yield, and quality of crop plants will increase the confidence of stakeholders in grain crop cultivation, thereby bringing technical and economic benefits to advanced agriculture. Therefore, this paper provides a comprehensive review of CNNs in computer vision for grain crop phenotyping. Such a review can serve as a roadmap for future research in this thriving area. The CNN models (e.g., VGG, YOLO, and Faster R-CNN) used in CV tasks, including image classification, object detection, semantic segmentation, and instance segmentation, and the main results of recent studies on crop phenotype detection are discussed and summarized. Additionally, the challenges and future trends of the phenotyping techniques in grain crops are presented.


Introduction
Global food security remains an important issue for human development [1]. By 2050, the global population is likely to exceed 9 billion, which means that agricultural production will need to increase by at least 70% from its current level to meet the growing demand for food [2]. Grains are the main component of the human diet, and rice, wheat, corn, and soybean account for more than 80% of global grain production [3]. Intelligent perception of crop phenotypic information helps to achieve precise field management, such as the selection of new varieties of high-yield and high-quality crops, and the minimization of agricultural inputs without affecting crop output. Plant phenotypes are the recognizable morphological, physiological, and biochemical characteristics and traits resulting from gene-environment interactions, including plant structure, composition, growth, and development [4]. This means that phenotypic assessment not only involves the traits expressed by crop genes, but also reflects complex traits such as physiology, biochemistry, quality, stress resistance, or ones that are influenced by the external environment.
Computer vision (CV), when combined with pattern recognition algorithms and automatic classification tools, exhibits outstanding performance. Traditional plant phenotype detection relies on manual observation and measurement to obtain a description of the external morphology of the plant, and then assess the relationship between genes or external environment and phenotype. However, this approach can only detect individual traits from a small sample of crops, thus the acquisition process is inefficient and the amount of data available is very limited. With the increasing demand for high-volume plant phenotypic information, researchers urgently need high-precision, high-throughput, and low-cost techniques to replace traditional manual methods of obtaining relevant data. A variety of imaging techniques are available to collect complex traits related to growth, yield, and adaptation to biotic or abiotic stresses (e.g., diseases, insects, water stress, and nutrient deficiencies), including color imaging (e.g., machine vision), imaging spectroscopy (e.g., multi-spectral and hyperspectral remote sensing), thermal infrared imaging, fluorescence imaging, 3D imaging, and laminar imaging [5].
Over the past few decades, computer vision has been widely applied to analyze the phenotypic characteristics of grain crops and thus ease the food supply problem. Although a review of the phenotypic assessment of grain crops based on computer vision was published in 2018, that review mainly summarized the application of traditional machine-learning algorithms such as the support vector machine (SVM) and the back-propagation neural network (BPNN) [6]. In addition, some researchers have reviewed the research on pest and disease analysis of crops [7], crop and weed identification [8], and physical and chemical phenotypic characteristics of crops [9], but each of these covered only one particular phenotyping task. Importantly, new network architectures and strategies applied to convolutional neural networks (CNNs) and computer vision since 2019 are rarely covered in the broader reviews of crop phenotype detection. Several papers have been published in the last three years that provide comprehensive reviews of deep learning techniques for such computer vision tasks as image classification [10], object detection [11], and semantic and instance segmentation [12]. These reviews effectively summarize the basic principles, development history, and future trends for the latest CNNs in computer vision, but none of them provide information related to agriculture, which highlights a gap between these technological theories and phenotyping applications.
Focusing on the state-of-the-art CNN algorithms rather than traditional machine learning (the specific differences are shown in Figure 1), this study is an important early step in the search for phenotyping of grain crops. Given the importance of the four most productive grain crops (rice, wheat, maize, and soybean) in the world, the related work since 2019 on computer vision-based CNN models for the detection of crop organs, crops among weeds, plant diseases, insect infestations, abiotic stresses, and grain varieties has been reviewed. The goal is to provide a comprehensive overview of novel CNN models combined with CV for phenotype detection in grain crops and to provide researchers and breeders with clear guidance for related decisions, which in turn can help boost the productivity of grain crops.

Computer Vision (CV) and Convolutional Neural Networks (CNNs)

CV
In recent years, both the hardware and software of CV systems have been significantly developed. The hardware, including cameras, lights, and communication devices, is the foundation of CV, while the software, such as image processing algorithms, is the core of the system. Illumination devices are indispensable to a typical image acquisition system. By geometry, they can be divided into point light sources, strip light sources, ring light sources, backlight light sources, structured light sources, and combined light sources. By emitter type, they can be further classified as light-emitting diode (LED) light sources, halogen light sources, and high-frequency fluorescent light sources. In addition, the camera can be characterized as a global shutter or a rolling shutter camera.

CNN
Since 2012, CNNs have dominated solutions to CV tasks, showing superior performance over traditional machine-learning methods [14]. CNNs are deep learning architectures that learn features automatically for image processing and image recognition. After its parameters have been optimized through training, the CNN performs multiple layers of nonlinear transformations on the input data, continuously combining the low-level features, and finally obtains a high-level semantic representation. Compared with traditional machine learning, a CNN can use a deeper neural network model trained directly on the input data, which simplifies data processing.
A typical CNN consists of convolutional layers, pooling layers, and fully connected layers [15]. The neurons in a convolutional layer are arranged in a matrix to form a multi-channel feature map. A neuron in each channel is connected to only a part of the feature map of the preceding layer [16]. The output of the neuron is obtained by convolving that local region with a convolution kernel and then applying an activation function. Weight sharing is a key component of CNNs: neurons located on the same channel feature map of the same convolutional layer are obtained by applying the same convolutional kernel to the feature map of the previous layer. The convolutional layer detects local combinations of features in the preceding feature map, while pooling layers merge features with the same semantics. Because the patterns formed by adjacent positions may be slightly jittered, the pooling operation extracts the main information from the preceding feature map. Maximum pooling and average pooling are common pooling operations. Pooling helps the model remain approximately invariant to small translations and rotations while preserving features [15]. After alternating between convolution and pooling, a fully connected layer often appears. Each neuron in the fully connected layer is connected to every neuron in the preceding layer. All the information is combined to turn the multi-dimensional features into one-dimensional features, which are handed over to the final regressor or classifier to produce the final result.
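The convolution, activation, and pooling steps described above can be illustrated with a minimal sketch in plain Python. The function names, the toy 5 x 5 image, and the edge-detecting kernel are assumptions chosen for illustration; they are not taken from any of the reviewed models, and real CNNs learn their kernels during training.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation of one channel with one shared kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(fmap):
    """Element-wise activation: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


# Hypothetical hand-set kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1], [-1, 1]]

# Toy image with a vertical edge, and the same image shifted left by one pixel.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
shifted = [[0, 1, 1, 1, 1] for _ in range(5)]

features = max_pool(relu(conv2d(image, kernel)))
features_shifted = max_pool(relu(conv2d(shifted, kernel)))
```

In this toy case the same kernel weights are reused at every position (weight sharing), and the pooled output is identical for the original and the shifted image, which is the small-jitter tolerance that pooling is meant to provide.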

Image Classification
Image classification aims to assign predefined class labels to images. The CNN is currently the most popular neural network that combines a set of mathematical operations (e.g., convolution, pooling, and activation), using various connection schemes, such as plain stacking, inception, and residual connections, to learn operational parameters from annotated images in order to classify image datasets (Figure 2). In 2012, the first modern CNN architecture, named AlexNet, was proposed. The algorithm demonstrated strong performance in image classification in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2012) competition in that year [17]. The report of this model introduced a new era of image classification and other CV tasks using CNNs. From 2014 to 2017, researchers developed several representative CNNs such as the residual neural network (ResNet) [18], the visual geometry group network (VGG) [19], and the dense convolutional network (DenseNet) [20] for image classification. These CNNs significantly improved the learning ability and recognition complexity by using efficient computational algorithms and modified connectivity schemes. From 2017, more studies focused on the use of reinforcement learning to search for the best CNN architecture that could yield higher performance [21]. This process introduces a reinforcement learning framework to find the optimal convolutional image elements on small datasets, followed by stacking and transferring the resulting image elements in a different way to a large unknown dataset.
Researchers have investigated the mechanisms of CNNs for image classification. One study improved AlexNet to create a new variant (ZFNet) using a visualization tool. This tool is a framework integrated with CNNs that can map neuronal activity back to the input pixel space. Thus, pixel-level activations can be visualized after each convolutional layer, which is particularly useful for understanding the CNN mechanism for further upgrades. CNNs can learn general representations of images rather than features solely for classification. Subsequent research developed various gradient-based methods, including guided backpropagation, gradient-weighted class activation mapping (Grad-CAM), and layer-wise relevance propagation (LRP). Meanwhile, some general frameworks (e.g., LIME and occlusion maps) can also be used to display important image regions for classification results [22,23].
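As an illustration of how an occlusion map highlights the image regions that matter for a classification result, the following sketch masks one patch at a time and records how much the class score drops. The `toy_score` function is a hypothetical stand-in for a trained CNN's class score, and the patch size and fill value are likewise assumptions.

```python
def occlusion_map(image, score_fn, patch=1, fill=0.0):
    """Return a heat map of score drops caused by occluding each patch."""
    base = score_fn(image)
    h, w = len(image), len(image[0])
    heat = [[0.0] * w for _ in range(h)]
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            rows = range(top, min(top + patch, h))
            cols = range(left, min(left + patch, w))
            occluded = [row[:] for row in image]  # copy, keep input intact
            for i in rows:
                for j in cols:
                    occluded[i][j] = fill
            drop = base - score_fn(occluded)
            for i in rows:
                for j in cols:
                    heat[i][j] = drop
    return heat


def toy_score(image):
    """Hypothetical classifier score: mean brightness of the central 2 x 2 block."""
    return sum(image[i][j] for i in (1, 2) for j in (1, 2)) / 4.0


image = [[1.0] * 4 for _ in range(4)]
heat = occlusion_map(image, toy_score)
```

Pixels whose occlusion lowers the score (here, the central block) receive high values, while pixels the score ignores stay at zero. With a real network, `score_fn` would wrap a forward pass and return the probability of the class of interest.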

Object Detection
Object detection is defined as determining the location of objects in a given image and the class to which each object belongs. As shown in Figure 2, object detection using CNNs can be divided into two categories: one-stage and two-stage CNN architectures. Among the early frameworks, OverFeat is the most representative model [24]; it won the localization task of the 2013 ILSVRC competition. Then, a series of region-based convolutional neural network (R-CNN) frameworks was introduced, including the original R-CNN [25], Fast R-CNN [26], and Faster R-CNN [27]. There are three key techniques in the R-CNN architectures: the region proposal network (RPN), the region of interest (ROI) pooling operation, and the multi-task loss function. The R-CNN family has been widely adopted as object detectors for various domain datasets.
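The multi-task loss mentioned above couples classification and box regression in a single objective. The sketch below follows the general form used in the Fast R-CNN literature (a log loss on the class plus a smooth L1 loss on the box offsets, applied only to non-background regions). The review itself does not give the formula, so the function names, the convention that class 0 is background, and the weight `lam` are assumptions for illustration.

```python
import math


def cross_entropy(class_probs, true_class):
    """Log loss on the predicted probability of the true class."""
    return -math.log(class_probs[true_class])


def smooth_l1(pred_box, true_box):
    """Smooth L1 loss summed over the four box regression targets."""
    total = 0.0
    for p, t in zip(pred_box, true_box):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def multi_task_loss(class_probs, true_class, pred_box, true_box, lam=1.0):
    """Classification loss plus box loss; class 0 is assumed to be background."""
    loss = cross_entropy(class_probs, true_class)
    if true_class != 0:  # background regions carry no box target
        loss += lam * smooth_l1(pred_box, true_box)
    return loss


# One foreground region: uncertain class prediction, one box coordinate off by 2.
loss_fg = multi_task_loss([0.5, 0.5], 1, [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0])
# The same prediction for a background region contributes only the class term.
loss_bg = multi_task_loss([0.5, 0.5], 0, [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0])
```

Training both terms together lets the shared convolutional features serve classification and localization at once, which is the motivation usually given for this design.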

Semantic and Instance Segmentations
Semantic segmentation aims to assign a class to each pixel in an image, but objects of the same class are not distinguished from one another. Instance segmentation outputs the mask and class of each individual target object. Typically, CNN architectures for semantic and instance segmentation can be divided into two categories: encoder-decoder-based frameworks and detection-based frameworks, as shown in Figure 2. The encoder-decoder-based model is the earliest type of image segmentation network designed to improve segmentation accuracy. In the encoder stage, the CNN extracts semantic features from input samples. In the decoder stage, deconvolution is used to map the extracted features to a label for each pixel. Representative models based on the encoder-decoder design include fully convolutional networks (FCNs) [28], DeepLab [29], and U-Net [30]. Frameworks including R-CNN, Faster R-CNN, and Mask R-CNN have been widely used for instance segmentation [31,32].

Advances in Phenotyping of Four Grain Crops Based on CV and CNN
In grain crops, conventional inbreeding and artificial breeding based on molecular and genomic engineering are closely dependent on phenotypic information, which remains a bottleneck limiting crop breeding [34,35]. From field to table, grain crop phenotypes play an important role in enhancing crop germplasm, strengthening breeding, and evaluating commercial performance. Related researchers have invested a lot of effort in developing high-throughput and low-cost advanced phenotyping techniques. Particularly widely acknowledged is the development of CNNs combined with CV technology, which marks a new stage in crop phenotype detection. One of the challenges in breeding grain crops is to improve yield potential and quality stability [36]. However, traditional phenotyping methods based on manual measurements are typically labor-intensive and time-consuming when assessing multiple traits of crops [37]. The combination of cutting-edge CNNs and CV technology can achieve high-throughput screening of high-quality crop varieties, accurate yield prediction, automatic field weed detection, and early automatic diagnosis of pests and diseases, all of which are essential for the study of crop yield and quality enhancement.

Crop Organ Detection and Counting
Recent advances in CV and breakthroughs in deep learning have created new opportunities for the detection and counting of crop organs [38]. Traditionally, crop organ phenotypic information was obtained by manual measurement, such as measuring crop height and leaf width with a straightedge or counting with the naked eye. This is not only time-consuming and labor-intensive, but also yields a limited variety of features with low precision. CNN-based methods have shown promising results compared to traditional methods for crop selection in breeding programs [39]. Three approaches, namely object detection, semantic segmentation, and instance segmentation, have been proposed for recognizing and counting the organs of the four major grain crops.
An object detection method which integrates the feature pyramid network (FPN) into the Faster R-CNN network has been successfully used for counting rice spikes [40]. Li et al. [41] investigated the performance of Faster R-CNN and RetinaNet in predicting the number of wheat spikes at different growth stages. The RetinaNet model achieved higher accuracy for wheat spikes at the filling and maturity stages. Compared to Faster R-CNN and RetinaNet, Cascade R-CNN obtained a higher average precision (AP) of 89.6 for the detection and counting of soybean flowers and seeds [42]. The You Only Look Once (YOLO) v4 architecture was used to improve the detection speed and accuracy of wheat spikes [43]. TasselNet (ResNet34) was then established to detect the tassels of maize at different stages [44]. The backbone part of YOLOv4 was enhanced by adding a dual spatial pyramid pooling (SPP) network to boost feature learning and broaden the receptive field of the convolutional network. The results obtained showed the superiority of the detection module compared to the early methods using SVM classifiers (Lu et al. [45], 2015) and a neural network intensity model (Lu et al. [46], 2016).
Many studies investigated crop organ detection and counting based on semantic segmentation. Sadeghi-Tehran et al. [47] not only successfully identified and quantified the number of wheat spikes in RGB images taken under natural field conditions, but also developed an efficient CV and CNN system based on DeepCount. The method used simple linear iterative clustering (SLIC) to segment images into superpixels and constructed a reasonable feature model for the semantic segmentation of wheat spikes. The results indicated that the model was able to detect the total number of wheat spikes in an image and estimate the number of spikes per square meter with a maximum accuracy of 98%. In another study, Xiong et al. [48] proposed a simple and effective contextual extension of TasselNet, named TasselNetv2, that could significantly improve the performance of local regression networks. Experiments showed that TasselNetv2 was faster than TasselNet. Meanwhile, the classical model of semantic segmentation based on CNNs was used to detect the corn cob (Kienbaum et al. [49]). The Mask R-CNN model was used to extract shape parameters including asymmetry, ellipticity, and length of cobs, achieving an accuracy of about 100% for maize cob phenotypic parameters. It was found that the number of kernels in a corn cob image can be accurately estimated by DeepCorn (Khaki et al. [50]). In their work, DeepCorn uses VGG-16 as the backbone for feature extraction and merges feature maps from multiple scales of the network, making it robust to image scale variations. DeepCorn successfully counted the kernels on a cob, regardless of their orientation and illumination conditions. In addition to this, Yang et al. [51] proposed a novel synthetic image generation and enhancement method based on domain randomization. The study used the Mask R-CNN model combined with transfer learning to perform the semantic segmentation of soybean seed images and successfully obtained specific organ phenotype parameters, which further deepens the application of CNNs in semantic segmentation tasks.
A noticeable concern is that although CNNs can provide accurate semantic masks, the counting accuracy can still suffer from inaccurate post-processing. To address this concern, studies explored the use of instance segmentation CNNs that can directly segment individual objects in images [52]. For instance, a sophisticated soybean phenotypic measurement algorithm, named soybean phenotypic measurement instance segmentation (SPM-IS), was developed, enabling more rapid and accurate acquisition of phenotypic data for soybean stems, pods, and seeds (Li et al. [53]). This study used the ResNet-101-FPN model and the SPM-IS algorithm to perform instance segmentation on images and measure the length and width of target objects to extract soybean phenotypic data. The test results showed that the mask mean average precision (mAP) values for pods, stems, and seeds were 95.7%, 93.5%, and 94.6%, respectively.
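The post-processing concern raised above can be made concrete with a small sketch. A common way to count organs from a semantic mask is to count connected foreground regions, which undercounts whenever two organs touch. The masks and the 4-connectivity choice below are assumptions for illustration and do not reproduce the pipeline of any cited study.

```python
from collections import deque


def count_components(mask):
    """Count 4-connected foreground regions in a binary mask (breadth-first search)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                count += 1
                seen[i][j] = True
                queue = deque([(i, j)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


# Two organs that are clearly separated in the semantic mask.
separate = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]
# The same two organs touching along the top row.
touching = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
]

n_separate = count_components(separate)
n_touching = count_components(touching)
```

The separated mask yields two regions, while the touching mask collapses to one. An instance segmentation network avoids this failure mode because it predicts one mask per object rather than one mask per class.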
Faster R-CNN, Mask R-CNN, RetinaNet, and VGG have been widely studied with regard to the detection and counting of organs of grain crops, as shown in Table 1. Some strategies such as SPP, SLIC, and domain randomization have been added to the model training for the first time to achieve feature enhancement. In general, object detection is more widely used than semantic and instance segmentation in this area, but image classification was used less often in crop organ identification and counting. It is encouraging that some counting methods based on 3D image sequences or videos are emerging to provide a strategy to solve the above problems [54].

Weed and Crop Recognition and Segmentation
Weeds in a field compete with crops for nutrients, sunlight, and growing space. They need to be removed in time to avoid affecting crop yields [57]. Early applications of machine-learning methods to solve weed recognition problems generally used the color co-occurrence matrix (CCM) to extract features in terms of hue, color saturation, and intensity, or morphological and color features as input to the classifier [58]. However, the leaves of different plants often have the same color and shape, and it is challenging to identify weeds by the difference of leaf features. Traditional methods rely on hand-designed features for this distinction, which perform well only on specific datasets. With advances in intelligent sensing technology, the combination of CNNs and CV has emerged as a promising tool for accurate and real-time detection of weeds and crops in the field.
Rice is a crop planted with fixed row spacing, so it can be identified by its location. Lin et al. [59] developed a Faster R-CNN model to determine the specific row spacing parameters and successfully detected rice seedlings from weeds with an accuracy of 89.8%. Wang et al. [60] proposed a new method for the recognition of rice seedling rows based on row vector grid classification. In their research, seedling feature extraction and row vector grid classification were built into an end-to-end CNN model. The method successfully realized crop recognition in complex weed scenarios. The Faster R-CNN model was also used to distinguish between weeds and maize [61]. This study proposed an architecture using a VGG19 pre-trained network for distinguishing maize seedlings from weeds under complex field conditions. The results revealed that the Faster R-CNN model has great potential for plant detection. Additionally, Jiang et al. [62] proposed a graph convolutional network (GCN) recognition method based on a similar approach. The GCN graph was constructed using the extracted CNN features of weeds and their Euclidean distances for maize and weed recognition. The results show that the GCN-ResNet-101 method achieved an accuracy score of 97.80%, which was better than state-of-the-art methods including AlexNet, VGG16, and ResNet-101.
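One plausible way to build a graph from CNN features and their Euclidean distances is to connect each sample to its nearest neighbors in feature space. The sketch below is a minimal illustration under that assumption. The k-nearest-neighbor rule, the function name `knn_adjacency`, and the toy 2D feature vectors are hypothetical and may differ from the construction used in the cited study.

```python
import math


def knn_adjacency(features, k=2):
    """Symmetric adjacency matrix linking each sample to its k nearest neighbors."""
    n = len(features)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Euclidean distance from sample i to every other sample.
        dists = sorted(
            (math.dist(features[i], features[j]), j)
            for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            adj[i][j] = 1
            adj[j][i] = 1  # keep the graph undirected
    return adj


# Toy stand-ins for CNN feature vectors: two tight clusters of two samples each.
features = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
adjacency = knn_adjacency(features, k=1)
```

A graph convolution layer would then propagate label information along these edges, so that samples with similar CNN features tend to receive the same class.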
Semantic segmentation has also been applied to the management of weeds. In a recent study, a multi-task semantic segmentation-convolutional neural network (MTS-CNN) model was designed for detecting crops and weeds using one-stage training [63]. This approach strengthened the correlations between the crop and weed classes, so that the object (crop and weed) regions were trained intensively to achieve the highest segmentation accuracy. Weirong et al. [64] proposed an improved Mask R-CNN-based algorithm for maize seedling segmentation. The model was trained using ResNeXt50/101-FPN as a feature extraction network. The average recognition accuracy of the model was higher than 94.7%. Furthermore, Zhang et al. [65] developed a weed classification model based on the YOLOv3-tiny network. In the study, a real-time detection system for field weeds based on unmanned aerial vehicles (UAVs) and mobile devices was designed to detect five kinds of weeds. In addition, Haq [66] and Babu and Ram [67] conducted taxonomic studies on grasses and broadleaf weeds in soybean. The network architectures used were CNNs with learning vector quantization (LVQ), and a deep residual convolutional neural network (DRCNN). Both methods achieved over 97% accuracy for the individual targeting of two weed species.
In summary, the discrimination of crops such as rice and maize from complex weeds depends on the correct identification and localization of the plant by the model. Researchers have proposed many CNN-based solutions, most of which are implemented using object detection and semantic segmentation (the related studies are tabulated in Table 2). All of these results far exceed the accuracy achieved by a wide range of methods with artificially designed features. In recent years, researchers have achieved high recognition and segmentation accuracies on rice and maize image datasets by using classical networks such as ResNet and Faster R-CNN, or by building other shallow networks. Other studies were carried out on the classification of weed species based on supervised and semi-supervised learning methods [68,69]. In the future, advanced network models and more comprehensive datasets are needed to enable the identification of multiple crops and common weeds [70].

Crop Disease Detection and Classification
The intelligent detection of plant diseases has received increasing attention in recent years. Crop diseases negatively affect agricultural production [72]. Early detection and control of crop diseases play a crucial role in the management of and decision making involved in agricultural production. Traditional machine learning approaches to feature analysis of crop photographs can detect diseases earlier than human observation. Nevertheless, such methods focused on a limited number of crops and were usually performed on small data sets. In recent years, methods based on deep learning and image technologies have been widely used in plant pathology.
CNNs have been successfully used for the classification and detection of crop diseases. Sharma et al. [73] developed a CNN model based on transfer learning to classify diseases in rice leaf images. Based on the disease features, Krishnamoorthy et al. [74] successfully distinguished three invasive rice diseases, including leaf blast, white leaf blight, and brown spot, from healthy rice leaves, with an accuracy of 95.67%. Singh and Arora [75] and Kumar and Kukreja [76] developed seven CNN models to classify wheat diseases including powdery mildew, stem rust, and leaf rust. Compared with VGG16, VGG19, AlexNet, ResNet-34, ResNet-50, and ResNet-18, ResNet101 achieved the highest accuracy of 98.6%. Similarly, Jiang et al. [77] adopted the PlantVillage dataset to pretrain several CNN models based on transfer learning. Figure 3 shows the comparison of CNNs used in this study in terms of accuracy, memory, and processing time, which provides a reference for diagnosing other diseases.

Object detection combined with the classical Faster R-CNN model plays a great role in the detection of grain crop diseases. Bari et al. [78] found a solution for the real-time detection of rice leaf diseases using Faster R-CNN to precisely localize the target. Results showed that the approach achieved accuracies of 98.09%, 98.85%, and 99.17% for the automatic detection of rice blast, brown spot, and hispa, respectively. In another study, Zhou et al. [79] proposed a fast rice disease detection method based on a fuzzy C-means and K-means clustering algorithm (FCM-KM) and Faster R-CNN. FCM-KM was optimized using the chaos-based dynamic population firefly algorithm and maximum minimum distance. Zhang et al. [80] designed a multi-feature fusion Faster R-CNN (MF3R-CNN) model for the detection of soybean leaf disease, with an average accuracy of 83.34%. Compared to the studies of Shrivastava et al. [81] (2017) and Pires et al. [82] (2016) using the k-nearest neural network and local descriptors to distinguish the diseased from the healthy, the model has practical implications for multiple disease identifications.
Grain crop diseases in complex environments have been successfully detected based on semantic segmentation. For instance, Ennadifi et al. [83] used a Mask R-CNN to segment wheat spikes from the background. Then, a DenseNet121 model combined with gradient-weighted class activation mapping (Grad-CAM) was used for localizing the diseased areas on wheat spikes in an unsupervised manner, yielding an accuracy of 93.47%. Nevertheless, wheat disease classification is susceptible to various visual disturbances. Lin et al. [84] proposed an M-bCNN model for the classification of wheat leaf diseases, achieving a test accuracy of 90.1%. Su et al. [85] developed a Mask R-CNN model to evaluate Fusarium head blight (FHB) severity with an accuracy of 77.19% (Figure 4). In their study, a ResNet-101 network-based FPN was used as the backbone of Mask R-CNN to segment wheat spikes and diseased areas, yielding accuracies of 77.76% and 98.81%, respectively. On this basis, the FPN based on the ResNet network was further upgraded as the backbone of BlendMask networks for the severity assessment of wheat FHB [86]. The newly constructed model demonstrated outstanding performance in the identification of wheat spikes occluded by awns, and it is more concise and efficient than the Mask R-CNN.
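Once a network has produced a spike mask and a diseased-area mask, a severity score can be derived from the two. The sketch below assumes that severity is the fraction of spike pixels that are also marked as diseased. This definition, the function name `severity_ratio`, and the toy masks are assumptions for illustration and are not confirmed details of the cited studies.

```python
def severity_ratio(spike_mask, disease_mask):
    """Fraction of spike pixels that are also labeled as diseased."""
    spike_pixels = 0
    diseased_pixels = 0
    for spike_row, disease_row in zip(spike_mask, disease_mask):
        for s, d in zip(spike_row, disease_row):
            if s:
                spike_pixels += 1
                if d:  # count disease only inside the spike
                    diseased_pixels += 1
    if spike_pixels == 0:
        raise ValueError("spike mask is empty; severity is undefined")
    return diseased_pixels / spike_pixels


# Toy binary masks: the spike covers 8 pixels, 2 of which are diseased.
spike_mask = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
disease_mask = [
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]

severity = severity_ratio(spike_mask, disease_mask)
```

Under this definition, errors in either mask propagate directly into the severity estimate, which is one reason the segmentation accuracy of both the spikes and the diseased areas matters.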
In general, image classification has more comprehensive applications than object detection and segmentation tasks in crop disease detection. The Mask R-CNN model not only allows for semantic segmentation, but also allows for more efficient analysis of disease severity levels. Faster R-CNN, when used as a tool for object detection, focuses more on identifying the location of disease spots. When it is combined with FCM-KM, the results obtained are more comprehensive and convenient. Detailed comparison results are shown in Table 3. CNN models combined with FCM-KM, GradCAM, and other strategies have been substantially optimized, providing new ideas for crop disease detection. In addition, it has been found that a transfer-learning method employing retuning of all parameters produced the highest accuracy [77].
Abbreviations: N-CNN, neonatal convolutional neural network; MAF, multi-pathway activation function; M-bCNN, matrix-based convolutional neural network; RPN, region proposal network; PSP Net, pyramid scene parsing network; BiLSTM, bi-directional long short-term memory.

Crop Insect Infestation Detection
Pests are a major cause of crop destruction [95]. However, the extensive use of chemicals such as pesticides to control pests has had adverse effects on agro-ecosystems [96]. Traditional methods use chlorophyll histograms to detect discoloration caused by pests, or SVM combined with special algorithms to identify the presence of pests [97,98]. For these methods, segmentation becomes difficult if the background contains distractions such as other leaves and plants. In addition, designing hand-crafted features such as color histograms and texture features requires expertise and is difficult to apply universally. Lately, numerous CNN-based pest identification methods have been presented in the computer vision field, and they have shown excellent performance in early pest control.
For insect infestation classification tasks, many cutting-edge models and strategies have been developed in recent years, which in turn has led to more efficient deep networks. For example, four different CNN models, VGG16, VGG19, InceptionV3, and MobileNetV2, were applied to the detection of maize leaves infected by fall armyworms (FAW) [99]. The study found that InceptionV3 and MobileNetV2 performed better than the other models, with identification accuracies of 100%. Moreover, Tetila et al. [100] and Abade et al. [101] used the simple linear iterative clustering (SLIC) strategy and the NemaNet model, respectively, to classify pest-infested soybean images, and both showed extremely promising results.
Object detection is a computer vision task that involves identifying an object's class together with its location in the image. On this basis, Li et al. [102] developed a ResNet-50 with a region proposal network (RPN) for pest identification in wheat fields, achieving accuracies of 90.88%, 88.76%, and 70.2% for wheat sawfly, wheat aphid, and wheat mite, respectively. In addition, the Faster R-CNN model was effectively applied to detect pest infection in grain crops [103]. Furthermore, Verma et al. [104] applied three popular YOLO models to identify pests in soybean. YOLOv5 exhibited better performance than YOLOv3 and YOLOv4 in pest detection and recognition.
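Because object detection returns a location as well as a class, a predicted box has to be compared with an annotated one. A common measure for this is intersection over union (IoU). The sketch below is illustrative only: the studies cited above do not state here which overlap criterion they used, and the function name and the corner-coordinate box format are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this layout is an assumption
    made for the example.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A predicted pest box that partly overlaps the annotated box.
predicted = (1, 1, 3, 3)
annotated = (0, 0, 2, 2)
print(f"IoU = {iou(predicted, annotated):.3f}")
```

A detection is typically counted as correct when its IoU with an annotated box exceeds a chosen threshold, so reported detection accuracies depend on that threshold.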
In summary, CNN-based network models including VGG, Faster R-CNN, and YOLO were effectively used for the detection of crop insect infestations. The VGG model mainly focuses on classifying the species causing the infestation, while Faster R-CNN and YOLO models are more often used to identify and localize the sites of infection, as shown in Table 4. Developing models that use SLIC or an RPN, or adopting models that are popular elsewhere but not yet used in this field, is a new direction for solving pest problems.
Abbreviations: FS, full-scale; TL, time limit.

Abiotic Crop Stress Phenotype Assessment
Abiotic stresses, such as nutrient deficiency, drought, temperature, and salinity stresses, are major challenges for agriculture, and they lead to a significant reduction in crop growth and productivity [106]. Stress phenotyping assessment is an important tool for improving crop stress resistance, and it can be divided into four stages: (1) identification (presence of stress); (2) classification (type of stress); (3) quantification (severity of stress); and (4) prediction (likelihood of stress occurrence) [13]. Although traditional machine-learning methods such as SVM, artificial neural networks (ANN), and Random Forests are often used to study abiotic stress phenotypes of crops [107], the development of deep CNNs offers new opportunities to advance this field.
Image classification combined with CNNs can be effectively used for abiotic crop stress detection. For rice crops, nitrogen (N) concentration is a key indicator of health status. Sethy et al. [108] proposed a CNN-based method for predicting N deficiency stress in rice. They used six leading CNN architectures, ResNet-18, ResNet-50, GoogleNet, AlexNet, VGG-16, and VGG-19, to predict nitrogen deficiency. ResNet-50 + SVM outperformed the other five CNN-based classification models with an accuracy of 99.84%. Additionally, Wang et al. [109] and Rizal et al. [110] developed a DenseNet-121 model and a ResNet-50 model, respectively, to evaluate rice leaves affected by three types of nutrient deficiency, namely N, phosphorus (P), and potassium (K), with an accuracy of over 97%. Furthermore, water stress affects the normal growth of grain. Zhuang et al. [111] developed a multi-scale CNN architecture with 2-Covs Units for assessing the water stress severity of maize, which achieved automatic detection and severity quantification of water stress through computer vision techniques in a non-destructive way.
To sum up, the current research results shown in Table 5 indicate that image classification is more widely used for abiotic stress assessment than other computer vision tasks. Advanced CNN models such as YOLO and Mask R-CNN have not yet been applied in this research area. It is reasonable to expect that other imaging modalities (e.g., hyperspectral imaging) combined with CNNs will offer new approaches to phenotypic stress detection [112].
Abbreviations: SVM, support vector machines.

Crop Seed Variety Classification
As a key input for crop production, seed is of great economic value, and its varietal classification is crucial for maintaining crop yield and varietal purity [113]. However, the phenotypic characteristics of different varieties of grain crop seeds are very similar, with significant overlap in morphology and color. Traditional seed variety classification usually requires manual annotation and judgment by experts in the agricultural field, which is very inefficient. Therefore, it is necessary to explore reliable methods to improve classification efficiency.
Current research shows that crop seed varieties can be classified effectively with CNN technology. For example, Laabassi et al. [114] used five standard CNN structures (such as DenseNet201, Inception V3, and MobileNet) trained with transfer learning to classify wheat seeds into four varieties (Simeto, Vitron, ARZ, and HD), with the best classification accuracy of 95.68% attributed to the DenseNet201 architecture. Javanmardi et al. [115] successfully classified nine corn seed varieties based on a VGG-16 pre-trained CNN model. Gao et al. [116] proposed a CNN-based variety classification model for multiple growth periods of wheat. In the study, the CMPNet, based on ResNet and SENet, achieved high classification precision at the seed stage of wheat (specific performance indices are shown in Figure 5). In addition, Velesaca et al. [117] used a Mask R-CNN architecture to perform instance segmentation of maize seed samples. Some other typical models were also developed in the study, which showed that the CK-CNN achieved the best robustness and stability compared with VGG16 and ResNet50.
It is worth mentioning that newly improved model architectures combined with transfer learning, such as P-ResNet, showed the best accuracy in classifying maize seeds in a non-destructive, fast, and efficient manner [118]. The process is given in Figure 6. The result highlighted the advantages of transfer learning and its potential in deep learning, providing new solutions for CNN-based computer vision and spectroscopic techniques for seed classification and detection.
In conclusion, the various CNN models each have advantages and disadvantages for classifying crop seed varieties (the results of the comparison are tabulated in Table 6). The classic DenseNet and the novel corn kernel-CNN (CK-CNN) have higher accuracy than other models. Furthermore, the proposed methods, such as transfer learning and gradient-weighted class activation mapping (Grad-CAM) techniques, provide new perspectives for maintaining the classification accuracy and robustness of the model. In addition to computer vision, thermal imaging [119] and hyperspectral detection techniques [120] have also achieved great success in variety identification based on CNNs over the last decade.

Discussion
With the development of computer vision and deep learning, image processing has achieved great success over the last decade. One of the key techniques behind this success is the CNN [121]. CNNs are often used as algorithmic tools for analyzing data. As a technique for automatic feature extraction, the CNN can be used for automatic acquisition of crop phenotype information. On that basis, CNN technology, when combined with different computer vision tasks, can perform various phenotype detections of grain crops. For crop organ detection, CNNs can not only count flowers, seeds, and spikes, but can also measure organ length, width, and other shape parameters, with an accuracy of over 90%. The technology can also effectively distinguish weeds from crops by extracting leaf features. In terms of accuracy, CNNs have achieved a recognition rate of more than 94% for maize plants. The overall recognition rate is roughly the same as in previous work even when the dataset is larger, the crop growth stages are more diverse, and the background is more complex. Furthermore, CNN-based research on crop diseases and pests covers a wide range of diseases, including common rusts, powdery mildew, and blast, and insect pests such as mites, wheat aphids, and corn borers. The tasks include not only basic ones such as pest and disease classification and detection, but also more complex ones such as determining infection levels. In addition to biotic stresses, assessments of abiotic stress phenotypes, such as nutrient deficiency and water stress, showed high accuracies. It is worth mentioning that the method has potential in seed variety identification. Despite the limited phenotypic information of seeds, the proposed CNN models were able to classify seeds effectively, with an accuracy of more than 95%. In general, CNNs combined with CV have been widely used in grain crop phenotypic research.
The performance of different CNNs in the phenotype detection of grain crops is influenced by several factors. The main factor is the network architecture. Generally, deep CNN models have higher accuracy than shallow networks [20]. For example, researchers found that DenseNet-121 (with 121 layers) and ResNets (with 50 and 101 layers) had accuracies over 95% [63], while the ResNet with 18 layers only achieved an accuracy of 88.54% for weed and crop recognition [60]. Similarly, as an upgraded version of Faster R-CNN [85], Mask R-CNN allows simultaneous target detection, image classification, and instance segmentation in a single neural network owing to its architectural improvements. In addition, the performance of CNNs is affected by the training strategy used. Jiang et al. [77] found that CNNs trained from scratch converged slowly. In contrast, CNNs trained by fixed feature extraction converged more rapidly but had the lowest accuracy. Furthermore, the input dataset is crucial for training CNNs, because it is the basic source of information. Augmenting the sample images with geometrically transformed replicates, which yields a larger and more general dataset, improved the accuracies of CNNs [55,89,90]. In addition, image quality can interfere with crop phenotyping detection results. In particular, images collected in field conditions can be affected by environmental factors such as complex backgrounds, unstable lighting, and image blur, all of which might lead to misanalysis [60,79]. Therefore, large and varied annotated datasets will always be required for CNNs. In summary, the development of different CNN model architectures with appropriate strategies and datasets can solve various phenotypic tasks.
Object detection based on Fast R-CNN was more universal when crop organs were the object of counting, while Mask R-CNN showed better performance, with accuracies as high as 99%. Image classification was not widely used for the recognition of weeds and crops, but it is favored for biotic and abiotic stress assessment. The performance of YOLO, GoogleNet, and Inception models was outstanding in classifying images of crops infected with pests and diseases, with accuracies of over 95% in different cases. In addition, the VGG-16 model combined with different strategies and datasets successfully completed the tasks of target detection and instance segmentation, respectively, with accuracies as high as 98% [115,117].
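The geometric augmentation noted earlier in this discussion [55,89,90] can be illustrated with a small example. This is a minimal sketch that assumes an image is a 2D list of pixel values and that the replicates are flips and 90-degree rotations; the cited studies may have used other transformations, and all function names are hypothetical.

```python
def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img):
    """Return the original image plus five geometric replicates."""
    variants = [img, hflip(img), vflip(img)]
    rotated = img
    for _ in range(3):  # 90, 180, and 270 degrees
        rotated = rot90(rotated)
        variants.append(rotated)
    return variants


sample = [[1, 2],
          [3, 4]]
replicates = augment(sample)
print(len(replicates), "images from one sample")
```

Each replicate keeps the label of its source image, so the dataset grows without additional annotation effort.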

Challenges of CV and CNNs in Grain Crops and Future Trends
The annotation of datasets is a crucial factor in building robust CNN models. As CNNs need to perform different CV tasks, the relevant datasets require instance-level (bounding box) and pixel-level (mask) annotations. Both of these are very time-consuming tasks [122]. Therefore, in future work, it will be necessary to continue developing semi-supervised learning (SSL) and unsupervised learning (UL) to lower the cost and time of data labeling [123]. Advanced SSL and UL methods used with CNNs, such as K-means, transfer learning, and generative adversarial networks (GANs) [124], have permeated multiple areas of crop phenotyping study. Moreover, a number of promising approaches, such as reinforcement learning [125] and contrastive learning [126], which have succeeded in reducing computational and energy costs in other areas, need further exploration in crop research. There is no doubt that the increasing application of advanced algorithms will effectively alleviate the problems of insufficient training data and scarcity of labeled data in grain crops.
The studies mentioned in this review mainly used RGB images for grain crop phenotype detection. In addition to RGB cameras, more-informative sensors (e.g., multi-spectral or hyperspectral sensors) are opening new possibilities [127]. These sensors have been mounted on UAVs and autonomous robots to obtain more information and to cover a larger area of crop phenotypes [123,128,129]. In particular, UAVs could help farmers to monitor their agricultural fields and apply agrochemical products to crops with ease and high precision [130], so that chemical control can be conducted more effectively.
The development of lightweight CNNs for mobile devices (e.g., cell phone and computer software) is of great practical relevance for helping farmers with agricultural management. However, because GPU performance on mobile devices is inferior to that of GPUs on computers, processing is slower when CNNs are deployed to mobile devices. Hence, the tradeoff between accuracy, time, and memory should be considered in model design.
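One way to make the memory side of this tradeoff concrete is to count the weights in a convolutional layer. The sketch below compares a standard convolution with a depthwise separable convolution, the building block used by MobileNet-style lightweight networks. The layer sizes are hypothetical, bias terms are omitted for simplicity, and the example is an illustration rather than a method taken from the reviewed studies.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: one k x k x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise (k x k per input channel) plus pointwise (1 x 1) pair."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
print(standard, separable, round(standard / separable, 2))
```

For this layer the separable form needs roughly one eighth of the weights, which is why such blocks are attractive when memory and latency are constrained; any effect on accuracy has to be checked on the task at hand.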
It is worth mentioning that Transformer, another mainstream deep learning architecture, is not inferior to CNNs in some detection studies. Compared with CNNs, Transformer has the strong advantage of a self-attention mechanism, which has allowed it to make exciting progress on various vision tasks, including the four tasks mentioned above, multi-modal tasks, video processing, low-level vision, and three-dimensional analysis [131,132]. Many recent studies have tried to introduce Transformer encoders into improved models alongside convolution operations, for example to identify field crop diseases [133] and to segment crops in remote sensing images [134]. Numerous results show that this combined model outperforms a single CNN or Transformer approach and generalizes well. This provides possibilities for transferring deep learning models to mobile phenotype detection devices.
Although spectroscopic techniques based on machine learning have been extensively studied during the past few years, CV techniques based on CNNs will show greater potential in agricultural production [135-142]. Finally, it is clear that much work has been conducted on phenotyping of rice, wheat, maize, and soybean, but other species of grains and even other types of crops (e.g., fruits and vegetables) have also been explored, both to demonstrate interest in the phenotypic problem and to show the potential of CNN-based CV techniques to address it efficiently.
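The self-attention mechanism credited above for Transformer's progress can be sketched in a few lines. This is a simplified illustration of scaled dot-product attention written in plain Python: the learned query, key, and value projections and the multiple heads of a real Transformer are omitted, and the token vectors are made-up values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# Two toy image-patch embeddings attending to each other.
tokens = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
print(weights)
```

Every output mixes information from all input positions, in contrast to a convolution, which only combines values inside a local window.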

Conclusions
CNNs are widely used for phenotype detection of four grain crops. Different CNN models including VGG, YOLO, Fast R-CNN, and Mask R-CNN have been used for image classification, object detection, semantic segmentation, and instance segmentation. In this paper, we reviewed the latest CNNs pertinent to organ counting, weed segmentation, biotic and abiotic stress assessment, and seed variety classification. The results demonstrate the importance of network architecture, development strategy, and annotated datasets in model design for different tasks, all of which can directly affect the performance of CNNs. To benefit from the great potential of CNNs, high-quality sample images remain a crucial element for crop phenotyping, and robust CNNs on mobile devices are desired for practical applications. Given the recent boom in the development of CNNs combined with CV technology, it is anticipated that the method will become widespread for obtaining crop phenotype data in real time, leading to more impactful results that will contribute to precision agriculture and food security in the future.
The current development of modern CNNs for image classification can be divided into three phases: (1) the appearance of modern CNNs (2012-2014); (2) the development and refinement of CNN architecture intensification (2014-2017); and (3) the introduction of reinforcement learning and artificial intelligence for CNN architecture design (start of 2017).

Figure 2. Diagrams of CNN architecture mechanisms for image classification, object detection, and semantic and instance segmentation [33].

Figure 6. Process of transfer learning and classification of maize seeds [118].

Table 1. Summary of major CNN-combined-with-CV tasks for crop organ images.

Table 2. Summary of major CNN-combined-with-CV tasks for weed and crop images.

Table 3. Summary of major CNN-combined-with-CV tasks for crop disease images.

Table 4. Summary of major CNN-combined-with-CV tasks for crop insect infestation images.

Table 5. Summary of major CNN-combined-with-CV tasks for abiotic crop stress images.

Table 6. Summary of major CNN-combined-with-CV tasks for crop seed images.