Advanced Technology in Agriculture Industry by Implementing Image Annotation Technique and Deep Learning Approach: A Review

Abstract: The implementation of intelligent technology in agriculture is being seriously investigated as a way to increase agricultural production while reducing the amount of human labor. Recent agricultural technology has adopted image annotation based on deep learning techniques, and the rapid growth of image data has drawn considerable attention to image annotation. Deep learning can extract features from images and has been shown to analyze enormous amounts of data successfully. It is a type of machine learning method inspired by the structure of the human brain and based on artificial neural network concepts. Through training phases in which a massive amount of data is labeled and linked to its corresponding characteristics, deep learning can draw conclusions about unlabeled data in image processing. For complicated and ambiguous situations, deep learning technology provides accurate predictions. This technology strives to improve productivity, quality and economy and to minimize deficiency rates in the agriculture industry. Accordingly, this article discusses the application of image annotation in the agriculture industry using several deep learning approaches. The various types of annotations used to train on the images are presented. Recent publications are reviewed on the basis of their application of deep learning with current technology. Plant recognition, disease detection, counting, classification and yield estimation are among the many agricultural applications of deep learning architectures that are thoroughly investigated. Furthermore, this review helps researchers gain a deeper understanding of deep learning in agriculture and its future application. According to the reviewed articles, deep learning techniques achieved significant accuracy and predictive performance in the models utilized. Finally, the existing challenges and future promises of deep learning in agriculture are discussed.


Introduction
The agriculture sector is the backbone of most countries, providing enormous employment opportunities to the community as well as goods manufacturing and food supply. Fruit plantation is one of the most important agricultural activities. The production and protection of fruit per capita has recently been considered an essential indicator of a country's growth and quality of life [1]. Population growth of 7.2 to 9.6 billion people is expected [...] produce high-accuracy responses [31,32]. Several automatic image annotation (AIA) techniques other than deep learning have been proposed, such as support vector machines, Bayesian methods, texture resemblance and instance-based methods. Deep learning techniques, on the other hand, have succeeded in image processing throughout the last decade [33]. The high accuracy of deep learning comes at the cost of high computational and storage requirements during the training and inference phases. The training process is both space consuming and computationally intensive because millions of parameters must be refined over many training iterations [34]. Owing to the complexity of the data models, training is quite expensive. Furthermore, deep learning necessitates the use of costly graphics processing units (GPUs) and many machines, which raises the cost to users. Image annotation training based on deep learning can be classified into supervised, unsupervised and semi-supervised categories.
Supervised deep learning involves training on data samples that have been correctly labeled. The algorithm is trained on input data labeled with a certain output until it is able to discern the underlying links between the inputs and the outputs. The system is supplied with labeled datasets during the training phase, which inform it which outputs are associated with which input values. Supervised learning poses a significant challenge because it requires a huge amount of labeled data [35,36]; at least hundreds of annotated images are required during supervised training [37]. The training approach consists of providing a large number of annotated images to the algorithm to help the model learn, then testing the trained model on unannotated images. To determine the accuracy of this method, annotated images with hidden labels are often employed in the algorithm's testing stage. Thus, supervised deep learning models trained on annotated images achieve acceptable performance levels. Most studies apply supervised learning because it promises high accuracy, as in [38][39][40].
Another attractive annotation method is based on unsupervised learning, which, in contrast to supervised learning, deals with unlabeled data. Labels in these cases are frequently difficult to obtain because of insufficient domain knowledge or because labeling is prohibitively expensive. Furthermore, the lack of labels makes setting goals for the trained model problematic, and consequently it is difficult to determine whether the results are accurate. The study in [41] applied a recent unsupervised deep clustering technique to two real weed datasets. The results on these datasets signal a promising direction for unsupervised learning and clustering in agricultural problems. For circumstances where the numbers of clusters and classes differ, the suggested modified unsupervised clustering accuracy proved to be a robust and easier-to-interpret clustering evaluation measure. It was also shown how data augmentation and transfer learning can significantly improve unsupervised learning.
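To make the evaluation problem concrete, the sketch below scores a clustering against ground-truth labels when the number of clusters differs from the number of classes. It is a minimal illustration only: it maps each cluster to its most frequent true label (a many-to-one, majority-vote mapping) and is not the modified clustering accuracy defined in [41]. The function name and the toy data are assumptions made for this example.

```python
from collections import Counter, defaultdict


def majority_vote_clustering_accuracy(cluster_ids, true_labels):
    """Map each cluster to its most frequent true label, then score accuracy.

    Works when the number of clusters differs from the number of classes,
    because several clusters may map to the same class.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")

    # Collect the true labels of the samples that fell into each cluster.
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)

    # A sample counts as correct if it carries its cluster's majority label.
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in members.values())
    return correct / len(true_labels)


# Toy data (hypothetical): three clusters found for a two-class weed/crop problem.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
labels = ["weed", "weed", "crop", "crop", "crop", "weed", "weed", "weed"]
score = majority_vote_clustering_accuracy(clusters, labels)
print(score)
```

A many-to-one mapping tends to reward over-clustering, which is one reason a study may prefer a modified measure; the sketch is meant only to show why labels are still needed at evaluation time.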
Semi-supervised learning, like supervised and unsupervised learning, involves working with a dataset; however, the dataset is separated into labeled and unlabeled parts. This technique is frequently used when labeling the acquired data is too difficult or expensive. It can also be used when the labeled data are of poor quality [42]. The fundamental issue in large-scale image annotation approaches based on semi-supervised learning is dealing with a large, noisy dataset in which the number of images grows rapidly. The ability to identify unwanted plants has improved because of advances in farm image analysis. However, the majority of these systems rely on supervised learning, which necessitates a large number of manually annotated images. As a result, given the huge variety of plant species being cultivated, supervised learning is economically infeasible for the individual farmer. Therefore, [43][44][45] proposed an unsupervised image annotation technique to solve weed detection in farms using deep learning approaches.
Deep learning has significant potential in the agriculture sector in increasing the amount and quality of the produce by image-based classification. Consequently, many researchers have employed the technology and method of deep learning to improve and automate tasks [3]. Its role in this sector gives excellent results in plant counting, leaf counting, leaf segmentation and yield prediction [46]. Noon et al. [47] have reviewed the application of deep learning in the agriculture sector by identifying plant leaf stress in early detection to enable farmers to apply the suitable treatment. Deep learning is effective in detecting leaf stress for various plants. However, implementing deep learning in agriculture requires a large amount of data regarding the plants, in terms of collecting and processing. The necessary data are basically collected using wireless sensors, drones, robots and satellites [48]. The more data used to train the deep learning model, the more robust and pervasive the model becomes [49].
Unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) are examples of robotic systems that provide a cost-effective, adaptable and scalable solution for product management and crop quality [50]. Weeds reduce crop production, and their growth must be monitored regularly to keep them under control. Additionally, applying the same amount of herbicide to the entire field results in waste, pollution and a higher cost for farmers. Combining image analytics from UAV footage with precision agriculture can assist agronomists in advising farmers on where in the field to focus herbicide application [51,52]. As stated in [53], the first stage in site-specific weed management is to detect weed patches in the field quickly and accurately. The authors therefore trained and evaluated Faster RCNN object detection for weeds in soybean fields using imagery from a low-altitude UAV. The proposed technique was the best model for detecting weeds, obtaining an intersection over union (IoU) of 0.85. Franco et al. [54] used a UAV to capture a thistle weed species, Cirsium arvense, in cereal crops. A UAV provides a detailed view of an agricultural site and is attractive because of its low operational costs and flexible operation. The UAV captured RGB images of thistles from 50 m above the ground; the weed and cereal classes were annotated, and the pixels of each class were grouped under a unique label. According to [51], labeling plants in a field image consumes a lot of time, and very little attention has been paid to the data annotation needed to train a deep learning model. The authors therefore proposed a deep learning technique that detects weeds in UAV images by applying overlapping windows [51]. For each window location, the deep learning model provides the probability that the plant is a weed or a crop.
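The overlapping-window idea can be sketched as follows. This is an illustrative outline under stated assumptions, not the implementation used in [51]: the window size, the stride and the function names are hypothetical, and `toy_weed_probability` is a stand-in for a trained network that would return the probability of a patch containing a weed.

```python
def window_origins(length, window, stride):
    """Start offsets of windows along one axis.

    A final window is appended if needed so that the image border is covered.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if window > length:
        raise ValueError("window is larger than the image")
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts


def overlapping_windows(height, width, window, stride):
    """Yield (top, left, bottom, right) boxes; a stride below the window size gives overlap."""
    for top in window_origins(height, window, stride):
        for left in window_origins(width, window, stride):
            yield (top, left, top + window, left + window)


def score_windows(image, window, stride, classifier):
    """Apply a patch classifier at every window location of a 2D image (list of rows)."""
    height, width = len(image), len(image[0])
    results = []
    for top, left, bottom, right in overlapping_windows(height, width, window, stride):
        patch = [row[left:right] for row in image[top:bottom]]
        results.append(((top, left, bottom, right), classifier(patch)))
    return results


def toy_weed_probability(patch):
    """Hypothetical stand-in for a trained model: the mean of a 0/1 'vegetation' map."""
    values = [value for row in patch for value in row]
    return sum(values) / len(values)


# Toy 4 x 6 image whose two right-most columns are marked as vegetation.
image = [[0, 0, 0, 0, 1, 1] for _ in range(4)]
scores = score_windows(image, window=4, stride=2, classifier=toy_weed_probability)
for box, probability in scores:
    print(box, probability)
```

Because neighbouring windows share pixels, a plant near a window border is still seen in full by at least one window, at the price of evaluating the classifier more often.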
Deep learning can make harvesting robots more effective when generating robust and reliable computer vision algorithms to detect fruit [55]. The usage of UAVs in dataset collection has also been applied in palm oil tree detection [56], rice phenology [57], detection and classification of soybean pests [58], potato plant detection [59], paddy field yield assessment [60] and corn classification [61].
Over the last few decades, UGVs have been used to improve efficiency, particularly by reducing manpower requirements. UGVs have been employed for soil analysis [62], precision spraying [63], controlled weeding [64] and crop harvesting [65]. Mazzia et al. [66] employed a UGV for path planning using deep learning as an estimator. Row-based crops are ideal for testing and deploying UGVs that can monitor, manage and harvest the crops. The study demonstrated the feasibility of the deep learning technique by showing the viability of a complete autonomous global path planner. In [67], a robotic rice harvester uses a deep learning algorithm to detect obstacles and observe the surrounding environment. The image cascade network detects obstacles and avoids collisions with an average success rate of 96.6%.
Besides UAVs and UGVs, deep learning applied to satellite imagery provides a practical solution in the agriculture field. Accurate maps of crop types and acreage are a vital component of agricultural monitoring systems. Satellite imagery can be used to determine the boundaries of smallholder farms, which are hazy, irregular in shape and frequently mixed with other land uses. Persello et al. [68] presented a deep learning technique to automatically delineate smallholder farms using a convolutional network in combination with a globalization and grouping algorithm. The proposed solution outperforms alternative strategies, autonomously delineating field boundaries with F scores greater than 0.7 and 0.6 for the proposed test regions, respectively. Furthermore, satellites are used to capture images for crop identification, as presented in [69]. The authors utilized multiexposure satellite imagery of agricultural [...] [70], enough data should be collected for training in order to predict crop yields and forecast crop prices reliably. Limited data availability is a significant constraint that can be overcome using satellite imagery, which covers huge geographic areas. Combining deep learning with satellite imagery has given significant advantages in extracting field boundaries [71], monitoring agricultural areas [72], weather prediction [73], crop classification [74] and soil moisture forecasting [75].
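Since field delineation results such as those of [68] are reported as F scores, a short sketch of how such a score is computed from binary predictions may be helpful. The version below is the standard F-beta formula over per-pixel boundary decisions; how [68] matches predicted and reference boundaries is not described here, so the pixel-wise comparison and the toy data are assumptions.

```python
def f_score(predicted, reference, beta=1.0):
    """F-beta score for two equally long sequences of binary (0/1) decisions."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")

    true_pos = sum(1 for p, r in zip(predicted, reference) if p and r)
    false_pos = sum(1 for p, r in zip(predicted, reference) if p and not r)
    false_neg = sum(1 for p, r in zip(predicted, reference) if not p and r)
    if true_pos == 0:
        return 0.0  # no correct positive decision: precision or recall is zero

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Toy example (hypothetical): 1 marks a pixel labeled as field boundary.
predicted_boundary = [1, 1, 1, 0, 0, 1, 0, 0]
reference_boundary = [1, 1, 0, 0, 1, 1, 0, 0]
f1 = f_score(predicted_boundary, reference_boundary)
print(f1)
```

With `beta=1.0` precision and recall are weighted equally; a larger `beta` favours recall, which may matter when missing a boundary is costlier than drawing a spurious one.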
Various implementations of deep learning in agriculture have been extensively reviewed in recent years, as in [5,37,76-79]. Among those, Koirala et al. [77] reviewed the application of deep learning in fruit detection and yield estimation, Zhang et al. [80] explored the application of deep learning to dense scene analysis in agriculture and Moazzam et al. [79] emphasized the challenges of weed and crop classification using deep learning. Given the great attention paid to deep learning in the agriculture sector in recent years, and in contrast to existing surveys, this article concisely reviews the use of deep learning techniques in image annotation, focusing on plants and crop areas. This review article presents the most recent five years of research on this method in agriculture, covering new technology and trends. The presentation covers the techniques of annotating images, the learning techniques, the various architectures proposed, the tools used and, finally, the applications. The applications are mainly plant detection, disease detection, counting, yield estimation, segmentation and classification in the agriculture sector. These tasks are difficult to perform manually, time consuming and labor intensive. The limits of human ability to identify objects in these tasks are compensated for by current technology and trends, particularly image annotation and deep learning techniques, which also boost process efficiency. There are many different types of plants, and identifying them, especially rare ones, requires expert knowledge. Additionally, a systematic and disciplined approach to classifying plants is crucial for recognizing and categorizing the vast amount of data acquired on the many known plants. Plant detection and classification are therefore crucial tasks.
Since segmentation helps to extract features from an image, it will improve classification accuracy. A crucial concern in agriculture is disease detection. Disease control procedures can waste time and resources and result in additional plant losses without accurate identification of the disease and its causative agent. Furthermore, in the agriculture industry, counting is essential in managing orchards, yet it can be difficult because of various issues, including overlapping. In particular, counting leaves provides a clear image of the plant's condition and stage of development. Especially in the age of global climate change, agricultural output assessment is essential for solving new concerns in food security. Accurate yield estimation benefits famine prevention efforts in addition to assisting farmers in making appropriate economic and management decisions. Therefore, this manuscript emphasizes these efforts to boost agriculture production by summarizing these tasks using deep learning, which improves prediction and accuracy. Various architecture structures of CNNs are well described as a reference for researchers to better understand the implementation of deep learning in the agriculture sector to illustrate how they work. This article also proposes the future trends and technology that could be implemented to improve the quality and productivity in the agriculture field.

Deep Learning for Image Annotation
Image annotation using deep learning is the most informative method, but it requires more complex training data. It is essential for functional datasets because it informs the training model about the crucial parts of the image, and the model may use those details to recognize the classes in test images. Most automatic image annotation methods first extract features from the training and testing images. Secondly, the annotation model is developed based on the training data. Finally, annotations are generated based on the characteristics of the test images [81]. Figure 1 illustrates the detail of the image annotation process. Feature extraction is a technique for indexing and extracting visual content from images. Color, texture, shape and domain-specific features are examples of primitive or low-level image features [82]. Depending on the approach utilized, various annotation types are used to annotate images. The popular deep learning-based image annotation techniques employed in agriculture are bounding box [83][84][85][86] and segmentation [87][88][89][90]. The study in [91] proposed tools to boost the efficiency of annotating agricultural images, which frequently contain more varied objects and more detailed shapes than those in many general datasets.
Feature extraction within deep learning architectures is widely used in imaging applications. The architectures most frequently applied in recent years are unsupervised pre-trained networks (UPNs), recurrent neural networks (RNNs) and convolutional neural networks (CNNs) [92]. An RNN has the advantage of processing time-series data and making decisions about the future based on historical data. Alibabaei et al. [93] proposed an RNN to predict tomato yield according to the date, climate, irrigation amount and soil water content. RNN variants include long short-term memory (LSTM), gated recurrent units (GRUs), bidirectional LSTM (BLSTM) and bidirectional GRU (BGRU). The study shows that BLSTM is able to capture the relationship between past and new observations and to predict the yield accurately. However, the BLSTM model has a longer training time than the other implemented models. The authors also conclude that deep learning is able to estimate the yield at the end of the season.
The CNN is the most widely used deep learning architecture because of its high detection accuracy, reliability and feasibility [94]. CNNs, or ConvNets, are designed to learn spatial features such as edges, textures, corners or more abstract shapes. These characteristics are learned through diverse, successive transformations of the input, namely convolutions at different spatial scales combined with pooling operations. This process identifies and combines both high-level concepts and low-level features [95]. The method has proven effective at extracting abstract features from a raw image through convolutional and pooling layers [96]. The architecture of CNNs was introduced by Fukushima [97], who proposed algorithms for supervised and unsupervised training of parameters that are learned from the incoming data. In general, a CNN receives the image data at the input layer and generates, at the output layer, a vector of characteristics assigned to object classes. Between the input and output layers are hidden layers consisting of a series of convolution and pooling layers and ending with a fully connected layer [98]. CNNs are widely used as a powerful class of models to classify images in multiple agricultural problems such as fruit classification, plant disease detection, weed identification and pest classification [99]. In addition, they can also detect and count crops. Huang et al. [100] chose a CNN to classify green coffee beans because CNNs are good at extracting image color and shape.
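The layer sequence described above (convolution, pooling, then a fully connected layer that produces class scores) can be traced on a tiny example. The following is a minimal forward-pass sketch in plain Python with a single channel and a hand-set edge kernel; the kernel, the weights and the two output classes are hypothetical values chosen for illustration, whereas a real CNN learns them from annotated images.

```python
import math


def conv2d(image, kernel):
    """Valid 2D cross-correlation (no padding, stride 1), as used in CNN layers."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k_h) for b in range(k_w))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(feature_map):
    """Element-wise non-linearity: negative responses are set to zero."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling, which shrinks the map and keeps strong responses."""
    return [[max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]


def dense(vector, weights, biases):
    """Fully connected layer: one weighted sum per output class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Turn class scores into probabilities that sum to one."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# 6 x 6 toy image with a vertical edge between the dark left and bright right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
vertical_edge_kernel = [[-1, 0, 1]] * 3  # hand-set here; learned in a real CNN

pooled = max_pool(relu(conv2d(image, vertical_edge_kernel)))
flattened = [value for row in pooled for value in row]
weights = [[0.1] * 4, [-0.1] * 4]  # hypothetical weights for two classes
probabilities = softmax(dense(flattened, weights, [0.0, 0.0]))
print(pooled, probabilities)
```

The convolution responds only where the edge lies, pooling keeps that response while halving the resolution, and the fully connected layer converts the pooled features into class probabilities.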
Object detection in deep learning falls into two categories: drawing bounding boxes around objects and classifying the object's pixels. From a labeling perspective, drawing rectangular bounding boxes around an object is much easier than labeling the object's pixels by drawing outlines. However, from a mapping perspective, pixel-level object detection is more accurate than the bounding box technique [101]. According to Hamidinekoo et al. [102], it is challenging to segment and detect individual fruits in images. The authors therefore applied a CNN to classify various parts of the plant inflorescence and estimate fruit numbers from the images. CNNs are also used in detecting fruit and disease. Onishi et al. [103] proposed a high-speed and accurate method to detect the position of fruit for automated harvesting using a robot arm. The authors utilized a single shot multibox detector (SSD), a CNN-based method that detects objects in an image using a single deep neural network. To achieve a high level of recognition accuracy, the SSD creates multiscale predictions from multiscale feature maps and explicitly separates the predictions by aspect ratio. Fruit detection with this method is shown in Figure 2. Some apples are occluded by other fruits and leaves, but the method can still detect them. The study reported a fruit detection accuracy of 90% using the SSD, achieved in only 2 s.
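The difference between the two label types, and the IoU measure commonly used to score box detections, can be illustrated with a small sketch. The helper names, the box convention (x_min, y_min, x_max, y_max with exclusive upper bounds) and the toy mask are assumptions made for this example and are not taken from the cited studies.

```python
def mask_to_bbox(mask):
    """Tightest box (x_min, y_min, x_max, y_max) around the nonzero pixels of a mask.

    Upper bounds are exclusive. Returns None for an empty mask.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)


def box_area(box):
    """Area of a box given as (x_min, y_min, x_max, y_max)."""
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    union = box_area(box_a) + box_area(box_b) - intersection
    return intersection / union if union else 0.0


# Pixel-level annotation of one object (1 = object pixel).
fruit_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
annotated_box = mask_to_bbox(fruit_mask)      # box derived from the mask
mask_pixels = sum(sum(row) for row in fruit_mask)
predicted_box = (2, 1, 5, 3)                  # hypothetical detector output
print(annotated_box, mask_pixels, box_iou(annotated_box, predicted_box))
```

The derived box covers six pixels although the object occupies only five, which is a small-scale picture of why pixel-level labels map objects more precisely than boxes do.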
Another major concern in the agriculture sector nowadays is that many pathogens and insects threaten farms. Because deep learning supports in-depth analysis and computation, it is one of the prominent methods for plant disease detection [104]. Many approaches help to monitor crop health, from semantic segmentation to other popular image annotation techniques. Labeling data for segmentation is more challenging than labeling data for classification. For this reason, several supervised learning-based image annotation methods for object segmentation have been presented in recent years. Sharma et al. [105] used image segmentation to detect disease by employing the CNN method. To obtain maximum data on disease symptoms, the image is segmented by extracting the affected parts of the leaves rather than using the whole image. The quantitative results for each type of disease show that the model is trained very well and achieves excellent results even under real conditions. Kang and Chen [106] performed detection and segmentation of apple fruit and branches, as shown in Figure 3. In Figure 3a-f, apples are drawn in distinct colors and branches are drawn in blue. These detections and segmentations are produced by a CNN. The experiment achieved an accuracy of 0.873 for instance segmentation of apple fruits and 0.794 for branch segmentation.
Khattak et al. [107] proposed a CNN to identify fruits and leaves in healthy and diseased conditions. The results show that the CNN has a test accuracy of 94.55%, making it a suggested support tool for farmers in classifying citrus fruit/leaf condition as either healthy or diseased. In yield estimation, Yang et al. [108] trained a CNN to estimate corn grain yield. The experiment produced 75.50% classification accuracy on spectral and color images. Fuentes [109] showed that a deep learning technique can detect diseases and pests in tomato plants. In addition, the technique is able to deal with complex scenes in the area surrounding the plant. The results obtained are shown in Figure 4a-d, where deep learning achieves high accuracy in detecting diseases and pests. From left to right, each sub-figure shows the input image, the annotated image and the predicted results.
CNN architectures have evolved gradually with an increasing number of convolutional layers, namely LeNet, AlexNet, Visual Geometry Group 16 (VGG16), VGG19, ResNet, GoogLeNet, ResNeXt, DenseNet and You Only Look Once (YOLO). The differences between these architectures are the number of layers, the non-linearity function and the pooling type used [110]. Mu et al. [111] applied VggNet to detect the quality of blueberries through the skin pigments during the seven stages of maturity. The technique was used to overcome the difficulty of identifying the maturity and quality grade of blueberry fruit by eye, and it improved the accuracy and efficiency of blueberry quality detection. Lee et al. [112] proposed three types of CNN architecture with different numbers of layers, namely VGG16 with 16 layers, InceptionV3 with 48 layers and GoogLeNetBN with 34 layers. InceptionV2 inspired the GoogLeNetBN and InceptionV3 architectures and is capable of improving accuracy and reducing computational complexity. Batch normalization (BN) has been proven to limit overfitting and speed up convergence. In a study by [113], three CNN architectures, AlexNet, InceptionV3 and SqueezeNet, were compared to assess their accuracy in evaluating tomato late blight disease. Among these architectures, AlexNet achieved the highest accuracy in feature extraction, at 93.4%. Gehlot and Saini [114] also compared the performance of CNN architectures in classifying diseases in tomato leaves. The architectures assessed in the study are AlexNet, GoogLeNet, VGG-16, ResNet-101 and DenseNet-121. The accuracies of all these architectures are almost equal. However, DenseNet-121 is much smaller, at 89.6 MB, while the largest model, ResNet-101, is 504.33 MB.
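Model size differences of this kind follow largely from the number of trainable parameters. The sketch below counts the parameters of convolutional and fully connected layers and converts the total into megabytes, assuming 32-bit floating-point weights. The three-layer network is a hypothetical toy, not one of the architectures compared in [114], and stored model files may also contain data other than weights.

```python
def conv_params(in_channels, out_channels, kernel_size, bias=True):
    """Trainable parameters of a 2D convolution with square kernels."""
    per_filter = in_channels * kernel_size * kernel_size + (1 if bias else 0)
    return out_channels * per_filter


def dense_params(in_features, out_features, bias=True):
    """Trainable parameters of a fully connected layer."""
    return out_features * (in_features + (1 if bias else 0))


def size_in_mb(param_count, bytes_per_param=4):
    """Approximate weight storage in MiB, assuming 4 bytes (float32) per parameter."""
    return param_count * bytes_per_param / (1024 ** 2)


# Hypothetical toy network: two 3 x 3 convolutions and one classifier layer.
layer_counts = [
    conv_params(in_channels=3, out_channels=16, kernel_size=3),
    conv_params(in_channels=16, out_channels=32, kernel_size=3),
    dense_params(in_features=32 * 8 * 8, out_features=10),
]
total_params = sum(layer_counts)
print(layer_counts, total_params, round(size_in_mb(total_params), 4))
```

Even in this toy, the fully connected layer holds most of the parameters, which is one reason architectures that keep such layers small can be much lighter at similar accuracy.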
Figure 5 presents the details on the image annotation and its deep learning approach technique. Low-level features are used to represent images in image classification and retrieval. The initial stage in semantic comprehension is to extract efficient and effective visual features from an image's unstructured array of pixels. The performance of semantic learning approaches is considerably improved by appropriate feature representation. Numerous feature extraction techniques, including image segmentation, color features, texture characteristics, shape features and spatial relationships, have been proposed [115]. There are five categories of image annotation methods: generative model-based, nearest neighbor-based, discriminative model-based, tag completion-based and deep learning-based image annotation [25,26]. In the past decade, tremendous progress has been made in deep learning techniques, allowing image annotation tasks to be solved using deep learning-based feature representation. The most recent advancements in deep learning enable a number of deep models for large-scale image annotation. A CNN is commonly used by deep learning-based approaches to extract robust visual characteristics. Several versions of CNN architecture, such as LeNet, VGG, GoogLeNet, etc., have been proposed. The following section describes the most commonly employed CNN architectures. The four types of image annotation are image classification, object detection or recognition, segmentation and boundary recognition. All of these task types can be annotated using deep learning techniques.
The training process of deep learning can be supervised, unsupervised or semi-supervised, depending on how the neural network is used. In most cases, supervised learning is used to predict a label or a number. Commonly used benchmarks for evaluating image annotation techniques are based on the performance metrics. Section 4.8 provides the specifics on performance evaluation metrics.

Deep Learning Architecture
A CNN is a special type of multilayer neural network used to recognize visual patterns directly from pixel images with minimal processing. The computer views an image as an array of numbers representing each pixel. Therefore, it is important that the relationship between the pixels persists even after the network has processed the image. To store the spatial relationship between pixels, a CNN is used, in which various mathematical operations are stacked on top of each other to create layers of the network [38].
The CNN architecture consists of convolutional layers, pooling layers and fully connected layers [116]. The basic architecture of a CNN is displayed in Figure 6.



Convolutional Layer
In the feature learning process, the convolutional operation transforms the input matrices using convolutional kernels, which can be understood as filters. The settings of these kernels, namely channels, kernel size, strides, padding and activation function, are also used in conventional image processing techniques, where the parameters need to be set manually. These settings should be determined and optimized based on the practical problem [117]. Each kernel slides over the input image and extracts features from it. The horizontal and vertical sliding of the filter or kernel is known as the convolutional operation [118]. Das et al. [119] explained the roles of strides and padding in the convolution process: the stride reduces the data size by setting how far the kernel moves at each step across the feature map, while padding maintains the dimensions of the feature map by adding zeros symmetrically around the input matrix. The processes of strides and padding are shown in Figure 7. The sliding process connects each neuron after the shift and provides a complete tiling of the input image. The weights and biases are shared by all neurons so that the same feature is detected at all locations of the input image [120].
The output for the next layer, $a_{i,j}$, for the convolutional operation is computed as follows:

$$a_{i,j} = \sigma\left((W * X)_{i,j} + b\right) \quad (1)$$

where σ is the non-linearity introduced in the network, W is the filter or kernel that slides over the input image, X is the input that is provided to the layer and b is the bias term of the filter [121].
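The effect of stride and padding on the feature map can be illustrated with a minimal sketch. It assumes a single-channel input stored as nested lists and uses ReLU as the non-linearity σ; the function name `conv2d` and the toy inputs are illustrative assumptions, not code from the cited studies. As in most deep learning frameworks, the kernel is applied without flipping (cross-correlation), and the output size along each axis is floor((n + 2p - k) / s) + 1 for input size n, padding p, kernel size k and stride s.

```python
def conv2d(image, kernel, stride=1, padding=0, bias=0.0):
    """Slide `kernel` over `image` (nested lists) and return the feature map."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])

    # Padding: add zeros symmetrically around the input matrix.
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]

    # Output size: floor((n + 2p - k) / s) + 1 along each axis.
    out_h = (h + 2 * padding - kh) // stride + 1
    out_w = (w + 2 * padding - kw) // stride + 1

    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the patch under the kernel, plus the bias b.
            z = bias
            for u in range(kh):
                for v in range(kw):
                    z += kernel[u][v] * padded[i * stride + u][j * stride + v]
            row.append(max(z, 0.0))  # sigma is assumed to be ReLU here
        feature_map.append(row)
    return feature_map


if __name__ == "__main__":
    image = [[1.0] * 4 for _ in range(4)]
    kernel = [[1.0] * 3 for _ in range(3)]
    print(conv2d(image, kernel))                       # 2 x 2 map, no padding
    print(conv2d(image, kernel, padding=1))            # 4 x 4 map, size preserved
    print(conv2d(image, kernel, stride=2, padding=1))  # 2 x 2 map, stride shrinks it
```

With a 3 × 3 kernel, a padding of one keeps the 4 × 4 input size, whereas a stride of two halves it again.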

Activation Function
Rectified linear unit (ReLU) is the most notable non-saturated activation function used to enhance the performance of a CNN. The operation of ReLU is shown in Figure 8. It is defined as

$$a_{i,j,k} = \max(z_{i,j,k}, 0) \quad (2)$$

where $z_{i,j,k}$ is the activation function input at (i, j) on the kth channel. The max operation makes the computation faster than the sigmoid or tanh activation functions and does not suffer from the vanishing gradient problem that affects tanh and sigmoid. Moreover, it induces sparsity in the hidden units, which allows the network to easily achieve sparse representations [116,122].
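The two properties mentioned above, sparsity and the absence of saturation, can be checked numerically. The following is a minimal sketch with illustrative function names; the sigmoid is included only for comparison.

```python
import math


def relu(z):
    """ReLU: negative inputs are set to zero, positive inputs pass through."""
    return max(z, 0.0)


def relu_grad(z):
    """Gradient of ReLU: constant 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Gradient of the sigmoid, which shrinks towards 0 for large |z|."""
    s = sigmoid(z)
    return s * (1.0 - s)


if __name__ == "__main__":
    for z in (-5.0, 0.5, 5.0):
        print(z, relu(z), relu_grad(z), sigmoid_grad(z))
```

For a large positive input the ReLU gradient stays at 1, while the sigmoid gradient becomes very small, which is the saturation behavior that leads to vanishing gradients in deep networks.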



Pooling Layer
The pooling layer was first introduced in [123] in order to reduce the amount of data to be processed. Pooling layers, also known as downsampling layers, generate smaller feature maps by reducing the number of parameters and the dimensionality of the input. Even larger images are shrunk down, while the most important features in the images are preserved. In maximum pooling, the maximum value from each patch is kept, preserving the best fit of the feature [124]. There are two commonly used pooling functions: average pooling and maximum pooling. Average pooling calculates the average value of each patch on the feature map, and maximum pooling calculates the maximum value of each patch. Examples of these pooling operations are shown in Figure 9.
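The two pooling functions can be sketched as follows, assuming a feature map stored as nested lists and a 2 × 2 window with a stride of two; the function name `pool2d` and the sample values are illustrative only.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample `feature_map` by taking the max or average of each patch."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            # Collect the values of the current size x size patch.
            patch = [feature_map[i + u][j + v]
                     for u in range(size) for v in range(size)]
            if mode == "max":
                row.append(max(patch))
            else:
                row.append(sum(patch) / len(patch))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    fmap = [[1, 3, 2, 4],
            [5, 6, 1, 2],
            [7, 2, 9, 0],
            [1, 8, 3, 4]]
    print(pool2d(fmap, mode="max"))      # [[6, 4], [8, 9]]
    print(pool2d(fmap, mode="average"))  # [[3.75, 2.25], [4.5, 4.0]]
```

In both cases the 4 × 4 feature map is reduced to 2 × 2, so the following layers process a quarter of the values.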


Fully Connected Layer
The fully connected layer is the final layer after the convolutional and the pooling layers. Here, the data are flattened into a one-dimensional vector and each neuron is connected directly to every neuron in the previous layer. The structure of this layer may consist of one or more hidden layers. The softmax activation function is usually applied in a fully connected layer to classify the input by generating a probability between 0 and 1 for each class. A softmax activation function is defined as in Equation (3) [125]:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad (3)$$

where $z_i$ is the score of the ith class and K is the number of classes.
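A minimal sketch of the softmax function is given below. The class scores in the usage example are made up for illustration, and subtracting the largest score before exponentiation is a common numerical-stability step assumed here; it does not change the result.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    largest = max(scores)  # shift the scores to avoid overflow in exp()
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical scores for three classes, e.g. healthy, diseased, unknown.
    probabilities = softmax([2.0, 1.0, 0.1])
    print(probabilities)
    print(sum(probabilities))
```

The class with the largest score receives the largest probability, and the predicted label is the index of that maximum.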

Loss Function
In every CNN architecture, the last layer is called the output layer. The final classification is assessed by calculating the prediction error produced by the CNN over the training data using a loss function. The loss function is a crucial component of the CNN, because its gradient is used to quantify and reduce the prediction error. Most of the studies on CNNs employ softmax or cross-entropy loss at the output [126,127].
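As an illustration of how the prediction error is calculated, the sketch below computes the cross-entropy loss for a single sample, assuming the network output has already been converted to class probabilities. The function name, the `eps` guard against log(0) and the example probabilities are assumptions made for this example.

```python
import math


def cross_entropy(probabilities, true_index, eps=1e-12):
    """Negative log-probability assigned to the correct class."""
    return -math.log(max(probabilities[true_index], eps))


if __name__ == "__main__":
    predicted = [0.7, 0.2, 0.1]  # hypothetical output probabilities
    print(cross_entropy(predicted, true_index=0))  # confident and correct: small loss
    print(cross_entropy(predicted, true_index=2))  # wrong class favored: large loss
```

The loss approaches zero as the probability of the correct class approaches one and grows quickly when the network assigns a low probability to the correct class.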

Improvement of CNN Architecture
CNNs received wide attention after the success of the AlexNet architecture in 2012, and this achievement marked the start of the other CNN architectures [128]. The other CNN architectures are described in the following subsections.

LeNet
LeNet was the earliest CNN architecture, introduced by LeCun [129] in 1998. The structure consists of three convolutional layers and two fully connected layers. The architecture of LeNet is shown in Figure 10. The network contains five layers with learnable parameters and combines average pooling with three sets of convolutional layers. There are two fully connected layers after the convolution and pooling process. At the end, a softmax classifier sorts the images into their appropriate categories. The study presented in [130] employed this architecture to detect and identify plant diseases of potato and tomato.
A batch size of 150 epochs was used to train the model and resulted in accuracy of detection and recognition of 99%.
Figure 10. Architecture of LeNet [129].


AlexNet
AlexNet was proposed by Alex Krizhevsky [131] in 2012 during the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and won the competition. The proposed architecture reduced the error from 26% to 15.3% by utilizing convolutional layers, max pooling layers, data augmentation, dropout, ReLU activations and stochastic gradient descent (SGD). AlexNet has 60 million parameters and eight layers: five convolutional layers and three fully connected layers. Every convolutional and fully connected layer uses the non-saturated ReLU, which improves the training response over tanh and sigmoid [132]. Figure 11 shows the architecture of the AlexNet convolutional network used by Patino et al. [133] to classify tropical fruits, with 2633 images of fruits divided into 15 categories of high variability and complexity. The authors of [134] employed AlexNet to train on different datasets consisting of vegetable images. According to the experiment, the accuracy rate reached 92.1%, compared to 80.5% for the SVM method.

VGG
The VGG architecture was first proposed by Simonyan and Zisserman [135] in 2014 as an improvement of AlexNet obtained by changing the size of the kernel filters. At the same time, VGG aimed to improve the training time and reduce the number of parameters. It has been applied in various image classification tasks and was trained on more than 14 million images belonging to 1000 classes. It improved on the AlexNet model, which was considered the most popular image classifier, and carried on the ReLU tradition of AlexNet. There are many variants of VGGNet, including VGG-16, VGG-19, etc. The architecture of VGG-16 consists of five blocks of convolutional layers followed by three fully connected layers, with 138 M parameters in total [136]. Figure 12 shows the architecture of VGG-16 as used by [137] in the classification of jujube. In contrast to AlexNet, VGG-16 has a deeper network and a uniform structure consisting of 16 trainable layers: 13 convolutional layers and three fully connected layers.
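The figures of 16 trainable layers and roughly 138 M parameters can be checked with a short calculation. The sketch below assumes the commonly published VGG-16 configuration, which is not spelled out in this review: 3 × 3 kernels, channel widths of 64 to 512, a 224 × 224 RGB input, five 2 × 2 max pooling steps and fully connected layers of 4096, 4096 and 1000 units. The names `VGG16_CONV`, `VGG16_FC` and `count_vgg16_parameters` are illustrative.

```python
# Assumed VGG-16 layout: numbers are output channels of 3 x 3 convolutions,
# "M" marks a 2 x 2 max pooling step that halves the spatial size.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
VGG16_FC = [4096, 4096, 1000]


def count_vgg16_parameters(in_channels=3, image_size=224):
    """Return (parameters, convolutional layers, fully connected layers)."""
    parameters, conv_layers = 0, 0
    channels, size = in_channels, image_size
    for item in VGG16_CONV:
        if item == "M":
            size //= 2  # pooling has no learnable parameters
        else:
            # Each filter has 3 * 3 * channels weights plus one bias.
            parameters += (3 * 3 * channels + 1) * item
            channels = item
            conv_layers += 1
    features = channels * size * size  # flattened input to the first FC layer
    for units in VGG16_FC:
        parameters += (features + 1) * units  # weights plus one bias per unit
        features = units
    return parameters, conv_layers, len(VGG16_FC)


if __name__ == "__main__":
    total, n_conv, n_fc = count_vgg16_parameters()
    print(total, n_conv, n_fc)  # 138357544 13 3
```

Under these assumptions, about 124 M of the 138 M parameters belong to the three fully connected layers, while the 13 convolutional layers account for fewer than 15 M.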



GoogLeNet/Inception
GoogLeNet is based on the Inception architecture and uses a module that allows the network to choose between multiple convolution filter sizes in each block. It was proposed by researchers at Google in 2014 and won the ILSVRC 2014 image classification challenge. The error rate of GoogLeNet was significantly lower than that of AlexNet. The architecture consists of a 22-layer deep network whose quality was assessed in detection and classification [138]. Then, the authors of [139] improved the architecture to InceptionV3, raising the ImageNet classification accuracy. Updated versions of Inception are referred to as InceptionVN, where N is the version number. Then, in 2016, the architecture of Inception was updated to InceptionV4 by combining the Inception architecture with residual connections in research by Ni et al. [140].
Ni et al. [141] implemented GoogLeNet due to its superior performance in the identification of fruit and vegetables. This architecture was used to monitor the change process of banana. The model was trained for 4320 iterations to recognize the freshness of banana and obtained a recognition accuracy of 98.92%. Its architecture is illustrated in Figure 13.

Residual Network (ResNet)
ResNet is a specific type of neural network that was introduced by He et al. [142] in 2015 and won 1st place in the ILSVRC 2015 competition with an error rate of 3.5%. It makes it possible to train networks with 100 and even 1000 layers. A layer in ResNet receives its input both from the previous layer and from its residual units. The architecture consists of 34 layers, starting with one additional max pooling layer and ending with one average pooling layer [143]. The architecture of ResNet is shown in Figure 14.
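The idea of a residual unit can be sketched in a few lines, assuming the standard formulation in which the block output is F(x) + x followed by ReLU. Here `transform` is a hypothetical stand-in for the stacked convolutional layers F, and plain lists replace feature maps; this is an illustration of the shortcut connection, not the implementation used in the cited works.

```python
def residual_block(x, transform):
    """Add the block input back onto the transformed signal, then apply ReLU."""
    fx = transform(x)  # F(x): output of the stacked layers (same length as x)
    return [max(f + xi, 0.0) for f, xi in zip(fx, x)]


if __name__ == "__main__":
    x = [1.0, 2.0]
    # If the stacked layers output zeros, the block simply passes x through.
    print(residual_block(x, lambda v: [0.0 for _ in v]))      # [1.0, 2.0]
    # Otherwise the layers only need to learn a correction to x.
    print(residual_block(x, lambda v: [0.5 * a for a in v]))  # [1.5, 3.0]
```

Because the input is carried forward unchanged through the shortcut, the stacked layers only have to learn the difference from the identity, which is what makes very deep networks trainable.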



DenseNet
DenseNet refers to a densely connected convolutional network introduced by Huang et al. [144]. It has an interesting pattern of connections, in which each layer is connected to all the others within a dense block. Each layer takes the feature maps of all preceding layers as input, and its own feature maps are used as input for all subsequent layers. This means all layers are able to access the feature maps of the layers before them. DenseNet can alleviate the vanishing gradient problem, promote feature reuse, strengthen feature propagation and significantly reduce the number of parameters. The structure of DenseNet with five layers and expansion of four is shown in Figure 15. A limitation of DenseNet is its large memory consumption. Therefore, Huang et al. [145] suggested CondenseNet, which reduces memory use and increases speed by learning groups of convolution operations and pruning during training [146].


You Only Look Once (YOLO)
YOLO was developed by Redmon [147] in 2015 to reframe object detection as a regression problem rather than a classification problem. YOLO uses a single neural network to predict the bounding boxes and assign class probabilities. The YOLO model is simple and can be trained directly on full images. The loss function on which YOLO is trained corresponds directly to detection performance, and the entire model is trained jointly. The architecture of YOLO is shown in Figure 16. It has 24 convolutional layers that extract features from an image, followed by two fully connected layers that predict the probabilities and coordinates of the output. Many variants of YOLO have been developed as improvements of the previous version, namely YOLOv2 [148], YOLOv3 [149] and YOLOv4 [150]. Basically, each version enhances the framework design: the original YOLO uses DarkNet trained on ImageNet, YOLOv2 uses DarkNet-19, YOLOv3 uses DarkNet-53 and YOLOv4 uses CSPDarkNet53. Lippi et al. [151] preferred YOLO for the early detection of pests, as this model represented the fastest and most effective solution. Among the various versions of YOLO, the authors implemented YOLOv4, as it has been proven to outperform the previous ones in terms of accuracy and speed on assorted standard datasets. YOLOv3 with the DarkNet-53 framework was employed by Chang et al. [152] to achieve real-time plant species recognition. The experiment's findings demonstrate that the deep classifier was able to identify three different plants. Gai et al. [153] improved YOLOv4 in a cherry fruit detection application by replacing its backbone network, CSPDarkNet53, with DenseNet. The improvement generated advanced feature extraction, deepened the network structure and provided faster detection than the previous YOLOv4. The average accuracy given by the improved YOLOv4 is 0.15 higher than that of YOLOv4.
In 2020, Jocher [154] released YOLOv5, which is fast, accurate and easy to train. It is well known for successful real-time object detection trained on the COCO dataset. The backbone of YOLOv5 is the cross stage partial network (CSPNet), which extracts rich informative features from an input image and, by utilizing deeper networks, improves the processing time. YOLOv5 has been implemented in detecting wheat spikes using UAV [155], detecting the maturity of strawberry fruit [156], detecting defects of kiwi fruit [157] and detecting apple fruit [158].
Figure 16. The architecture of YOLO [147].
There are many more ConvNets that have been proposed. ConvNets have improved greatly over time, owing primarily to increased processing power, new concepts, experiments and worldwide interest in deep learning. Those ConvNets are summarized in Table 1. The accuracy values were taken from image classification on ImageNet, a large visual database intended for use in the development of visual object recognition software.
After creating a model based on a deep learning technique and receiving some output in the form of a class, the next step is to use test datasets to determine how effective the technique is. The most crucial aspect in data science research is to evaluate the model, which determines how accurate the prediction is. Deep learning algorithms are evaluated using a variety of performance metrics. Each deep learning result generates accuracy based on the percentage of accuracy, intersection of union (IoU), performance score (F 1 ), mean average precision (mAP) and correlation coefficient (R 2 ). This study is focused on these performance metrics that were employed in the previous studies in this article. It is crucial to pick the right metrics to evaluate the deep learning technique used. The metrics are used to determine how deep learning algorithm performance is evaluated and compared. The simplest intuitive performance metric is accuracy, which is just the ratio of properly predicted observations to the total observations. A high accuracy percentage shows which model is the best and how good the model has performed. IoU, also known as the Jaccard index, is the computation of the ratio of the intersection and the union of two sets. These region-based measures do not examine the accuracy of the segmented region boundaries, which is significant during automated tree training operations. In addition, IoU measurements are strict because they penalize false positives and favor regional uniformity over border accuracy [168,169]. The F 1 score computes the performance of detection by using recall and precision. Recall and precision measure the fraction of true-positive objects that are successfully detected and objects in the prediction [106]. The F 1 score is greatest at 1 (perfect precision and recall) and lowest at 0. In other words, recall is the number of well-predicted positives divided by the total number of positives. 
It shows the percentage of positives that are well predicted. Precision is analogous to recall but is computed over the predictions: it divides the number of correctly predicted positives by the total number of positive predictions. If the value of precision is high, the majority of the objects predicted as positive are truly positive. The calculations of accuracy, precision, recall, F1 score, IoU and class accuracy are shown in (4)-(8). A true positive is an annotation that is correctly drawn with an IoU of > 0.5, a true negative is every part of an image where no object is predicted, a false positive is a missing annotation and a false negative is an annotation that has an IoU score of < 0.5.
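As a minimal sketch of how the metrics described above are commonly computed, the following Python code derives accuracy, precision, recall and the F1 score from confusion counts, and the IoU of two axis-aligned bounding boxes. The function names and the corner-coordinate box format are illustrative assumptions and are not taken from the reviewed studies.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion counts.

    tp, tn, fp, fn are the numbers of true positives, true negatives,
    false positives and false negatives. Ratios with an empty
    denominator are reported as 0.0 (an assumption of this sketch).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def box_iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```

For example, `box_iou(predicted_box, annotated_box) > 0.5` applies the 0.5 IoU threshold mentioned in the text; the resulting counts can then be passed to `classification_metrics`.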

Results and Discussion
Deep learning architecture is adaptable, which means it may be used to solve new challenges in the future. Moreover, this method can be applied to a wide range of tasks and data types. Therefore, this study summarizes the previous studies implementing a deep learning algorithm in the agriculture sector, shown in Table 2. Deep learning is applied in agriculture to detect and classify crops and diseases, estimate yield, extract borders, etc. The first step in those studies is to gather a correctly annotated dataset that is large enough for a complex model to produce satisfactory results when trained on it. A CNN requires a large amount of training data in order to perform well. If the dataset is not particularly large, though, image augmentation can be employed to enlarge a small dataset artificially. It has been observed that augmenting existing data, rather than collecting new data, enhances the classification accuracy of a deep learning model. Data augmentation also helps to reduce overfitting and achieve high accuracy.
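The idea of image augmentation can be illustrated with a short sketch. The Python code below, which assumes NumPy and an image stored as a height x width x channels array, produces flipped and rotated variants of a single image. The function name `augment` and the particular set of transforms are illustrative choices, not the procedure of any specific reviewed study.

```python
import numpy as np


def augment(image):
    """Return simple geometric variants of an H x W x C image array.

    The list contains the original image, a horizontal flip, a vertical
    flip and rotations by 90, 180 and 270 degrees. Rotations by 90 and
    270 degrees swap height and width for non-square images.
    """
    return [
        image,
        np.fliplr(image),      # mirror left-right
        np.flipud(image),      # mirror top-bottom
        np.rot90(image, k=1),  # rotate 90 degrees in the image plane
        np.rot90(image, k=2),  # rotate 180 degrees
        np.rot90(image, k=3),  # rotate 270 degrees
    ]
```

Each input image yields six training samples here. For detection or segmentation tasks, the same transform has to be applied to the bounding boxes or masks so that the annotations stay aligned with the image.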
Nearly all of the studies used smartphones to capture the object images because of the improved resolution of smartphone cameras and because smartphones are simple and cost-effective. The smartphone's rapid evolution has made it a leading choice in the agriculture industry. Smartphones can also detect objects in real time, owing to advances in deep learning methods for object detection. Most border extraction methods rely on satellite images, because satellite imagery offers a powerful contact-free method with wide views and large volumes of data. Processing satellite images is computationally intensive; therefore, deep learning is very helpful in analyzing them. In image processing, the images most commonly used are red, green and blue (RGB) images. A color image is generated on the screen when the red, green and blue channels are stacked on top of each other.
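The stacking of color channels can be sketched in a few lines. The Python code below assumes NumPy and three single-channel arrays of equal size; the helper name `stack_rgb` and the tiny example arrays are hypothetical and serve only as an illustration.

```python
import numpy as np


def stack_rgb(red, green, blue):
    """Stack three H x W channel arrays into one H x W x 3 color image."""
    return np.stack([red, green, blue], axis=-1)


# A 4 x 6 image that is pure red: full intensity in the red channel,
# zero intensity in the green and blue channels.
height, width = 4, 6
red = np.full((height, width), 255, dtype=np.uint8)
green = np.zeros((height, width), dtype=np.uint8)
blue = np.zeros((height, width), dtype=np.uint8)
rgb = stack_rgb(red, green, blue)
```

The resulting array has one (red, green, blue) triple per pixel, which is the layout most image libraries expect for display and for input to a CNN.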
According to the previous studies, many datasets were utilized to train the models, and images annotated with multiple regions have been proposed. Images are initially segmented or bounded into multiple regions, and features such as color, shape and texture are extracted. Automatic image annotation aims to learn a semantic concept model from a large number of datasets or image samples and apply it to new images. After training, images are labeled with semantic labels, and test images can be assessed using keywords in the same way as the training images. As presented in the summary table, many studies on automatic image annotation have applied deep learning-based approaches. Convolution operations extract image features, and a deep neural network training model establishes the connection between the images and labels. A deep learning algorithm is generated based on the image features, primarily using the supervised technique. The algorithm is tested with sample images and the prediction performance is analyzed. The annotation result is either a bounding box or a segmentation, together with the predicted performance value. Deep learning has made tremendous progress in image labeling by automatically annotating images to recognize plant species, maturity and health. The proposed deep learning technique is compared with other techniques to assess its performance. The best performance or highest accuracy indicates that the proposed technique is a better model to employ when detecting or classifying an object. In addition, the capability of the proposed deep learning technique is tested on a variety of datasets. Even with the same technique, different datasets yield different performance because of differences in dataset properties. However, the model's performance can often be improved during training by increasing the number of epochs and iterations.
As proposed in [170], different datasets are used to detect various diseases in different plants. The disease is successfully detected by implementing a VGG-16-based deep learning method, which is compared to InceptionV3 and GoogLeNetBN. VGG-16 outperforms the other techniques with 98.8% accuracy. According to the study, pre-training with plant-specific tasks reduced the impact of overfitting for a deeper Inception model, but the VGG-16 model demonstrated better generalization when adapting to new data. In addition, VGG-16 outperforms the Inception technique because of a lack of variability in the dataset, which limits the implementation of deeper architectures. The summaries of the various techniques from the previous studies provide a wide overview of the performance metric results obtained when deep learning algorithms are used for the task. This information about performance metrics can help researchers choose a suitable deep learning algorithm for their studies. In some circumstances, the model does not produce particularly accurate predictions. Training the model can also take a long time. As a result, it is crucial to improve accuracy while decreasing training time. The number of epochs, input size, network depth and width, and slow weight updates can all contribute to the training time.
Based on the results, most of the previous studies reported the performance of deep learning on annotated images as an accuracy percentage. This is because accuracy simply gives the ratio of correctly drawn annotations to total predicted annotations. Therefore, it is the most intuitive approach to evaluate a task's performance. However, although accuracy is the simplest measurement, it is also the least insightful when it comes to evaluating an annotation task's performance. In fact, accuracy as used in most real-world annotation tasks ignores false negatives and false positives, which could lead to biased and inaccurate conclusions about the quality of the task.
In some cases, the authors prefer the F1 score because it summarizes a model's prediction effectiveness by merging two competing metrics: recall and precision. Accuracy is employed when true positives and true negatives are more important, whereas the F1 score is used when false negatives and false positives are critical. Likewise, accuracy can be employed if the class distribution is balanced, whereas F1 is a good choice if the classes are imbalanced. Overall, these evaluation metrics help present the quality of the annotated images produced using the deep learning technique.
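The difference between accuracy and the F1 score on imbalanced classes can be made concrete with a small worked example. The Python sketch below uses invented labels (95 healthy plants labeled 0 and 5 diseased plants labeled 1) and a hypothetical model that predicts "healthy" for every image; neither the data nor the function name comes from the reviewed studies.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, with `positive` as the
    class of interest. F1 is reported as 0.0 when it is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Equivalent to the harmonic mean of precision and recall.
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Invented, imbalanced example: 95 healthy (0) and 5 diseased (1) plants.
y_true = [0] * 95 + [1] * 5
# A model that always predicts "healthy".
y_pred = [0] * 100

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(accuracy, f1)  # 0.95 0.0
```

The model reaches 95% accuracy without detecting a single diseased plant, while its F1 score for the diseased class is 0, which is why F1 is the more informative choice when classes are imbalanced.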
Based on the previous studies, the image annotation implemented in agriculture is summarized in the pie chart shown in Figure 17. Most of the deep learning techniques are applied to detect and classify plants and their diseases. Crop detection and classification are difficult tasks due to the wide range of interclass forms, colors and textures. As a result of these limitations, there is a shortage of automated fruit classification systems for various groups. Crop detection and classification using an advanced information system could effectively identify the right fruit with the right nutrition. On the other hand, plant detection and classification are applied in harvesting robots to pick fruit and vegetables. Robotic harvesting has a high potential to be used in crop detection by reducing the cost of labor while improving fruit quality. In fact, the problems of plant diseases are a worldwide issue related to food production. Plant diseases adversely affect the economy and incur losses to farmers. Therefore, utilizing deep learning for annotating images in agriculture helps earlier disease detection and prevents the plants from becoming worse.

Conclusions
This study presented a comprehensive review of the application of image annotation using the deep learning technique in the agriculture field. Image annotation is extremely useful in the agriculture industry for increasing crop production. It assists in recognizing and classifying plants and their diseases. The employment of deep learning in image annotation yields high performance on the dataset or image, based on accurate prediction. In previous studies, bounding boxes were one of the most popular and recognized image annotation methods. Annotators are expected to outline the object in bounding boxes in accordance with the specified deep learning requirements. In addition, bounding boxes are one of the least expensive and least time-consuming annotation methods available.
The application of deep learning in agriculture industries, as well as the technologies presented to improve agriculture productivity, are studied in this article. The CNN type of deep learning has been widely used in the agriculture sector since it promises to recognize features from images without the need for human interaction. Moreover, the technique provided superior classification and precision accuracy when compared to other techniques. In this article, the implementation of various architectures of deep learning is discussed in order to evaluate the method's effectiveness in terms of plant and disease detection, classification, counting and segmentation of plants. This review has found that deep learning has high performance in image processing techniques.
Despite the efforts of numerous researchers, the task of developing a rapid and reliable fruit detection system remains under investigation. This is due to the wide range of colors, shapes, sizes, textures and reflectance qualities of fruit in field settings. Many advancements in network architecture were created to improve image detection, classification and segmentation accuracy. It is crucial to find a suitable deep learning architecture in order to achieve high accuracy, a low error rate and a shorter training time. The architectures of deep learning that are commonly used are summarized in this study. The results reveal that ConvNet accuracy has been gradually improving over time in most circumstances. Evaluation is an important aspect of any project based on a learning algorithm. Testing with a performance metric shows whether the model generates satisfactory results. Performance metrics assist in identifying how effectively the model generalizes to new data. The majority of the previous studies in this article employed the accuracy metric to evaluate model performance. However, accuracy is inappropriate when dealing with imbalanced data, in which the number of points in one class is significantly greater than the number of points in another. Other performance metrics, including the F1 score, recall and precision, can help to overcome this issue. As a recommendation, future studies on deep learning in agriculture should aim to reduce training time and prediction error by employing more advanced deep learning architectures.
The technique of deep learning can be used more widely in agriculture industries to improve plant productivity and quality. To encourage the usage of greater intelligence, deep learning can be integrated with other technologies such as robotics and the Internet of Things (IoT). The use of deep learning in robotic harvesting, planting and logistics could be beneficial. Using IoT technologies, farmers can boost yields by controlling every variable in crop production, such as moisture levels, soil conditions, pest stress and microclimates. Precision agriculture allows farmers to enhance efficiency and minimize expenses by providing more precise strategies for planting and growing crops. However, these technologies raise security issues at all levels. They also introduce new, specific security concerns such as accuracy and device and data integrity. Security issues could include the hijacking of autonomous devices such as UAVs and robots. If a malicious agent hijacks an autonomous system, the hijacker can control and direct the device remotely without authorization. This type of attack could have several consequences, including the inability of the system to fulfill a task. The malfunction could result in significant losses due to incorrect crop management and damage to equipment and to the autonomous system itself. Security measures must be integrated into the system to maximize its effectiveness. It is critical to create security schemes to detect incidents and avoid corrupt or inconsistent data. As a result, advancing to the next levels of robots and IoT technologies necessitates solutions with security measures that provide dependability and accuracy in implementing these systems.
In addition, the requirement for massive amounts of labeled data remains a major obstacle for supervised deep learning methods. This problem is particularly apparent in the agriculture industry, where hundreds of images must first be annotated by humans before training. Moreover, the labeling process frequently needs the participation of field experts, who are in short supply. If there are not enough labeled data for supervised learning algorithms to work, semi-supervised learning can be utilized in the agriculture sector to solve real-world problems. Semi-supervised learning is a compromise between supervised and unsupervised learning. Because researchers spend the majority of their time organizing data, semi-supervised learning is attractive in that it allows working with limited labeled data. It is typically employed when labeling or acquiring data is too complex or expensive. It can also be used if the quality of the labeled data is poor.
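One common form of semi-supervised learning is pseudo-labeling (self-training), in which a model trained on the few labeled samples assigns labels to the unlabeled samples it is confident about. The Python sketch below shows a single round of this idea with a nearest-centroid classifier on feature vectors, assuming NumPy and at least two classes. The function name `pseudo_label`, the `margin` confidence rule and the classifier choice are all assumptions made for illustration and do not describe a method from the reviewed studies.

```python
import numpy as np


def pseudo_label(x_labeled, y_labeled, x_unlabeled, margin=1.0):
    """One round of self-training with a nearest-centroid classifier.

    An unlabeled sample is accepted when its distance to the second
    closest class centroid exceeds its distance to the closest one by
    at least `margin` (a hypothetical confidence rule).
    Returns the enlarged training set and a mask of accepted samples.
    """
    classes = np.unique(y_labeled)
    # One centroid per class, computed from the labeled samples only.
    centroids = np.stack(
        [x_labeled[y_labeled == c].mean(axis=0) for c in classes]
    )
    # Distance of every unlabeled sample to every centroid.
    diffs = x_unlabeled[:, None, :] - centroids[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    ordered = np.sort(distances, axis=1)
    confident = (ordered[:, 1] - ordered[:, 0]) >= margin
    predicted = classes[np.argmin(distances, axis=1)]
    # Add only the confidently labeled samples to the training set.
    x_new = np.concatenate([x_labeled, x_unlabeled[confident]])
    y_new = np.concatenate([y_labeled, predicted[confident]])
    return x_new, y_new, confident
```

In practice the feature vectors would come from a trained network, and the round would be repeated with a retrained model until no further samples are accepted; ambiguous samples are left for expert annotation.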