Enhancing Image Annotation Technique of Fruit Classification Using a Deep Learning Approach

An accurate image retrieval technique is required due to the rapidly increasing number of images. It is important to implement image annotation techniques that are fast, simple, and, most importantly, annotate automatically. Image annotation has recently received much attention due to the massive rise in image data volume. Focusing on the agriculture field, this study implements automatic image annotation, namely, a repetitive annotation task technique, to classify the ripeness of oil palm fruit and recognize a variety of fruits. This approach assists farmers in enhancing fruit classification methods and increasing their production. This study proposes simple and effective models using a deep learning approach with You Only Look Once (YOLO) versions. The models were developed through transfer learning, where the dataset was trained with 100 RGB images of oil palm fruit and 400 RGB images of a variety of fruits. Model performance and accuracy in automatically annotating images with 3500 fruits were examined. The results show that the annotation technique successfully annotated a large number of images accurately. The mAP achieved for oil palm fruit was 98.7% and that for the variety of fruits was 99.5%.


Introduction
Annotation involves a great deal of repetition when performed completely manually. Numerous artificial intelligence-related tasks require massive datasets. Although it may be possible for a person to annotate everything, doing so is rarely desirable. Automatic image annotation (AIA) is a technology that can annotate images automatically with semantic tags. Significant applications of AIA include image retrieval [1], classification [2], recognition [3], and medical diagnostics [4,5]. AIA is able to adapt to complex patterns as more training data become available, and deploys a common strategy to annotate a new image: firstly, similar images from the training set are retrieved, and then labels are ranked based on their frequency in the retrieval set. The most frequent labels in the neighborhood are thus transferred to the test image to achieve automatic image annotation [6].
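The retrieve-and-rank label-transfer strategy described above can be sketched as follows; the Euclidean distance on raw feature vectors is an illustrative assumption, since real AIA systems compare learned image features:

```python
from collections import Counter

def transfer_labels(test_features, training_set, k=5, n_labels=3):
    """Annotate a test image by ranking labels from its k nearest
    training images (the retrieval set) by frequency.

    `training_set` is a list of (feature_vector, labels) pairs.
    Euclidean distance is a placeholder for a learned similarity.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Retrieve the k most similar training images.
    neighbours = sorted(training_set,
                        key=lambda item: dist(test_features, item[0]))[:k]

    # Rank labels by their frequency in the retrieval set and
    # transfer the most frequent ones to the test image.
    counts = Counter(label for _, labels in neighbours for label in labels)
    return [label for label, _ in counts.most_common(n_labels)]
```

For example, a test image close to two "ripe" training images would receive "ripe" as its top-ranked label.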
Computer vision in agriculture automation is challenging due to the considerable variation within a class of fruit species as well as their similarities in color, size, and shape. As a result, manually annotating fruit takes time and effort. The detection and accurate classification of fruits is a fascinating issue associated with enhancing the quality and economic potential of fruits, especially in an industrial field. The challenge becomes more significant when automating tasks such as matching fruit quality with other information such as nutritional facts and pricing [7]. The classification of oil palm fresh fruit bunch (FFB) ripeness is significant in ensuring the quality of the oil. The ripeness of oil palm fruits dictates the quality of the palm oil produced and its overall marketability. The color of the oil palm FFB may be used to estimate its ripeness. Color is one of the most important characteristics for determining fruit ripeness [8]. The color of an item is determined by the light reflected from it; therefore, these changes serve as a foundation for image processing and analysis. The primary components of the color coding are red, green, and blue (RGB). The Malaysian Palm Oil Board [9] has classified FFB as unripe, underripe, ripe, and overripe based on color, as shown in Table 1. Table 1. Oil palm fruit ripeness and color classification. Manually grading oil palm fruit is a typical technique for identifying its quality, but this technique is time-consuming and may result in human error [10,11]. It is crucial to identify the ripeness of oil palm fruit. Incorporating artificial intelligence, computer technology has provided a variety of solutions to alleviate this dependence. Many researchers have recently used artificial intelligence techniques for object detection and classification challenges, with beneficial outcomes [12]. A reliable, fast, and accurate approach for detecting oil palm FFB ripeness is required.
Therefore, AIA using a deep learning approach benefits both academic and commercial applications of this technique. The automatic annotation of oil palm fruit classification can assist farmers in increasing production and makes the work easier. Oil palm is often used to make margarine, candles, soaps, home cooking oil, and snacks, and it is Malaysia's main agricultural commodity export [13].

Despite the prevalent deep learning-based strategies for improving AIA framework implementation, AIA remains vulnerable to several critical issues. Among these difficulties is the need for a large amount of data to make an accurate prediction. The control of inconsistent keyword distribution, as well as the selection of relevant characteristics, are the other two primary AIA problems [14]. With the development of artificial intelligence, deep learning is widely used in image annotation. Deep learning, which encompasses artificial neural networks and computational models, is a subset of machine learning. Its method is designed to replicate the topology of biological neural networks and mimic the function of the brain [14]. When the brain acquires new information, it seeks to make sense of it by comparing it to previously acquired knowledge. Deep learning decodes information using the same approach that the brain employs to categorize and identify items. Deep learning accelerates and simplifies this process, which is particularly beneficial to data scientists who are tasked with obtaining, analyzing, and interpreting huge volumes of data [15,16]. YOLO treats detection as a regression problem that combines target classification and localization. A YOLO network uses regression to recognize targets in an image without the need for a region proposal network (RPN). The network approaches the ability of the human visual system to recognize objects instantly [17]. Moreover, YOLO is extremely efficient and works impressively well for real-time object detection [18]. Nowadays, there are several YOLO variants with various architectures. The original YOLO has 24 convolutional layers followed by two fully connected layers.
Annotation strategies that are fast and simple to use are recommended for effectively overcoming such obstacles. AIA approaches aim to develop a model from the training data and then use the trained model to automatically give semantic labels to the new image. With the recent attention and development of AIA in contributing to significant tasks, this study focuses on enhancing automatic image annotation techniques, namely, repetitive annotation tasks. This AIA-enhancing technique contributes to solving the problem of massive image data and thus reduces the time and human effort needed to manually annotate images. A repetitive training task to annotate images, combined with deep learning techniques, increases the accuracy and efficiency of the AIA technique. The proposed repetitive annotation technique can be applied with various deep learning methods to automatically annotate objects. However, to evaluate the effectiveness of the proposed technique, this study chooses YOLOv5 as the algorithm platform to generate accurate predictions, as YOLOv5 offers high accuracy and fast performance. The annotation of oil palm FFB using a repetitive annotation task assists farmers in identifying the ripeness of oil palm FFB from harvesting through to the milling process.

Related Works
In recent decades, computer vision researchers have successfully endeavored to invent computer systems capable of imitating this human skill. AIA is a step ahead in this approach, detecting each item in an image and assigning appropriate tags to explain its content. AIA has made breakthroughs in the agricultural industry through numerous advanced equipment systems and procedures, making this field more productive and profitable. Various works presented in the literature address the technique of AIA in agriculture. Nemade and Sonavane [19] examined the annotation of fruit by deploying co-occurrence patterns, which aid in identifying fruit quality categories and the combinations of attributes that contribute to those patterns. The findings indicate that, for the fruit categories, the co-occurrence pattern using SVM yields an overall accuracy of 97.3%. Instead of the traditional two-step procedure of acquisition followed by human annotation, Samiei et al. [20] evaluated the value of several egocentric vision approaches for performing joint acquisition and AIA. This approach was applied to automatic apple segmentation and obtained high performance in annotating images by implementing a machine learning application. A review of image annotation techniques in the agriculture field was presented by Mamat et al. [21]. The study summarized the implementation of deep learning techniques, the image annotation approach, and the various applications of deep learning techniques in the agriculture industry.
A lack of accessibility to efficient categorization systems might be a problem for farmers. The texture, shape, and color of a fruit are used to grade its ripeness, which may lead to variations and inefficiency in grading. Many methods have been introduced to address this obstacle and implement deep learning techniques to categorize the ripeness of oil palm fruit. Jamil et al. [22] established the first artificial intelligence system for oil palm fruit ripeness classification in 2009. Their AI system uses a Neuro-Fuzzy model trained on color data collected from 90 images. The algorithm correctly classified 45 test photos with 73.3% accuracy [23]. Using deep learning methods in the agriculture field, Khamis et al. [24] proposed YOLOv3, Elwirehardja and Prayoga [10] deployed MobileNetv1, Liu et al. [25] deployed YOLOv4-tiny, Janowski et al. [26] implemented YOLOv5 in detecting apples, and Herman [27] used DenseNet to classify the ripeness of oil palm fruit. The application of AIA techniques is useful in improving the fruit harvesting process. Harvesting robots [28] using computer vision have been deployed to pluck fruit from trees according to farmers' requirements. Furthermore, these AI-enabled machines are developed using training datasets generated by image annotation. Tang et al. [29] reviewed the applications of fruit-picking robots using machine vision and related developing technologies that hold enormous promise in sophisticated agriculture applications.
YOLO version 4, commonly known as YOLOv4, was released in early 2020 by Alexey Bochkovskiy [30], building on the Darknet architecture with which Joseph Redmon [31] produced the first three versions of YOLO. Glenn Jocher [32] and his Ultralytics LLC research division, who developed YOLO algorithms using the PyTorch framework, released YOLOv5 a month after YOLOv4. YOLOv5 is simple and efficient. It requires far fewer computing resources than other designs while producing equivalent results and performing significantly faster than previous YOLO versions [33]. This significance makes YOLOv5 widely used in agricultural areas [34,35]. Wang et al. [36] detected real-time apple stems by deploying YOLOv5. The study first determined the hyper-parameters and used transfer learning as a training approach to achieve stronger detection performance. Next, networks with different depths and widths were trained to find the detection baseline. Subsequently, YOLOv5 was optimized for this task by using detection head searching, layer pruning, and channel pruning. The results showed that YOLOv5 was easier to use under the same settings and could be chosen as the baseline network based on its detection performance. Other applications of YOLOv5 in agriculture have been proposed in crop detection by Yan et al. [37], classification by Wang et al. [38], disease recognition by Chen et al. [39], and counting by Lyu et al. [40].
Inspired by the previous research, this study chooses YOLOv5 as the method with which to investigate the proposed repetitive annotation task technique, since this method has demonstrated excellent object detection performance. YOLOv5 is compared to other YOLO variants, YOLOv3 and YOLOv4, to evaluate its performance.

Dataset
The images of oil palm fruit were collected in orchards located in Felda Tenang, Jerteh, Terengganu, Malaysia. A total of 400 images of oil palm FFB were collected for four different categories: unripe, underripe, ripe, and overripe. These images were then expanded to 600 images using data augmentation. The images were captured using a smartphone and a DJI Phantom 4 unmanned aerial vehicle (UAV) at 3472 × 4640 pixel and 3840 × 2160 pixel resolutions. The images were taken in red, green, and blue (RGB) color to identify the ripeness of the fruit. Figure 1 shows the drone used to capture images of tall oil palm trees, together with sample images from the drone dataset. All the images were resized to 416 × 416 pixel resolution to fit the input size commonly required by deep learning algorithms. This work was implemented in Python on the Google Colaboratory platform running on Windows 10. Utilizing Google's environment offers free access to a graphics processing unit and requires only some configuration. The system configuration has a 16 GB RAM Intel(R) Core(TM) i5 processor. At first, only 152 images of oil palm fruit were annotated manually using the LabelImg tool; the categories were drawn and classified manually with bounding boxes. Images of a variety of fruits, consisting of rambutan, dragon fruit, pineapple, and mangosteen, were downloaded from Google Images and Kaggle datasets. These fruit varieties were used to evaluate the capability of automatically annotating images in large datasets. A total of 3400 fruit images were used and only 400 images were manually annotated at first. The dataset used in this study is elaborated in Tables 2 and 3. The YOLO neural network architecture predicts a set of bounding boxes and class probabilities; Figure 2 is an illustration of the YOLO framework.
The fundamental idea is to split the input image into S × S grid cells and perform detections in each grid cell. Each cell predicts B bounding boxes as well as their confidence. The confidence may indicate whether or not an item exists in the grid cell, as well as the intersection over union (IoU) of the ground truth and predictions. Equation (1) is utilized to express confidence [41]: Confidence = Pr(Object) × IoU, (1) where Pr(Object) signifies the probability that the cell contains an object within the predicted bounding box and IoU is the intersection over union of the predicted bounding box and the ground truth. YOLO is a one-stage object detector that detects objects quickly from beginning to end. Images are downsized to a reduced resolution in YOLO algorithms and then a single CNN runs on the images, returning detection results based on the model's confidence threshold. YOLO's first version was developed to minimize a sum-of-squared-errors loss function. This optimization improves detection speed but decreases accuracy in comparison to state-of-the-art object detection models. YOLO comes in a variety of forms. The Darknet19 feature extraction backbone, which struggled with detecting small objects, was changed to Darknet53 in YOLOv3; residual blocks, skip connections, and up-sampling were introduced in that version, significantly improving the algorithm's accuracy. The feature extractor's backbone was changed to CSPDarknet53 in YOLOv4, which significantly improved the algorithm's speed and accuracy. YOLOv5 is the lightest version of the YOLO family and employs the PyTorch framework rather than the Darknet framework. YOLOv5 is mainly utilized in this study for object identification and categorization.
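As a sketch, the confidence of Equation (1) can be computed from the predicted objectness probability and the IoU of the predicted and ground-truth boxes; corner-coordinate (x1, y1, x2, y2) boxes are an assumption for illustration:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def confidence(pr_object, pred_box, truth_box):
    """Confidence = Pr(Object) x IoU(pred, truth), as in Equation (1)."""
    return pr_object * iou(pred_box, truth_box)
```

For instance, two unit boxes overlapping in a 1 × 1 region out of a total area of 7 give an IoU of 1/7, and a perfect overlap gives the objectness probability unchanged.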

Network Architecture
The entire architecture of YOLOv5, which includes the backbone, detection neck, and detection head, is depicted in Figure 3. (i) Backbone model The model backbone is mostly utilized to extract important information from an input image. The focusing layer is the first layer of the backbone network and is used to simplify the model calculation and speed up training. Second, concatenation is employed to integrate the four segments in depth. The output feature map is then created using a convolutional layer comprised of 32 convolution kernels. Finally, the results are fed into the next layer through the batch normalization layer and the activation functions. The BottleneckCSP module is the third layer of the backbone network, and it is intended to efficiently extract the image's detailed information. BottleneckCSP is essentially composed of a bottleneck module, which is a residual network architecture connecting convolutional layers. The bottleneck module's complete output is the sum of the convolutional branch's output and the initial input passed through the residual structure. The spatial pyramid pooling (SPP) module is the backbone network's ninth layer, and it is intended to boost the network's receptive field by adapting any size of feature map into a fixed-size feature vector. After being subsampled via three concurrent max-pooling layers, this feature map and the output feature map are concatenated in depth [37].
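The focusing layer's slice-and-concatenate step can be illustrated with a minimal sketch; the 2 × 2 block ordering below is a simplification of YOLOv5's strided slicing, but the space-to-depth effect (H × W × C becomes H/2 × W/2 × 4C, with no pixel information lost) is the same:

```python
def focus_slice(image):
    """Focus-layer style space-to-depth slicing: an H x W x C image
    becomes H/2 x W/2 x 4C by grouping each 2x2 spatial neighbourhood
    into the channel dimension, halving resolution without discarding
    any pixels. `image` is a nested list: image[y][x] -> C channel values."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            # Concatenate the four neighbouring pixels along the channel axis.
            row.append(image[y][x] + image[y][x + 1]
                       + image[y + 1][x] + image[y + 1][x + 1])
        out.append(row)
    return out
```

A 4 × 4 single-channel input thus becomes a 2 × 2 map with four channels per cell, which the subsequent 32-kernel convolution then processes.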
(ii) Neck model The model neck is primarily utilized in the generation of feature pyramids. It is formed of a feature-pyramid network (FPN) and a path-aggregation network (PAN). When it comes to object scaling, feature pyramids help models generalize effectively. As a consequence, it makes it easier to identify the same object in different sizes and scales [35].
(iii) Head model The front segment of the network accomplishes the full fusion of low-level and high-level features via the feature pyramid structure and PAN to generate rich feature maps. The final detection stage is essentially the responsibility of the model head, which employs anchor boxes to create final output vectors containing class probabilities, objectness scores, and bounding boxes.
YOLOv5's loss function is the sum of the bounding-box regression loss, the confidence loss, and the classification loss, determined as in Equation (2): loss = l_bx + l_j + l_s, (2) where l_bx is the regression function for the bounding box, l_j is the confidence loss function, and l_s is the classification loss function [42]. The terms l_bx, l_s, and l_j are calculated as shown in Equations (3)-(5), where h′ and w′ are the height and width of the target, x_i and y_i are the ground-truth coordinates of the target, λ_cd is the indicator function of whether cell i contains an object, λ_s is the classification loss coefficient, λ_noj is the no-object loss coefficient, c is the confidence score, and c_l is the class.

Transfer Learning
Deep learning has a complicated structure. Overfitting and performance issues arise as training data decrease, and performance improves as the amount of training data increases. As a result, in various deep learning applications, a transfer learning approach, which trains on a particular field using a system pre-trained in advance on abundant data from a related field, is extensively utilized [43]. The initial layers in the convolutional process extract general characteristics and, as the process moves toward the final layers, the features transition to ones more specialized to the dataset being trained on. Transfer learning has evolved as a result of these layer feature transfers: the characteristics the model learned on the main task are reused for a new task [44]. During deep learning training, the model is fed a large number of data and accumulates model weights and biases. These weights are then used in different network models, so a new model can begin with weights that have already been trained [45]. Figure 4 shows the process of transfer learning. Transfer learning is a handy technique for quickly retraining a model on fresh data without retraining the whole network. Instead, a portion of the initial weights is held constant, while the remainder of the weights are utilized to calculate loss and are updated by the algorithm.
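A minimal sketch of this freeze-and-fine-tune idea, with a toy parameter dictionary standing in for a real network (the layer names, gradients, and learning rate are illustrative assumptions, not the paper's actual training setup):

```python
def fine_tune(weights, frozen_layers, grads, lr=0.1):
    """One transfer-learning update step on a toy model.

    `weights` maps layer name -> list of pre-trained parameters.
    Layers in `frozen_layers` are held constant (they carry the general
    features learned on the source task); only the remaining layers are
    updated from `grads` by gradient descent. A sketch, not a trainer."""
    updated = {}
    for name, params in weights.items():
        if name in frozen_layers:
            updated[name] = list(params)  # frozen: copied unchanged
        else:
            # Standard gradient-descent step on the trainable layers only.
            updated[name] = [p - lr * g for p, g in zip(params, grads[name])]
    return updated
```

Freezing the backbone while updating only the head mirrors how a pre-trained detector is adapted to a new fruit dataset without retraining the whole network.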

Automatic Image Annotation
AIA has been a prominent study area in recent years since it has the ability to annotate enormous datasets. AIA approaches were developed to address the so-called semantic gap issue. In contrast to content-based image retrieval, automatic annotation may benefit image search by supplying high-level concepts automatically. The AIA approach is considered a quick method for text-based image retrieval; however, it is not yet sophisticated enough to extract complete semantic meanings. Many researchers have analyzed image annotation techniques in response to the increasing need for image annotation. Therefore, this study proposes a repetitive annotation method to automatically annotate large datasets, as shown in Figure 5. The first dataset is processed via transfer learning to annotate new dataset images. Next, the test images are automatically annotated, fed back into the system, and combined into a new dataset. This process increases the accuracy of object annotation. The process is repeated until optimum accuracy and high efficiency are obtained.
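The repetitive annotation loop can be sketched as follows; `train` and `annotate` are placeholder callables standing in for the actual YOLOv5 training and inference steps, and the confidence-threshold filter and stopping criterion are assumed details, not the paper's exact procedure:

```python
def repetitive_annotation(seed_set, unlabeled_batches, train, annotate,
                          conf_threshold=0.5, target_map=0.99):
    """Sketch of the repetitive annotation task: train on the current
    dataset, auto-annotate the next batch, keep confident predictions,
    merge them back into the dataset, and repeat until accuracy stops
    improving or the target mAP is reached.

    train(dataset) -> (model, mAP); annotate(model, image) -> (labels, conf).
    """
    dataset = list(seed_set)
    history = []
    for batch in unlabeled_batches:
        model, map_score = train(dataset)
        history.append(map_score)
        for image in batch:
            labels, conf = annotate(model, image)
            if conf >= conf_threshold:      # keep confident annotations only
                dataset.append((image, labels))
        if map_score >= target_map:         # optimum accuracy reached
            break
    return dataset, history
```

Each pass both grows the training set and improves the model, which matches the increasing accuracy reported for the second and third annotation rounds.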

Performance Metrics
Several parameters may be utilized to evaluate the effectiveness of the YOLO algorithm. The average precision (AP), recall, and mean average precision (mAP) are the performance metrics assessed in this study.
The expression of these evaluations is described as follows: AP denotes the area under the precision-recall (P-R) curve, i.e., the integral of precision with respect to recall, and mAP is the mean accuracy calculated by dividing the total of the AP values for all categories by the number of categories. The mAP calculates a score based on how accurately the detected bounding box matches the ground-truth box. In this study, mAP is denoted by the notation mAP@0.5, meaning that it is calculated at an IoU threshold of 0.5.
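As a sketch, AP as the area under the P-R curve and mAP as the mean over classes can be computed as follows; a simple rectangle rule over recall steps is assumed here, rather than the interpolated variants some benchmarks use:

```python
def average_precision(scored, total_positives):
    """AP as the area under the precision-recall curve.

    `scored` is a list of (confidence, is_true_positive) detections for
    one class; precision is accumulated over each recall increment."""
    scored = sorted(scored, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for _, is_tp in scored:
        tp, fp = tp + is_tp, fp + (not is_tp)
        precision = tp / (tp + fp)
        recall = tp / total_positives
        ap += precision * (recall - prev_recall)  # rectangle-rule area
        prev_recall = recall
    return ap

def mean_average_precision(per_class):
    """mAP: the mean of per-class AP values.
    `per_class` maps class name -> (scored detections, total positives)."""
    aps = [average_precision(s, n) for s, n in per_class.values()]
    return sum(aps) / len(aps)
```

Because each class contributes its own AP before averaging, mAP accounts for both false positives (through precision) and false negatives (through recall).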

Results and Discussion
Three versions of YOLO were developed to evaluate the training performance for the oil palm fruit dataset. Table 4 shows the results obtained for all version models after the first training dataset, including precision, recall, mAP, and training time. The accuracy generated by YOLOv5 is higher compared to the other versions. In fact, its training time is faster, at 0.609 h, compared to YOLOv3 and YOLOv4 at 0.896 h and 0.876 h, respectively. Figure 6a-c show the detection results of these three versions of YOLO for the classification of the ripeness of oil palm FFB. All the YOLO versions' learning rates were set to 0.01 and the model training batch size was set to 32. The IoU threshold was set to 0.2. For optimized rapid performance, the training epoch value was set to 100. The model was continuously trained and performed effectively. The last weight result for the model was stored after training and a test set of 1000 images was used to assess the model's performance. Next, the test images were deployed as a new dataset and combined with the first trained dataset. This method was utilized to increase the annotation accuracy for further test images. The YOLOv5 model, which was trained on the custom dataset, was fine-tuned. The first test dataset, consisting of 150 images, was classified for ripeness using the previously trained oil palm fruit detection algorithms. Precision, recall, and mAP@0.5 were used in the comparison. Furthermore, annotation speed was measured in frames per second (FPS) for each model to investigate the feasibility of using previously trained models in real-time applications. As the test images were unfamiliar to the training models, the metrics produced on this test dataset varied from the previously calculated metrics.
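For reference, the hyper-parameters above roughly correspond to a YOLOv5 training invocation along the following lines; the dataset YAML file name is a placeholder, and the IoU threshold is configured in the hyper-parameter file rather than on the command line:

```shell
# Sketch of the training run described above, using the
# ultralytics/yolov5 repository's train.py: 416x416 images,
# batch size 32, 100 epochs, pre-trained yolov5s weights for
# transfer learning. "oil_palm.yaml" is a hypothetical dataset config.
python train.py --img 416 --batch 32 --epochs 100 \
    --data oil_palm.yaml --weights yolov5s.pt
```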
The repetitive annotation method achieved 98.7% at the second annotation for oil palm FFB and was then tested on another 1000 new images; the results are shown in Figure 7a-h. The ripeness classification of oil palm fruit was successfully annotated automatically with a bounding box and accuracy value. The algorithm was also trained with 20, 40, 60, 80, and 100 epochs to examine the accuracy performance and the efficiency of the model. The results obtained for each epoch and each performance metric are shown in Figure 8a-f. The TensorBoard tool was used to visualize all of the network's statistical data. According to the figures, the accuracy value for mAP@0.5 at 100 epochs achieves a training accuracy of nearly 100%. Moreover, the network's validation loss decreases concurrently with the training loss. Given the higher accuracy and lower losses at 100 epochs, this study fixed the training at 100 epochs to generate high efficiency in object detection and annotation tasks. Table 5 shows the outcomes of the annotation precision, recall, mAP, and time comparison for the training, second annotation, and third annotation processes using repetitive annotation tasks. Each image's annotation time was calculated for all of the annotation methods. There were statistically significant differences among the training, second annotation, and third annotation processes. The average detection speeds for ripeness classification in the training process, second annotation, and third annotation were 0.55 ms, 0.43 ms, and 0.3 ms, respectively. The training time for the annotation process increased to generate a better result; however, the test-speed FPS outcome was faster. A faster test speed is significant in the application of real-time image capture and harvesting robots.
The repetitive annotation technique was then evaluated with the larger dataset, tested on a variety of fruits consisting of rambutan, dragon fruit, pineapple, and mangosteen. The epoch count was set to 30 for the training task. The annotation results with bounding boxes obtained after the second annotation process are shown in Figure 9. The performance curves for mAP, precision, recall, bounding box regression loss, and classification loss, depicted by red lines, are shown in Figure 10a-f. The outcomes of the annotation precision, recall, mAP, and time comparison for the various fruit dataset are shown in Table 6. The accuracy recorded at the second training for the variety of fruits was 99.5%. This accuracy was better than that for oil palm fruit due to the larger dataset used for the variety of fruits, thus producing better predictive performance. Moreover, a larger dataset increases the probability that the data include relevant information. There are unstable values for precision and recall. However, in the detection case, most evaluations are based on the mAP, because its value is produced by calculating the average precision for each class and then averaging across the classes. Moreover, mAP takes into consideration both false positives (FP) and false negatives (FN), and reflects the trade-off between precision and recall. Based on these features, mAP is a good measure for most detection applications. There is no accuracy improvement between the first and second annotations, which may occur because the model eventually reaches a point where enlarging the dataset will not improve accuracy. At this point, one can experiment with the learning rate or epoch values. Even though there is no accuracy enhancement, the time required to generate an annotation for a new test image decreases. This benefit may lower the time required to classify further huge numbers of images.
Since the accuracy value achieved is almost 100%, this result demonstrates the satisfactory performance of the repetitive annotation task method. The average detection speeds for fruit classification in the training process, first annotation, and second annotation were 0.44 ms, 0.32 ms, and 0.25 ms, respectively. Based on the findings, it can be demonstrated that the repetitive annotation task approach to automatic image annotation effectively annotates new images with high accuracy. With accurately annotated data, computer vision systems can identify and classify a variety of objects in a huge number of images. Moreover, the proposed method based on the YOLOv5 architecture performed well with the provided dataset. The classification of oil palm fruit maturity or ripeness determines the quality of the palm oil produced and its overall marketability. Using the proposed method, the classification of FFB could be employed to address an obstacle in fruit processing for oil production.

Conclusions
In the agricultural sector, robotics, drones, and AI-enabled machines are employed to accomplish a variety of jobs. All of this equipment is based on computer vision technology. When image annotation is performed for the agriculture industry, numerous crops and plants are annotated according to model requirements, such as their ripeness and disease. Therefore, this study proposed an automatic image annotation advancement approach that employs repetitive annotation tasks to automatically annotate an object. This study's dataset includes oil palm FFB and a variety of fruits, with a vast amount of data. The YOLOv5 model, a deep learning approach, was chosen for automatically annotating images using the repetitive annotation task technique. The developed method was tested on a large dataset to determine its annotation performance and accuracy. The findings reveal that the trained network can correctly classify objects in an image. Furthermore, to demonstrate the superiority of the suggested technique, two alternative YOLO versions, YOLOv3 and YOLOv4, were trained and evaluated on the same dataset, and their results were compared to those obtained by the proposed approach. The comparative results demonstrated the proposed method's efficacy and superiority for the task of fruit categorization. In addition, the repetitive annotation task method is able to increase efficiency in automatically annotating objects in images. The accuracy for the last training dataset reaches 98.7% for oil palm fruit and 99.5% for the variety of fruits. Therefore, this method is proven to be fast in annotating new images and successfully achieves high accuracy. Additionally, this automated method can greatly reduce the amount of time required to classify fruit, while also addressing the difficulty caused by a massive number of unlabeled images.
Beyond YOLO, the proposed repetitive annotation task technique is recommended for deployment with other deep learning techniques as the field of deep learning continues to evolve.