Ginger Seeding Detection and Shoot Orientation Discrimination Using an Improved YOLOv4-LITE Network

Abstract: A consistent orientation of ginger shoots when sowing ginger is more conducive to high yields and later harvesting. However, current ginger sowing mainly relies on manual methods, seriously hindering the ginger industry's development. Existing ginger seeders still require manual assistance in placing ginger seeds to achieve consistent ginger shoot orientation. To address the difficulty that existing ginger seeders have in automating seeding while ensuring consistent ginger shoot orientation, this study applies deep learning object detection to ginger and proposes a ginger recognition network based on YOLOv4-LITE, which, first, uses MobileNetv2 as the backbone network of the model and, second, adds coordinate attention to MobileNetv2 and uses Do-Conv convolution to replace part of the traditional convolution. After predicting the ginger and ginger shoots, this paper determines ginger shoot orientation by calculating the relative positions of the largest ginger shoot and the ginger. The mean average precision, Params, and giga Flops of the proposed YOLOv4-LITE on the test set reached 98.73%, 47.99 M, and 8.74, respectively. The experimental results show that YOLOv4-LITE achieved ginger seed detection and ginger shoot orientation calculation, and that it provides a technical guarantee for automated ginger seeding.


Introduction
Ginger is a perennial herb whose roots are often made into spices and herbs [1,2]. It originated in Asia and is now widely grown in various regions, with China being the world's largest ginger producer [3,4]. Before sowing, growers need to break the ginger and cultivate the ginger shoots, retaining one to two ginger shoots for each ginger seed [5]. After plowing, fertilizing, and trenching, the ginger seeds are placed in a trench [6,7] with a consistent ginger shoot orientation. In China, ginger shoots generally face southwest, which helps the shoots accumulate heat: ginger is a crop with a high accumulated temperature requirement, and effective temperature accumulation is conducive to high ginger production [8]. Furthermore, consistent ginger shoot orientation ensures that all gingers grow parallel to each other, which prevents neighboring ginger seeds from crowding together and thereby reducing the quality and yield of ginger. Ginger sowing mainly relies on manual labor [9], while mechanical sowing is less prevalent and still requires manual assistance. At present, mechanical ginger seeding is commonly realized in the following way: growers first place the ginger seeds in the seed holding device, and the seeds then fall into the trench with the help of different forms of conveying device.
We introduce a simple and efficient CA (coordinate attention) [47] mechanism. In addition, we introduce Do-Conv (depth-wise over-parameterized convolution) [48] to speed up the network's training and facilitate the convergence of the model. Finally, the recognition targets of this network are ginger shoots and ginger, which differ greatly in recognition difficulty and target size. Therefore, we introduce focal loss to address the imbalance between positive and negative samples and between simple and difficult samples. The above improvements provide a technical guarantee for the fast and accurate discrimination of the orientation of ginger shoots in ginger seed images. The rest of the paper is organized as follows: Section 2 describes the creation of the dataset and the improvements based on the YOLOv4 network; Section 3 describes the tuning of the model parameters and the experimental validation of the proposed method; and Section 4 presents the conclusions of this work.

Data Acquisition and Annotation
This paper uses ginger seed samples collected from a ginger plantation in Anqiu, Shandong, China (36.47847° N, 119.2189° E) on 25 April 2021. The ginger seeds are of the 'Baby Ginger' variety and were germinated for 15 days. To accelerate the training and debugging of the ginger seed recognition model, ginger seed images were captured using the device shown in Figure 1, which includes a CMOS industrial camera, fill light, camera stand, computer, etc. The camera was an MV-EM1400C manufactured by Micro-vision, with a resolution of 3288 × 3288 pixels; the fixed-focus lens was an M1620-MPW2, the shooting distance was 30 cm, and a total of 500 images were stored in "JPG" file format. In addition, a total of 100 high luminance, regular luminance, and low luminance images were acquired separately to test the recognition ability of the model. Meanwhile, 500 ginger seed images from a previous study by Hou et al. [27], with an image size of 5472 × 3672 pixels, were used to enrich the ginger dataset.
As shown in Figure 1, LabelImg (https://github.com/tzutalin/labelImg, accessed on 9 August 2021) was used to label the ginger shoots and ginger separately in "xml" file format, so that the orientation of the ginger shoots could be determined; the labeled boxes are tightly aligned with the edges of the ginger and ginger shoots. In addition, the annotation information of each image was stored in a "txt" file, including the image path, the annotation box coordinates (image coordinates of the upper left and lower right corners), and the object category. After image annotation, the 1000 images were randomly divided into training and validation sets in a ratio of 80% to 20%, and the remaining 100 images were used for model testing. The validation set was used for adjusting the hyperparameters and monitoring the model for overfitting, and the test set was used for model evaluation, with no duplication between the two, to ensure the accuracy of the model evaluation results. Finally, the ginger images, annotation files, and category labels were stored in PASCAL VOC format for training the ginger seed recognition network.
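The 80%/20% random division described above can be sketched as follows. This is a minimal illustration rather than the script used in this study; the function name `split_dataset`, the fixed random seed, and the use of integer image identifiers are assumptions made for the example.

```python
import random


def split_dataset(image_ids, train_fraction=0.8, seed=0):
    """Randomly split image identifiers into training and validation lists.

    The seed is a hypothetical choice that only makes the split reproducible.
    """
    ids = list(image_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_train = int(round(len(ids) * train_fraction))
    return ids[:n_train], ids[n_train:]


if __name__ == "__main__":
    # 1000 annotated images; the 100 test images are held out separately.
    train_ids, val_ids = split_dataset(range(1000))
    print(len(train_ids), len(val_ids))
```

Keeping the 100 test images outside this split is what guarantees that no test image is seen during training or hyperparameter tuning.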

Data Enhancement
This paper used online data enhancement to expand the original ginger seed images, in order to improve the generalization ability of the model and compensate for the insufficient number of samples. Before each training batch, the data-enhanced images were scaled to 416 × 416 pixels, and then four images were randomly cropped and stitched into one image using the Mosaic algorithm to serve as training data. Mosaic greatly enriches the image background and also reduces the demand for GPU memory. The data enhancement methods are specified as follows: (1) Horizontal flip, mirror flip, and affine transformation were applied to the images with a probability of 0.5, to reduce the effect of different ginger positions on the recognition results. (2) Image brightness was scaled by a factor of 1.2 or 0.8 with a probability of 0.5, to reduce the effect of different illumination levels in the field on the recognition results. (3) Image contrast was scaled by a factor of 1.2 or 0.8 with a probability of 0.5, to better express the grayscale, sharpness, and texture details of the ginger images.
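The probabilistic flip, brightness, and contrast operations listed above can be sketched on a small grayscale image stored as nested lists. This is an illustrative sketch only: the function names are hypothetical, contrast is assumed to be scaled around the image mean, and the affine transformation and Mosaic stitching are omitted.

```python
import random


def clamp(value):
    """Round to an integer pixel value and clip it to [0, 255]."""
    return max(0, min(255, int(round(value))))


def hflip(img):
    """Mirror each row of the image."""
    return [row[::-1] for row in img]


def scale_brightness(img, factor):
    """Multiply every pixel by the given factor."""
    return [[clamp(p * factor) for p in row] for row in img]


def scale_contrast(img, factor):
    """Stretch pixel values around the image mean (an assumed definition)."""
    mean = sum(sum(row) for row in img) / (len(img) * len(img[0]))
    return [[clamp(mean + (p - mean) * factor) for p in row] for row in img]


def augment(img, rng, p=0.5):
    """Apply each enhancement independently with probability p."""
    if rng.random() < p:
        img = hflip(img)
    if rng.random() < p:
        img = scale_brightness(img, rng.choice((1.2, 0.8)))
    if rng.random() < p:
        img = scale_contrast(img, rng.choice((1.2, 0.8)))
    return img


if __name__ == "__main__":
    image = [[10, 50, 90], [130, 170, 210]]
    print(augment(image, random.Random(0)))
```

In a detection pipeline, a geometric operation such as the flip must also be applied to the annotation box coordinates, which this sketch does not show.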

Overall Technical Route
To achieve accurate real-time detection of ginger and ginger seeds, the technical solutions proposed in this study are as follows:
1. Construction and training of a YOLOv4-LITE network. This study used the MobileNetv2 network to replace the original CSPDarknet53, to solve the model redundancy caused by the more complex backbone network.
2. The introduction of an attention mechanism and Do-Conv convolution. This study introduced an attention mechanism and Do-Conv into YOLOv4-LITE, to improve the recognition of smaller ginger shoots.
3. Model performance analysis and experimental validation. The performance of the improved model was tested, and the improvements proposed in this study were verified and analyzed sequentially.

YOLOv4 Model
As an end-to-end one-stage object detection algorithm based on regression theory, a YOLO network can directly predict the bounding box and class of an object. The YOLOv4 network is based on the original YOLO and is optimized for data processing, backbone network, activation function, loss function, and other aspects to improve the detection performance and inference speed of the model. Its training process is shown in Figure 2, which includes the following five parts:

1. Based on Darknet53, CSPDarknet53 borrows the cross-stage partial (CSP) structure from CSPNet and adds a CSP to each of the five residual blocks, which enhances the learning ability of the CNN and maintains a high performance while making the network lighter. CBL (convolution, batch normalization, and Leaky ReLU) is the most common module in YOLOv4 and consists of a convolutional (Conv) layer, a batch normalization layer, and an activation layer.
2. A spatial pyramid pooling (SPP) structure is added after CSPDarknet53, which effectively enlarges the receptive field of the backbone network. It applies max pooling with kernel sizes of 1 × 1, 5 × 5, 9 × 9, and 13 × 13 to obtain four feature maps at different scales, and then fuses them by concatenation.
3. In CNNs, shallow features contain richer target location information, such as contours and textures, but less semantic information, whereas deeper features contain richer semantic information but only coarse location information. Therefore, the network adopts a feature pyramid network (FPN) structure, which passes the deep semantic information on through up-sampling and fuses it with the location information of the shallow layers.
4. Borrowing from the bottom-up path augmentation method in PANet [49], two path aggregation network (PAN) structures are added after the FPN; they transmit the low-level location information by down-sampling, thus fusing it with the semantic information of the higher levels.
5. The YOLOv4 loss function includes the bounding box regression loss ($L_{coord}$), which is based on the complete intersection over union (CIoU) loss ($L_{CIoU}$), the confidence loss ($L_{conf}$), and the classification loss ($L_{cls}$). In the loss function, $\lambda_{coord}$ and $\lambda_{noobj}$ are penalty coefficients; $s^2$ is the number of grids in the feature map; $B$ is the number of anchor boxes per grid; $i$ denotes the $i$-th grid and $j$ the $j$-th anchor box; $(w_i, h_i)$ and $(\hat{w}_i, \hat{h}_i)$ are the width and height of the ground truth and the prediction, respectively; and BCE(·) represents the binary cross-entropy loss.
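As an illustration of the CIoU term on which the bounding box regression loss is based, the sketch below computes the commonly used form 1 − IoU + ρ²/c² + αv for two boxes. It is not taken from this paper's implementation: the function name `ciou_loss`, the corner-format boxes (x1, y1, x2, y2), and the small `eps` stabilizer are assumptions made for the example.

```python
import math


def ciou_loss(box_pred, box_gt, eps=1e-9):
    """CIoU loss between two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = box_pred
    gx1, gy1, gx2, gy2 = box_gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # Squared distance between the box centers.
    rho2 = (((px1 + px2) - (gx1 + gx2)) ** 2
            + ((py1 + py2) - (gy1 + gy2)) ** 2) / 4.0

    # Squared diagonal of the smallest box enclosing both boxes.
    enclose_w = max(px2, gx2) - min(px1, gx1)
    enclose_h = max(py2, gy2) - min(py1, gy1)
    c2 = enclose_w ** 2 + enclose_h ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (
        math.atan(gw / (gh + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - iou + rho2 / c2 + alpha * v


if __name__ == "__main__":
    print(ciou_loss((0, 0, 100, 100), (10, 10, 110, 110)))
```

Unlike a plain IoU loss, this form still provides a useful signal when the boxes do not overlap, because the center-distance term remains non-zero.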

YOLOv4-LITE Network Design
Based on the features extracted by the backbone network, the YOLO network predicts object bounding boxes and categories. YOLOv4 uses CSPDarknet53, with the final pooling layer, fully connected (FC) layer, and softmax layer removed, as the backbone of the feature extraction network. However, with its 104 network layers, YOLOv4 has difficulty achieving a high inference speed on embedded devices, so a lightweight network is required to replace the original complex backbone network. Therefore, this paper designed a YOLOv4-LITE network based on YOLOv4. MobileNetv1 is a lightweight network proposed by Google that uses depth-wise separable convolution instead of traditional convolution. This paper used its successor, MobileNetv2, as the backbone network of YOLOv4-LITE, which reduces the model size while maintaining its performance. The network parameters of YOLOv4-LITE are shown in Table 1. As can be seen from Table 1, MobileNetv2 mainly consists of two forms of inverted residual block (IRB) that use depth-wise (DW) convolution and point-wise (PW) convolution to extract the image depth features, thus greatly reducing the time complexity and space complexity of the convolution operations. Figure 3 is a schematic diagram of the improved backbone network. As shown in Figure 3, when stride = 1 in the DW convolution, the block is inverted residual block 1 (IRB1); when stride = 2 in the DW convolution, the block is inverted residual block 2 (IRB2). Because the DW convolution in IRB2 also down-samples the feature map, the input and output dimensions differ, and no shortcut connection is used.
Both kinds of IRB are composed of a PW1 convolution, a DW convolution, and a PW2 convolution. The PW1 convolution consists of a 1 × 1 convolution, BN, and ReLU6, and it maps the features from a low-dimensional space to a high-dimensional space, which is beneficial for feature extraction; the DW convolution consists of a 3 × 3 convolution, BN, and ReLU6, and performs the feature extraction; the PW2 convolution consists of a 1 × 1 convolution and BN, and maps the high-dimensional space back to the low-dimensional space. As the ReLU6 activation function would destroy the features learned by the CNN in the low-dimensional space, the PW2 convolution is not followed by an activation function.
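The saving from depth-wise separable convolution, and the size of one PW1, DW, PW2 block, can be made concrete with a small parameter count. The sketch below is illustrative only: biases and BN parameters are ignored, the function names are hypothetical, and the expansion factor of 6 is the MobileNetv2 default assumed here rather than a value read from Table 1.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a traditional k x k convolution."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights of a k x k depth-wise convolution plus a 1 x 1 point-wise one."""
    return k * k * c_in + c_in * c_out


def inverted_residual_params(c_in, c_out, expansion=6, k=3):
    """Weights of PW1 (expand), DW (k x k), and PW2 (project) in one IRB."""
    hidden = c_in * expansion
    pw1 = c_in * hidden
    dw = k * k * hidden
    pw2 = hidden * c_out
    return pw1 + dw + pw2


if __name__ == "__main__":
    print(standard_conv_params(32, 64))        # traditional 3 x 3 convolution
    print(depthwise_separable_params(32, 64))  # depth-wise separable version
    print(inverted_residual_params(32, 32))    # one inverted residual block
```

For 32 input and 64 output channels, the separable form needs roughly one eighth of the weights of the traditional 3 × 3 convolution, which is the source of the reduction in model size.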

Coordinate Attention Module
In this paper, an attention [50] mechanism is introduced to improve the model accuracy by selecting, from a large amount of feature information, the information most critical to the current recognition task. This is essentially similar to human selective vision, which quickly scans the global image to find the information that needs to be focused on, while suppressing information that is not helpful for the current task. The attention mechanism is applied after the DW convolution in the IRB of MobileNetv2. Figure 4 shows schematic diagrams of the different attention mechanisms, with an input feature map of size Height (H) × Width (W) × Channel (C).
Figure 4b shows a schematic structure of SE attention. Because a convolution operation only integrates the spatial and channel information within a local receptive field, it does not capture enough global information across channels. Therefore, first, global average pooling (GAP) is used to compress feature 1 (H × W × C), which has global spatial information, into feature 2 (1 × 1 × C), which has a global receptive field. Second, two FC layers are used to reduce the complexity and improve the generalization ability of the network: the first FC reduces the dimensionality of the feature, and the second FC restores it. Third, feature 3 (1 × 1 × C), which characterizes the importance of each channel, is obtained after the sigmoid activation function. Finally, feature 3 is multiplied channel by channel with feature 1 to obtain the final output re-weight (H × W × C), which is equivalent to assigning a weight to each channel of feature 1, thus giving greater weight to information helpful for the task at hand. In conclusion, SE attention improves the sensitivity of the network to channel features and improves performance at a low computational cost, but it ignores the importance of the location feature information.
Figure 4c shows a schematic structure of CBAM attention, which includes the channel attention module (CAM) and the spatial attention module (SAM). On the one hand, CAM is similar to SE attention: it first compresses the input feature 1 (H × W × C) into feature 2 (1 × 1 × C) using both GAP and global max pooling (GMP), that is, it adds a GMP branch with respect to SE attention and thus enriches the pooled features. Second, the channel dimension of feature 2 is reduced to C/16 using a convolution layer and then raised again using a 1 × 1 convolution to obtain feature 3 (1 × 1 × C). Third, feature 4 (1 × 1 × C), which characterizes the importance of the channel feature information, is obtained after the sigmoid activation function. Finally, Re-weight 1 is obtained by multiplying feature 4 with feature 1. On the other hand, in the SAM, Re-weight 1 is first compressed along the channel direction into feature 5 (H × W × 1) and feature 6 (H × W × 1), using GAP and GMP, respectively, and the two are concatenated along the channel direction to obtain feature 7 (H × W × 2). Second, the channel dimension of feature 7 is reduced to 1 using a 7 × 7 convolution, resulting in feature 8 (H × W × 1). Third, feature 9 (H × W × 1), which characterizes the importance of the location feature information, is obtained after sigmoid activation. Finally, the final output Re-weight 2 (H × W × C) is obtained by multiplying feature 9 with Re-weight 1, which is equivalent to assigning a weight to the location features of Re-weight 1, so that the location information helpful for the current task has greater weight.
The above two attention mechanisms are widely used in lightweight networks and have achieved good results. However, SE attention only considers the channel feature information and ignores the location information, and CBAM attention only introduces local location information through global pooling. Therefore, this paper adopts a coordinate attention mechanism, in which location information is embedded into channel attention, to avoid adding a large computational overhead while ensuring better attention results for MobileNetv2. Figure 4d shows a schematic structure of CA attention, which consists of coordinate information embedding and coordinate attention generation. First, each channel is encoded along the two spatial directions, vertical and horizontal, using average pooling with pooling kernel sizes (H, 1) and (1, W), respectively, to avoid the loss of valuable location information that global pooling may cause in channel attention. This compresses the input feature map 1 (H × W × C) into a pair of direction-aware features, feature 2 (H × 1 × C) and feature 3 (1 × W × C), which have a global receptive field and precise location information. Second, feature 4 (1 × (H + W) × (C/16)) is obtained by concatenating features 2 and 3 and reducing the feature dimension using a 1 × 1 convolution. Then, feature 4 is split along the spatial dimension into features 5 and 6, and their feature dimensions are raised using 1 × 1 convolutions; these operations greatly reduce the model complexity and computational overhead. Finally, features 5 and 6 are multiplied with feature 1 after sigmoid activation to obtain the final output Re-weight 2 (H × W × C).
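The two direction-aware pooling steps of CA and the final re-weighting can be sketched on a tiny feature map. This is a simplified illustration, not the module used in the network: the learned 1 × 1 convolutions (channel reduction and restoration) are deliberately omitted, so the attention weights are simply the sigmoid of the pooled values, and the function names and the [channel][row][column] nested-list layout are assumptions.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def coordinate_pool(x):
    """Average-pool each channel along the width and along the height.

    x is indexed as [channel][row][column].
    Returns (pooled_h, pooled_w) with shapes C x H and C x W.
    """
    pooled_h = [[sum(row) / len(row) for row in channel] for channel in x]
    pooled_w = [[sum(col) / len(col) for col in zip(*channel)] for channel in x]
    return pooled_h, pooled_w


def coordinate_reweight(x):
    """Re-weight x by row-wise and column-wise attention (no learned layers)."""
    pooled_h, pooled_w = coordinate_pool(x)
    output = []
    for channel, ph, pw in zip(x, pooled_h, pooled_w):
        a_h = [sigmoid(v) for v in ph]  # one weight per row
        a_w = [sigmoid(v) for v in pw]  # one weight per column
        output.append([[channel[i][j] * a_h[i] * a_w[j]
                        for j in range(len(a_w))]
                       for i in range(len(a_h))])
    return output


if __name__ == "__main__":
    feature = [[[1.0, 3.0], [5.0, 7.0]]]  # C = 1, H = 2, W = 2
    print(coordinate_pool(feature))
    print(coordinate_reweight(feature))
```

Because each position is weighted by one row factor and one column factor, the output keeps the H × W × C shape while encoding where along each axis the informative responses lie.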

Do-Conv Convolution
In general, the expressiveness of a network is increased by deepening it with a combination of linear convolutional layers and nonlinear layers, since successive linear layers can be replaced by a single linear layer and tend to increase overfitting. This paper replaces part of the traditional convolution in FPN + PANet with Do-Conv convolution, which speeds up the network training and promotes model convergence.
The operation of Do-Conv is shown in Figure 5, where $*$ denotes conventional convolution and $\circ$ denotes depth-wise convolution. In model training, the depth-wise convolution of the weight $D^{T} \in \mathbb{R}^{(M \times N) \times C_{in}}$ and the weight $W \in \mathbb{R}^{D_{mul} \times C_{in} \times C_{out}}$ is first computed to obtain the new weight $W' \in \mathbb{R}^{M \times N \times C_{in} \times C_{out}}$, and then the conventional convolution of $W'$ and the input features $P$ is calculated to get the final output $O$; it should be noted that $D_{mul} \geq M \times N$. On the basis of traditional convolution, Do-Conv adds an additional depth-wise convolution to form an over-parameterized convolution layer, which increases the number of parameters compared to conventional convolution. Although the number of parameters increases during training, both conventional convolution and depth-wise convolution are linear operations, so the multi-layer linear operations of the over-parameterized convolution can be folded into a single-layer convolution at inference time, and the additional parameters add no computation to inference.
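The folding of the two linear operations into one kernel can be checked numerically. The sketch below is a deliberately reduced example for a single input channel and a single 2 × 2 patch, so that both operations become small matrix products; the matrices `D_T` and `W`, the patch values, and the function names are hypothetical, with $D_{mul} = 5 \geq M \times N = 4$.

```python
def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Matrix-matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def two_step_output(w, d_t, patch):
    """Training-time view: depth-wise step followed by the conventional step."""
    return matvec(w, matvec(d_t, patch))


def fold_kernel(w, d_t):
    """Inference-time view: combine both linear steps into one kernel."""
    return matmul(w, d_t)


# Hypothetical weights: D_T is D_mul x (M*N) = 5 x 4, W is C_out x D_mul = 2 x 5.
D_T = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, -1, 2, 0]]
W = [[1, 2, 0, -1, 3], [0, 1, 1, 2, -2]]
PATCH = [2, -1, 0, 4]  # flattened 2 x 2 input patch

if __name__ == "__main__":
    folded = fold_kernel(W, D_T)  # C_out x (M*N) = 2 x 4
    print(two_step_output(W, D_T, PATCH))
    print(matvec(folded, PATCH))
```

Both print statements give the same output vector, while the folded kernel has the size of an ordinary convolution kernel, which is why the over-parameterization costs nothing at inference.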

Focal Loss Function
In the YOLOv4-LITE network training, it is necessary to first set a suitable intersection over union (IoU) threshold. When the IoU between an anchor box and every ground truth target is less than the IoU threshold, the anchor box is regarded as a negative sample; when the centroid of a target falls in a grid, the anchor box in that grid with the maximum IoU with the target is a positive sample. In one-stage object detection, positive and negative samples are imbalanced during training, so the loss function is dominated by the many negative samples and no longer reflects the quality of the predictions well. Therefore, this paper introduces a focal loss function to address the imbalance between positive and negative samples and between simple and difficult samples. The focal loss function is calculated as follows:
$$FL(p_t) = -\alpha (1 - p_t)^{\gamma} \log(p_t) \quad (2)$$
where $y$ is the category label and $p_t$ ($p_t \in [0, 1]$) is the probability that the $t$-th sample is $y$.
Regarding the imbalance of positive and negative samples, most of the area of a ginger image is background, and the number of positive samples (ginger and ginger shoots) is much lower than the number of negative samples (background). Thus, the paper adds a weighting factor $\alpha$ ($\alpha \in [0.5, 1)$) to the cross-entropy loss function so that the smaller number of positive samples takes up more weight, and thus the model can learn more helpful information. Regarding the imbalance between simple and difficult samples, ginger samples are easier to identify than ginger shoot samples. Focal loss borrows the idea of OHNM by adding a weighting factor $(1 - p_t)^{\gamma}$ to the loss function; $\gamma$ controls how strongly the loss of simple samples is reduced, and its value range is generally [0, 5]. For instance, when $y_t = 1$, the $p_t$ of a simple sample is close to 1, so $(1 - p_t)^{\gamma}$ is close to 0, whereas $(1 - p_t)^{\gamma}$ of a difficult sample is close to 1. Adding $(1 - p_t)^{\gamma}$ therefore makes the difficult samples have a more significant impact on the loss function. If $\gamma$ is too small, it does not sufficiently raise the relative loss of difficult samples; if $\gamma$ is too large, it is not conducive to model training. In this paper, $\gamma = 2$ and $\alpha = 0.75$ were used.
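The effect of the two weighting factors can be illustrated numerically with the standard focal loss form $-\alpha (1 - p_t)^{\gamma} \log(p_t)$ and the values γ = 2 and α = 0.75 given above. This is a minimal per-sample sketch; the function name and the `eps` clipping are assumptions for the example, and the handling of negative samples in the actual network is not shown.

```python
import math


def focal_loss(p_t, alpha=0.75, gamma=2.0, eps=1e-12):
    """Focal loss of one sample whose true class has predicted probability p_t."""
    p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1):
        print(p, focal_loss(p))
```

A simple sample with $p_t = 0.9$ contributes well over a thousand times less than a difficult sample with $p_t = 0.1$, whereas with plain cross-entropy the ratio would only be about 22.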

Identification Method of Ginger Shoot Orientation
First, to discriminate the orientation of ginger shoots, the locations of the ginger shoots and the ginger are predicted using the ginger identification network. Second, the area of the ginger shoot prediction box is used as the selection criterion, and only the largest ginger shoot is used to discriminate the shoot orientation. As shown in Figure 6, a rectangular coordinate system is established with the center point O of the ginger prediction box as the origin; the center point of the ginger shoot prediction box is A (dx, dy), and the orientation angle of the ginger shoot is θ, where "+" indicates counterclockwise rotation and "−" indicates clockwise rotation.
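The selection of the largest shoot and the angle calculation can be sketched as follows. Equations (3) and (4) are not reproduced in this text, so the sketch assumes that the angle is the arctangent of A relative to O with the image y axis flipped (image rows grow downward); under that assumption it agrees with the two worked examples reported in the results (−81.7° and −30.2°). The function names and the box sizes in the example are hypothetical; only the box centers come from the results.

```python
import math


def box_area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


def box_center(box):
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0


def shoot_orientation(ginger_box, shoot_boxes):
    """Angle in degrees of the largest shoot relative to the ginger center.

    Positive values are counterclockwise, negative values clockwise.
    """
    largest = max(shoot_boxes, key=box_area)
    ox, oy = box_center(ginger_box)
    ax, ay = box_center(largest)
    dx = ax - ox
    dy = oy - ay  # flip the y axis: image coordinates grow downward
    return math.degrees(math.atan2(dy, dx))


if __name__ == "__main__":
    # Centers (204, 207) and (220, 317), as in the first example of the results.
    ginger = (104, 107, 304, 307)
    shoots = [(200, 297, 240, 337)]
    print(round(shoot_orientation(ginger, shoots), 1))
```

Using `atan2` instead of a plain arctangent of dy/dx avoids a division by zero when the shoot lies directly above or below the ginger center.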

Method of Discriminating Ginger Shoot Orientation
The paper uses precision (P) and recall (R) as evaluation criteria to assess the model performance. In addition, the F1 score is used to balance precision and recall. They are defined as shown in Equations (5)-(7):
$$P = \frac{TP}{TP + FP} \quad (5)$$
$$R = \frac{TP}{TP + FN} \quad (6)$$
$$F1 = \frac{2 \times P \times R}{P + R} \quad (7)$$
where true positive (TP) means that the prediction result and the ground truth are both positive samples; false positive (FP) indicates that the prediction result is positive and the ground truth is negative; and false negative (FN) means that the prediction result is negative and the ground truth is positive. However, precision and recall take different values during model testing depending on the confidence threshold chosen for the task at hand, whereas average precision (AP), the average of the precision under different recalls, measures the inherent properties of the model. In this study, since there are two categories, the ginger shoot and ginger seeds, mean average precision (mAP) was adopted to measure the model performance. The equations of AP and mAP are as follows:
$$AP = \int_{0}^{1} P(R)\, dR \quad (8)$$
$$mAP = \frac{1}{m} \sum_{i=1}^{m} AP_i \quad (9)$$
where $m$ is the number of categories and $R$ is the integration variable used to calculate the area under the P-R curve. AP50 is the AP value when the IoU threshold is 0.5; therefore, mAP50 is the average of AP50 over all categories. Similarly, mAP75 is the average of AP75 over all categories, mAP50:95 is the average of AP50:95, and AP50:95 is the average of the ten values AP50, AP55, AP60, . . . , AP95.
In addition, the model performance was measured using the model size, Params, and giga Flops (GFlops) [50,51], where Params is the total number of trainable parameters in the network, and GFlops is the amount of computation in the network. The lower the GFlops, the less computation and execution time the model needs.
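The metrics above can be computed with a few lines of code. The sketch below is illustrative: the function names and the example counts are hypothetical, and the area under the P-R curve is approximated with the all-point interpolation commonly used for PASCAL VOC style evaluation, which is an assumption about the integration scheme rather than a statement of how AP was computed in this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def average_precision(recalls, precisions):
    """Area under the P-R curve; recalls must be sorted in ascending order."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make the precision envelope monotonically non-increasing.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Average of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    print(precision_recall_f1(tp=8, fp=2, fn=2))
    print(average_precision([0.5, 1.0], [1.0, 0.5]))
    print(mean_average_precision([0.75, 0.25]))
```

Repeating `average_precision` at IoU thresholds from 0.5 to 0.95 in steps of 0.05 and averaging the ten results gives AP50:95 as defined above.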

Results and Discussion
The experimental environment of the YOLOv4-LITE network during model training is shown in Table 2. The model optimizer was SGD (stochastic gradient descent), the momentum was 0.95, the weight decay coefficient was $5 \times 10^{-3}$, the batch size was 16, the number of training epochs was 200, and the model weights were saved once every 10 epochs. At the beginning of the network training, the learning rate was increased linearly from 0 to $1 \times 10^{-4}$ over the first 20 epochs, to make the network converge quickly to a better initial state, and it was then reduced to $1 \times 10^{-6}$ using the cosine annealing decay method; the formula and diagram of the learning rate are shown below.
$$lr = \begin{cases} \dfrac{lr_{max}}{T_{warm}}\, t, & t \leq T_{warm} \\ lr_{min} + \dfrac{1}{2}\left(lr_{max} - lr_{min}\right)\left(1 + \cos\left(\dfrac{t - T_{warm}}{T_{total}}\, \pi\right)\right), & t > T_{warm} \end{cases} \quad (10)$$
where $t$ and $T_{warm}$ are the current epoch and the number of warmup epochs, respectively; $lr_{min}$ and $lr_{max}$ represent the minimum and maximum values of the learning rate, respectively; and $T_{total}$ represents the total number of epochs.
As shown in Figure 7b, the dimensions of the labeled boxes were clustered using the K-means algorithm before network training. K-means uses an IoU-based metric with the objective of minimizing the distance between the labeled boxes and the clustered boxes, and it resulted in nine clustered boxes, (24, 24), (33, 42), (39, 57), (51, 64), (46, 84), (217, 172), (287, 204), (237, 252), and (329, 285), which were then used to initialize the anchor boxes in the ginger recognition network.
During network training, a multi-scale training method was used to improve the model generalization: the input image size was changed randomly every 10 batches, while ensuring that the image edge length was a multiple of 32. Moreover, this paper also used mixed-precision training, based on single precision and half precision, to speed up the network training and reduce the GPU memory usage.
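The warmup and cosine annealing schedule can be sketched with the values stated above (20 warmup epochs, 200 total epochs, a maximum of 1 × 10⁻⁴, and a minimum of 1 × 10⁻⁶). The sketch assumes that the cosine argument carries a factor of π, as in standard cosine annealing; the function name and default arguments are choices made for this example.

```python
import math


def learning_rate(epoch, lr_max=1e-4, lr_min=1e-6,
                  warmup_epochs=20, total_epochs=200):
    """Linear warmup followed by cosine annealing decay."""
    if epoch <= warmup_epochs:
        return lr_max * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 10, 20, 100, 200):
        print(epoch, learning_rate(epoch))
```

The two branches meet at the end of warmup, where both give the maximum learning rate, so the schedule has no jump at epoch 20.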

Result Analysis
The loss function evaluates a model by measuring the error between the predicted and the true values. Figure 8a shows the loss curve of YOLOv4-LITE, for which the total training time was 3.5 h. As seen in Figure 8a, the loss value dropped rapidly from 1441.33 to 3.71 in the first 40 epochs and then slowly oscillated downward as the epochs increased. Eventually, the loss value stabilized at around 1.80, at which point the model had converged. Figure 8b shows the P-R curves for an IoU threshold of 0.5. The P-R curves for both ginger shoots and ginger enclosed almost the entire plot area, which indicated that the model had achieved a high average precision. It can also be clearly seen from Figure 8b that YOLOv4-LITE performed better at recognizing ginger than at recognizing the smaller ginger shoots.

The test results of the validation set were analyzed with both the confidence threshold (conf-thresh) and the IoU threshold set to 0.5. The test results are shown in Table 3; the number of ground-truth boxes in the test set was 435. The improved model showed gains in precision, recall, and F1-score: precision increased by 0.49%, recall by 1.15%, and F1-score by 0.82%.
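For reference, precision, recall, and F1-score follow from the TP, FP, and FN counts in the standard way, as sketched below. The counts in the example are hypothetical and are not the values of Table 3; they were chosen only so that TP + FN equals the 435 ground-truth boxes mentioned above.

```python
def detection_scores(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only (TP + FN = 435 ground-truth boxes).
precision, recall, f1 = detection_scores(tp=425, fp=5, fn=10)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```

Because all three scores depend only on these counts, more TP together with fewer FP and FN raises every one of them.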
The analysis of TP, FP, and FN in the test results revealed that the improved model performance resulted mainly from the increase in TP and the decrease in FP and FN. As shown in Figure 9a,b, this paper compared the original and manually labeled ginger seed images to better evaluate the recognition performance of the YOLOv4-LITE network. The final recognition results are shown in Figure 9c, where the test images are all 416 × 416 pixels, and the green and white rectangular boxes represent the predicted boxes for the ginger and ginger shoots, respectively. As can be seen in Figure 9, the ginger seed images were well recognized. The upper image had only one ginger shoot, and the center points of the prediction boxes for the ginger and the shoot were (204, 207) and (220, 317), respectively. Equations (3) and (4) gave δ = −81.7°, so the ginger seed should be rotated clockwise by 81.7° to ensure consistent ginger shoot orientation. The lower image had two ginger shoots, as shown by the red arrows, and only the ginger shoot with the larger prediction box was chosen. The center points of the prediction boxes for the ginger and the shoot were (191, 192) and (270, 238), respectively. Equations (3) and (4) gave δ = −30.2°, so the ginger seed should be rotated clockwise by 30.2°.
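The two worked examples above can be checked with a short sketch. Equations (3) and (4) are not reproduced in this section, so the sketch assumes the angle is the arctangent of the shoot center relative to the ginger center, with the vertical difference negated because the image y axis points downward. This assumed form reproduces both reported values of δ but may differ in detail from the authors' equations. The function names and the (x1, y1, x2, y2) box layout are hypothetical.

```python
import math


def largest_box(boxes):
    """Pick the box with the largest area; boxes are (x1, y1, x2, y2) tuples."""
    return max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))


def box_center(box):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def shoot_angle_deg(ginger_center, shoot_center):
    """Angle of the shoot relative to the ginger center, in degrees.

    Assumption: image coordinates with y pointing down, so dy is negated.
    """
    dx = shoot_center[0] - ginger_center[0]
    dy = shoot_center[1] - ginger_center[1]
    return math.degrees(math.atan2(-dy, dx))


# Center points reported in the text for the two images of Figure 9.
print(round(shoot_angle_deg((204, 207), (220, 317)), 1))
print(round(shoot_angle_deg((191, 192), (270, 238)), 1))
```

When several shoots are detected, `largest_box` would be applied to the shoot boxes first and `box_center` to the selected box, matching the rule of keeping only the shoot with the larger prediction box.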
Because the shape of the ginger seeds is very irregular and the ginger shoots are fragile, we designed a ginger seed transport channel. After the ginger seeds were placed on the channel, the seeds and the orientation of the ginger shoots were detected using an image acquisition device and a mobile terminal device. Next, an end-effector with vacuum suction cups picked up the seed at the center of the ginger prediction box and adjusted the orientation of the ginger shoot in real time to ensure that the ginger shoots were facing the same direction.

Discussion of the Improved Algorithm
This paper conducted the following three comparison experiments [52] to demonstrate the contribution of each proposed refinement to the YOLOv4 network: comparison experiments after replacing the feature extraction network, experiments with different attention mechanisms, and comparison experiments after adding Do-Conv convolution.

Performance Comparison of Feature Map Extraction Network
After the original backbone network of YOLOv4 was replaced with the MobileNetv2 network, the test results of the network were as shown in Figure 10. This strategy achieved good results, as mAP50 was reduced by only 0.37%. Moreover, compared to the original CSPDarknet53, the network computation was greatly reduced after using MobileNetv2 as the backbone network. Table 4 compares the model size, Params, and GFlops before and after the backbone network was replaced; when calculating Params and GFlops, the network input images were of the same size, and 416 × 416 pixels was chosen for this paper. As can be seen from Table 4, YOLOv4-LITE using MobileNetv2 as the backbone network was much smaller than the original network in terms of model size, Params, and GFlops, reducing these by 149.6 MB, 15.95 M, and 21.14, respectively, which indicated that the improved network had lower time and space complexity.

Different Attention Mechanisms Comparative Experiment
In this paper, SE, CBAM, and CA attention modules were each inserted in the IRB module of MobileNetv2 to verify the effect of CA attention by comparing their test results. The test results shown in Figure 11 indicated that, with the three attention modules, the AP50 of ginger shoots increased by 2.64%, 3.52%, and 5.95%, respectively, while the AP50 of ginger remained basically unchanged. The AP of ginger shoots improved most with CA attention, with AP50 increasing from 91.5% to 97.45% and AP50:95 rising from 41.63% to 52.32%, indicating that the addition of CA attention to YOLOv4-LITE effectively improved the detection accuracy of ginger shoots.

Analysis of Do-Conv Convolution
This paper conducted a comparison experiment in which Do-Conv replaced the conventional convolution in the FPN + PANet structure, to study the effect of Do-Conv convolution on the ginger seed recognition network. The loss curves of the YOLOv4-LITE network are shown in Figure 12, which indicate that the network converged faster during training after Do-Conv convolution was used. Moreover, the network test results after using Do-Conv are shown in Table 5, which show that the AP50 of ginger shoots improved by 2.18%. Furthermore, although the Do-Conv convolution layer added an extra depth-wise convolution to the conventional convolution, it did not increase the Params and GFlops. The reason for this was as follows: during the model training, D and W were folded into W′ (W′ = Dᵀ ∘ W), which had the same shape as the conventional convolution kernel that they replaced, so the Params and GFlops of the model did not change at inference.

Figure 13 shows the mAP of the various improved algorithms. Compared with YOLOv3 and YOLOv3-tiny [53], YOLOv4 had a better target detection performance, with a mAP50 of 99.1%, which was 1% and 1.8% higher than the others, respectively. When the backbone network was replaced with MobileNetv3 [54], Ghost-Net [55], or MobileNetv2 without any other improvement strategy, the network detection performance dropped dramatically, with mAP50 values of only 92.75%, 93.13%, and 93.54%, respectively. After CA attention or Do-Conv was added on top of MobileNetv2, the mAP50 reached 97.63% and 97.25%, respectively, and when CA attention and Do-Conv were used together, the mAP50 reached 98.73%. In summary, this study effectively enhanced the network performance through the series of improvements mentioned above.
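The folding argument in the Do-Conv analysis above can be illustrated with a small pure-Python sketch. It assumes the tensor layout D: (kernel positions × D_mul × C_in) and W: (C_out × D_mul × C_in), with the folded kernel obtained by summing over the D_mul axis; this layout, the function names, and the tiny integer tensors are assumptions made for illustration, not the authors' implementation. The sketch checks, on one input patch, that applying the depth-wise kernel and then the conventional kernel gives the same output as applying the single folded kernel.

```python
def fold_kernels(D, W):
    """Fold D (positions x d_mul x c_in) and W (c_out x d_mul x c_in) into one kernel
    of shape c_out x positions x c_in by summing over the d_mul axis."""
    n_pos, d_mul, c_in = len(D), len(D[0]), len(D[0][0])
    return [
        [
            [sum(D[p][d][c] * W[o][d][c] for d in range(d_mul)) for c in range(c_in)]
            for p in range(n_pos)
        ]
        for o in range(len(W))
    ]


def two_step_output(D, W, patch):
    """Apply the depth-wise kernel D to one patch, then the conventional kernel W."""
    n_pos, d_mul, c_in = len(D), len(D[0]), len(D[0][0])
    mid = [
        [sum(D[p][d][c] * patch[p][c] for p in range(n_pos)) for c in range(c_in)]
        for d in range(d_mul)
    ]
    return [
        sum(W[o][d][c] * mid[d][c] for d in range(d_mul) for c in range(c_in))
        for o in range(len(W))
    ]


def folded_output(W_folded, patch):
    """Apply the single folded kernel to the same patch."""
    n_pos, c_in = len(patch), len(patch[0])
    return [
        sum(Wf[p][c] * patch[p][c] for p in range(n_pos) for c in range(c_in))
        for Wf in W_folded
    ]


# Hypothetical sizes: 2 x 2 kernel (4 positions), d_mul = 5, 2 input and 3 output channels.
N_POS, D_MUL, C_IN, C_OUT = 4, 5, 2, 3
D = [[[(p + 2 * d + 3 * c) % 5 - 2 for c in range(C_IN)] for d in range(D_MUL)] for p in range(N_POS)]
W = [[[(o * 3 + d - c) % 7 - 3 for c in range(C_IN)] for d in range(D_MUL)] for o in range(C_OUT)]
patch = [[(p * 2 + c) % 4 for c in range(C_IN)] for p in range(N_POS)]

W_folded = fold_kernels(D, W)
print(len(W_folded), len(W_folded[0]), len(W_folded[0][0]))  # c_out, positions, c_in
print(two_step_output(D, W, patch) == folded_output(W_folded, patch))
```

The folded kernel has one weight per (output channel, kernel position, input channel), exactly as a conventional kernel does, which is why the extra depth-wise parameters need not appear in the inference-time Params and GFlops.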

Performance Comparison of the Overall Algorithm
For the recognition of ginger seed images, we first tried the traditional color difference segmentation method to segment ginger shoots and recorded the color components of ginger seed images in the RGB (red, green, blue) and HSV (hue, saturation, value) color spaces. It was found that the H component was less influenced by illumination and could be used to segment ginger shoots. However, the ginger seeds had been treated with drugs, which made the color characteristics of the ginger shoots unstable and the identification error rate high; therefore, the color difference segmentation method was not very reliable for the segmentation of ginger shoots.
Unlike the traditional color difference segmentation method, Hou et al. proposed a fast recognition method for ginger shoots based on YOLOv3. The method avoided the manual design of feature extractors and had good robustness, and the AP of ginger shoots reached 98.2%. However, it only identified ginger shoots and did not identify ginger, resulting in a complex calculation process for ginger shoot orientation. In addition, its backbone network was complex and contained a large number of redundant parameters, making it difficult to deploy on ginger seeders and also restricting the development of automated ginger seeders to some extent.
In this study, based on the constructed ginger seed dataset, the backbone network of YOLOv4 was replaced by MobileNetv2, which greatly reduced the network parameters and computational effort. Meanwhile, CA attention and Do-Conv convolution were added to the network to improve the detection of ginger shoots and the convergence speed of the model. The experimental results showed that mAP50 reached 98.73% and mAP75 reached 82.46%.

Conclusions
To achieve ginger seed detection and ginger shoot orientation discrimination, this paper introduced an improved YOLOv4-LITE network to detect ginger shoots and ginger in ginger seed images and then determined ginger shoot orientation by calculating the position of the largest ginger shoot relative to the ginger. First, the original CSPDarknet53 backbone network was replaced with MobileNetv2, which significantly reduced the network parameters and computation, thus facilitating migration of the network to a mobile terminal. Second, a coordinate attention mechanism was added to the backbone network to improve the detection of ginger shoots. Third, Do-Conv was adopted to replace some traditional convolutions, thus improving the model convergence speed. Finally, focal loss was used to address the imbalance between positive and negative samples and the imbalance between simple and difficult samples in the ginger dataset.
The experimental results showed that the mAP50 of the proposed improved YOLOv4-LITE network reached 98.73%. Compared with the original YOLOv4, its Params and GFlops decreased by 15.95 M and 21.14, respectively, while the mAP50 was only reduced by