Metaheuristic Optimization for Improving Weed Detection in Wheat Images Captured by Drones

Background and aim: Machine learning methods are examined by many researchers to identify weeds in crop images captured by drones. However, metaheuristic optimization is rarely used to optimize the machine learning models employed in weed classification. Therefore, this research targets developing a new optimization algorithm that can be used to optimize machine learning models and ensemble models to boost the classification accuracy of weed images. Methodology: This work proposes a new approach for classifying weed and wheat images captured by a sprayer drone. The proposed approach is based on a voting classifier that consists of three base models, namely, neural networks (NNs), support vector machines (SVMs), and K-nearest neighbors (KNN). This voting classifier is optimized using a new optimization algorithm composed of a hybrid of the sine cosine and grey wolf optimizers. The features used in training the voting classifier are extracted based on AlexNet through transfer learning. The significant features are selected from the extracted features using a new feature selection algorithm. Results: The accuracy, precision, recall, false positive rate, and kappa coefficient were employed to assess the performance of the proposed voting classifier. In addition, a statistical analysis is performed using the one-way analysis of variance (ANOVA) and Wilcoxon signed-rank tests to measure the stability and significance of the proposed approach. Moreover, a sensitivity analysis is performed to study the behavior of the parameters of the proposed approach in achieving the recorded results. Experimental results confirmed the effectiveness and superiority of the proposed approach when compared to the other competing optimization methods. The achieved detection accuracy using the proposed optimized voting classifier is 97.70%, the F-score is 98.60%, specificity is 95.20%, and sensitivity is 98.40%.
Conclusion: The proposed approach is confirmed to achieve better classification accuracy and outperforms other competing approaches.


Introduction
Climate change and worldwide population expansion are exerting significant pressure on agriculture to expand food production in terms of quality and quantity. Because the global population is expected to grow to nine billion people by 2050, agricultural production will need to quadruple to keep up [1]. Plant diseases, pests, and weed infestation pose enormous problems to agriculture [2][3][4][5]. Weeds are unwelcome plants that take nutrients from the soil, compete with profitable crops for light, water, and other resources, and spread by seeds or rhizomes. Weeds, pests, and diseases diminish crop yields and quality, reducing the amount of food, fiber, and biofuel that can be produced. Losses might be sudden or long-term, but on average, 42% of the production of a few key food crops is lost.
To achieve reasonable weed control and increased crop output, farmers invest billions of dollars every year in weed management. It is, therefore, critical to manage weeds in horticulture crops, as failure to do so results in lower yields and product quality [6]. If not handled properly, the employment of chemical and cultural control methods might negatively affect the ecosystem. Weed control will be more successful and long-lasting with low-cost technology for identifying and mapping weeds early in their life cycle. Crop diseases and pests can be reduced, and crop yields can be increased by as much as 34%, when early weed management is used [7]. Weeds may be managed in various ways, all of which take environmental considerations into account. Image processing is one of the most promising of these methods. Unmanned aerial vehicles (UAVs) are used in image processing to monitor crops and capture images of potential weeds in the fields. Due to their capacity to cover enormous areas quickly, UAVs have been proven to be helpful in agriculture because they do not create soil compaction or damage in the fields [8]. It is still a challenge to turn data gathered by UAVs into relevant information. Due to the manual labor required for segment size tweaking, feature selection, and rule-based classifier building, conventional data gathering and classification cannot be automated.
With the goal of increasing crop productivity while decreasing the prevalence of unwanted weeds, agricultural mechanization has emerged as a leading research field [9]. The intelligent spraying system relies heavily on accurately identifying weed plants to maximize agricultural yield [10]. Many machine learning-based algorithms have been developed for weed identification, making it a promising area of study for data scientists [11]. Many scientists used computer vision algorithms to categorize crop and weed plants [12]. Various deep learning and hand-crafted models have also been published and have made substantial contributions [13]. Color classification strategies for perennial weed identification [12], a CNN-based approach for distinguishing sugar beet plants from weeds [14], deep convolutional neural networks (DCNNs) [15], Gabor wavelets and neural networks [16], hyperspectral imaging with wavelet analysis [17], and decision trees and artificial neural networks [18] have all been proposed for the classification of weeds. The agriculture industry has benefited greatly from these strategies, which have produced extraordinary results. Better weed plant classification, however, requires more advanced and efficient methodologies to boost the accuracy of weed detection.
In this work, a publicly available wheat images dataset serves as the basis for this research. This dataset is utilized for training a deep neural network through transfer learning and feature extraction. In addition, to boost the classification accuracy of weed images, a new optimization algorithm is proposed to optimize the parameters of a new voting ensemble classifier composed of neural network (NN), k-nearest neighbors (KNN), and support vector machine (SVM) machine learning models. Moreover, a binary optimizer is proposed to optimize the feature selection process to select the best set of features. To evaluate the performance of the proposed methodology, a set of evaluation criteria is adopted to assess the effectiveness of the feature selection algorithm and the optimized voting ensemble model. In addition, statistical tests, such as the one-way analysis of variance and the Wilcoxon signed-rank test, are conducted to evaluate the significance and statistical difference of the proposed methodology. The recorded results are compared to those of other algorithms to show the superiority of the proposed approach.
This paper is structured in terms of six sections. Section 1 presents the introduction of the problem addressed in this paper and a summary of the proposed solution. Section 2 discusses the main milestones in the literature related to the task of weed detection. The materials and methods employed in the proposed solution are presented in Section 3. The proposed algorithms and solutions are explained in Section 4, and the experimental results are discussed in Section 5. Finally, the conclusions are presented in Section 6.

Literature Review
Weed identification using machine learning and image analysis has been increasingly popular in recent years, and the research presented here examines some of the most notable examples. Weed maps may be generated using several classification methods using UAV images [19][20][21][22]. However, recent state-of-the-art publications [23] reveal that machine learning algorithms are superior to traditional parametric methods in terms of accuracy and efficiency when dealing with complex data. The random forest (RF) classifier is one of the most well-liked machine learning algorithms for use in remote sensing [24]. This is because of its high generalization performance and fast processing time. The classification of high-resolution UAV images and agricultural mapping with RF has been proven beneficial. SVM is another well-known machine learning classifier [25][26][27][28], and it has been widely used to categorize weeds and crops. Meanwhile, the authors in [29] employed the KNN method to identify spreading thistles in sugar beet fields. New efforts on a machine learning-based approach for weed detection are summarized in Table 1. In [25], the authors create a land cover map of the Riverina region in New South Wales, Australia, covering a total area of 6200 km², to identify and categorize perennial crops in this vast region. They used object-based image analysis with supervised support vector machine classification to improve precision. After analyzing the data, they determined that the accuracy for a total item count using all twelve classes was 84.80%, but it increased to 90.20% when weighted by object area. The outcomes proved the feasibility of employing a succession of medium-resolution remote sensing images to generate comprehensive land cover maps over extensive perennial cropping regions. With an RF classifier, the authors of [3] created a real-time computer vision-based system to identify weeds in agricultural fields.
The classification model was trained using the authors' dataset, and then field data were used to verify its accuracy. They also created a fluid flow control system based on pulse width modulation, which uses the information provided by the vision system to regulate the spraying of an agrochemical. As a result, the authors proved the utility of their pesticide spraying method based on real-time vision.
In [34], a support vector machine (SVM) technique is used to detect weeds in chili field images. Examining how well the SVM classifier functions within a comprehensive weed-control strategy was the focus of their study. Five distinct types of weeds were depicted in the images they took of Bangladeshi chili crops. The authors used a global thresholding-based binarization algorithm to segment the images, separating the plants from the ground to extract features. Fourteen features were extracted from each image and sorted into color, shape, and moment invariants. Eventually, a support vector machine classifier was utilized to search for weeds. Their experiments determined that the SVM was 97% accurate over a set of 224 images. The authors of [27] presented a method for weed identification in sugar beet cultivation by utilizing a combination of numerous form elements to establish patterns that would be used to distinguish between sugar beets and weeds, which are visually quite similar. Images of sugar beet farms at Shiraz University served as the basis for this study. These images were altered using the MATLAB toolbox. Shape factors, moment invariants, and Fourier descriptors were among the properties of geometric space that the authors investigated to establish a distinction between weeds and sugar beets. Next, the authors utilized KNN and SVM classifiers, which achieved accuracies of 92.92% and 95%, respectively.
A color-index-based histogram is utilized to distinguish between weed, soybean, and soil classes, and a monochrome image is produced, as described in [28]. After scaling the image to a range of 0-255, greyscale images were obtained by creating and normalizing image histograms, which were then utilized for training BPNN and SVM classifiers. This study set out to find an alternate feature vector that would guarantee a high weed identification rate while also being computationally straightforward. In total, this method yielded accuracies of 96.60% for BPNN and 95.08% for SVM. The authors presented an automated weed identification system in [31] that could identify weeds at different developmental stages. In this case, sensors mounted on an unmanned aerial vehicle (UAV) were used to acquire color, multispectral, and thermal imagery. Using color images as the ground truth, researchers manually drew bounding boxes around plant bulbs and labeled them. Next, they turned the gathered images into normalized difference vegetation index (NDVI) images using image processing techniques. At last, they used machine learning techniques to sort the weeds from the useful plants.
Images were obtained from a plant laboratory in Belgium, and the authors of [5] studied how well a hyperspectral snapshot mosaic camera worked at identifying weeds and maize. The calibration formula reflectance was obtained after these raw images were processed for the band features. One hundred eighty-five features were discovered across reflectance, NDVI, and RVI in the VIS and NIR spectra. To further streamline the process, the authors turned to a principal component analysis-based feature reduction technique. These data were then fed into feature selection algorithms, which were used to isolate relevant features. In the end, an RF classifier was employed to distinguish between weeds and crops. Accuracy for identifying various weeds was up to 81% overall. At an early stage in the development of herbaceous crops, the authors of [30] proposed an automated, RF-based image processing method for weed detection. This method combines digital surface models (DSMs) with orthomosaic methods using images captured by unmanned aerial vehicles (UAVs). After that, an RF classifier was utilized to differentiate between weeds and crops/soil, with results of 87.90% for sunflower fields and 84% for cotton fields. Obtaining radiometrically calibrated multispectral imaging, segmenting images, and employing a machine learning model are the essential components of a straightforward methodology presented in [24] for monitoring emerging and submerged invasive water soldiers. The eBee mapping drone was used to get the imagery. Pix4Dmapper Pro 3.0 was used to create the orthomosaic from the multispectral images.
The authors of [4] presented a method that uses UAVs to precisely predict when avocado plants are at particular stages of development. They shot the multispectral images with a Parrot Sequoia camera. After separating the digital terrain model from the digital surface model, a canopy height model was utilized to determine the height of the trees. Then, they used orthomosaic at-surface reflectance images and a variety of vegetation indices depending on the brightness of the plants in the red edge and NIR bands. The final step was implementing an RF method, which ended up being 96% accurate. UAV images from sunflower and maize fields were utilized in a weed mapping strategy proposed for precision agriculture in [33]. Object-based image analysis (OBIA) with a support vector machine (SVM) method linked with feature selection approaches was utilized to solve the spectral similarity problem for crop and weed pixels in the early growth stage. Images of the sunflower and corn fields were captured by the UAV over the private Spanish farms La Monclova and El-Mazorcal. After that, the images were mosaicked with the help of the Agisoft Photoscan program, and then the items in the subsample were labeled using unsupervised feature selection approaches, with the automatic labeling done under human oversight. These items were categorized as color histograms and data features based on remote-sensing measurements (first-order statistics, textures, etc.). The results showed that this SVM-based method had an overall accuracy of about 95.50%.

Materials and Methods
In this section, the dataset employed in this study is presented along with the key machine learning techniques, such as baseline classification models and ensemble approaches. In addition, the basics of grey wolf and sine cosine optimization methods forming the basis of the proposed optimization algorithm are presented in this section.

Data Collection
Field crops can be captured using sensors and UAVs equipped with cameras. In this work, the wheat crop images are captured using an autonomous sprayer drone, and the dataset is freely available on Kaggle [35]. Sample images from this dataset are shown in Figure 1. The dataset consists of 1176 wheat images and 4851 weed images in the training set. The testing set is composed of 130 wheat images and 540 weed images.

Pre-Trained AlexNet
Convolutional neural networks (CNNs) are a subset of multi-layer neural networks that extract information from images by analyzing their pixels [36]. Convolution, pooling, and fully connected layers are a conventional CNN's three fundamental building blocks. Convolution layers do the bulk of a CNN's computations and are its most important building blocks. A convolution layer filters the input using a convolutional filter and sends the result to the following layer. The applied filter, which also serves as a feature identifier, yields a feature map. The pooling layer's job is to lower the space needed for the spatial representation and the computations that follow each successive convolution. Each sliced input is pooled in the pooling layer, lowering the computational burden of the subsequent convolution layer. Extraction and reduction of features from input images is achieved by applying convolution and pooling layers. The fully connected layer produces a number of outputs equal to the number of classes. The layers that make up a CNN architecture are stacked versions of each other. Despite some subtle differences, all CNNs are built on the same basic structure. In this work, the AlexNet pre-trained architecture is employed to extract useful features for classifying wheat and weed images.

Grey Wolf Optimizer
Grey wolf optimizer (GWO) motions are based on those of genuine wolves while they are on the prowl or hunting. Wolves tend to live in packs of varied sizes. A pack of wolves has a minimum of five members and a maximum of twelve members. There are four distinct varieties of wolves, each with a specific function within the pack. These are known as alpha, beta, omega, and delta [37]. Alpha-type wolves often make decisions on when and where to go for a stroll, hunt, and sleep with the assistance of beta-type wolves in the pack. It is generally accepted that alpha wolves are dominant wolves, with beta wolves serving as their subordinates. The beta wolf is the most likely candidate to take over when the alpha dies. When alphas make judgments, the betas are there to support them and provide feedback so the alphas may make better decisions in the future. Delta wolves are typically subservient to the alpha and beta wolves but dominate the omegas. Delta wolves are divided into five groups: caretakers, hunters, elders, scouts, and sentinels. In the pack, each category has a distinct purpose. As the group's "scapegoats", the omega-type wolves have to submit to all the other wolves in the pack.
The grey wolf optimizer uses alpha, beta, and delta agents to lead the search for the optimum solution. In contrast, omega agents follow these three agents in the quest for the best solution. The alpha solution is considered the best-fitting solution in the grey wolf optimizer. On the other hand, the solutions of type beta and delta signify the second and third most suitable solutions.
Mathematically, the first, second, and third fittest solutions are denoted by (P_α), (P_β), and (P_δ), respectively, whereas (P_ω) refers to all other solutions. The update process of the GWO algorithm is depicted in Figure 2. In this figure, the omega wolves and other hunters are guided by the alpha, beta, and delta wolves to efficiently manage the hunting process. The position updating is performed as follows.
P(t + 1) = P_s(t) − A · D, with D = |C · P_s(t) − P(t)| (1)

where P is the wolf's current location and t is the number of iterations the search algorithm has gone through. The prey's location is denoted by P_s(t), and the coefficient vectors A and C are defined as follows.

A = 2a · r_1 − a (2)

C = 2 · r_2 (3)

There are two sets of random values for the vectors r_1 and r_2, and the values of a are chosen from the range [0, 2] in descending order. The updated values of the vector a govern the balance between the exploitation and exploration operations [37]. The following formula calculates the most recent change to this vector.

a = 2 − 2t/M_t (4)

where M_t is the maximum number of iterations. These positions are utilized to lead the other solutions, given by the symbol P_ω, to move in the direction of the prey, as seen in the search process in Figure 2. The three best-fitting solutions are P_α, P_β, and P_δ. The process of updating the positions of the wolves is described using the following equations by substituting P_s(t) in Equation (1) with P_α, P_β, and P_δ.

P_1 = P_α − A_1 · |C_1 · P_α − P|, P_2 = P_β − A_2 · |C_2 · P_β − P|, P_3 = P_δ − A_3 · |C_3 · P_δ − P| (5)

The calculations of A_1–A_3 and C_1–C_3 are performed by Equations (2) and (3), respectively. The population's new position is calculated as follows.

P(t + 1) = (P_1 + P_2 + P_3)/3 (6)
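Assuming the standard GWO formulation described above (three leader-guided position estimates averaged into the new position), one iteration can be sketched in NumPy. This is a minimal illustration, not the authors' implementation; function and variable names are chosen for clarity.

```python
import numpy as np

def gwo_step(positions, fitness_fn, a):
    """One GWO iteration: rank wolves, then pull every wolf toward the
    alpha, beta, and delta leaders and average the three estimates.

    positions : (n_wolves, dim) array of candidate solutions
    a         : scalar decreasing linearly from 2 to 0 over iterations
    """
    order = np.argsort([float(fitness_fn(p)) for p in positions])
    p_alpha, p_beta, p_delta = positions[order[:3]]  # three fittest wolves

    new_positions = np.empty_like(positions)
    for i, p in enumerate(positions):
        guided = []
        for leader in (p_alpha, p_beta, p_delta):
            r1, r2 = np.random.rand(p.size), np.random.rand(p.size)
            A = 2 * a * r1 - a              # coefficient vector A
            C = 2 * r2                      # coefficient vector C
            D = np.abs(C * leader - p)      # distance to the leader
            guided.append(leader - A * D)   # leader-guided estimate
        new_positions[i] = np.mean(guided, axis=0)  # average of P_1, P_2, P_3
    return new_positions
```

On a simple sphere objective, repeated calls with a shrinking `a` drive the pack toward the origin.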

Sine Cosine Algorithm
In [38], the sine cosine algorithm (SCA) was presented for the first time. The sine cosine oscillation function plays a crucial role in identifying the best possible solution locations, as shown in Figure 3. A set of random variables are used to represent the steps of the SCA's operation [39,40].

• The movement location.
• The motion direction.
• Swapping between the sine and cosine components.
• Emphasizing/de-emphasizing the destination effect.
The update process of the candidate solutions is performed using the following equation.

S(t + 1) = S(t) + r_1 · sin(r_2) · |r_3 · P − S(t)|, if r_4 < 0.5
S(t + 1) = S(t) + r_1 · cos(r_2) · |r_3 · P − S(t)|, if r_4 ≥ 0.5 (7)

where t is the iteration number, and the positions of the current and updated solutions at iterations t and t + 1 are denoted by S(t) and S(t + 1), respectively. The position of the best solution is referred to as P. The random variables r_2 ∈ [0, 2π], r_3 ∈ [0, 2], and r_4 ∈ [0, 1] control the motion direction, the emphasis on the destination, and the swap between the sine and cosine components, respectively. The equation shows, for instance, that the position of the optimal solution affects the location of the current solution, making it simpler to reach the optimal solution. The following equation expresses the dynamic change in the value of r_1.

r_1 = a − a · t/t_max (8)

where a is a constant, and t and t_max represent the current and maximum iterations, respectively. Due to its reliance on a single optimal solution to guide the other solutions, the SCA algorithm is more robust than many other meta-heuristic algorithms presented in the literature [39,40]. This approach converges quickly and has low memory requirements compared to other algorithms. However, as the number of locally optimal solutions increases, the algorithm's performance degrades. To avoid being stuck in a local optimum, the proposed new algorithm incorporates the SCA optimizer and the GWO algorithm, taking advantage of their rapid convergence rates and memory efficiency and ensuring a balanced set of exploration and exploitation activities.
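Assuming the standard SCA update rule (sine or cosine step toward the best solution, chosen at random per dimension), one iteration can be sketched as follows. This is a hedged sketch, not the paper's code; names and the ranges of the random variables follow the original SCA formulation.

```python
import numpy as np

def sca_step(solutions, best, t, t_max, a=2.0):
    """One SCA iteration: move each solution toward `best` along a
    sine or cosine trajectory, with step size shrinking over time."""
    r1 = a - a * t / t_max  # shrinks exploration as iterations progress
    new = np.empty_like(solutions)
    for i, s in enumerate(solutions):
        r2 = np.random.uniform(0, 2 * np.pi, s.size)
        r3 = np.random.uniform(0, 2, s.size)
        r4 = np.random.rand(s.size)
        step_sin = r1 * np.sin(r2) * np.abs(r3 * best - s)
        step_cos = r1 * np.cos(r2) * np.abs(r3 * best - s)
        # per-dimension switch between the sine and cosine components
        new[i] = np.where(r4 < 0.5, s + step_sin, s + step_cos)
    return new
```

Because only the single best solution guides the population, the per-iteration memory footprint stays small, which is the property the hybrid ADSCFGWO algorithm exploits.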

Baseline Machine Learning Models
This paper employs three baseline machine learning models to form the proposed ensemble voting approach. These base models are neural networks, k-nearest neighbors, and support vector machines. In this section, the basics of these models are presented briefly. In addition, two main types of ensemble methods are covered: the bagging classifier with random forest as an averaging technique, and AdaBoost with the voting ensemble as a boosting technique. An introduction to these types of ensemble models is also presented in this section.

Neural Networks (NN)
Two or more layers of neurons and connections in the neural network structure allow it to learn a non-linear decision boundary. These neurons are often known as processing elements (PEs). The PEs use special training algorithms (such as ADAM and SGD) to try to mimic the operation of the human nervous system [42]. Input and output data can be separated by a "hidden layer", a layer between the two that is not visible to the user. The weighted total at hidden node j is calculated as follows:

S_j = Σ_i w_ij · I_i + β_j (9)

in which I_i is the input variable i, w_ij represents the hidden layer connection weight between I_i and neuron j, and β_j is the bias. The sigmoid activation function may be used to define the output of node j as follows:

f_j(S_j) = 1/(1 + e^(−S_j)) (10)

The value of f_j(S_j) is used to define the network output using the hidden layer neurons as:

y_k = Σ_j w_jk · f_j(S_j) + β_k (11)

where w_jk and β_k represent the weights between the neurons in the hidden layer and output node k, and the output bias, respectively.
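The forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration of the weighted-sum and sigmoid equations, with illustrative variable names; it is not the network used in the paper.

```python
import numpy as np

def sigmoid(s):
    """Sigmoid activation f(s) = 1 / (1 + e^(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def forward(I, W_hidden, b_hidden, W_out, b_out):
    """Single-hidden-layer forward pass:
    hidden sums S_j, sigmoid activations f_j, then the output layer."""
    S = I @ W_hidden + b_hidden   # weighted totals of the hidden neurons
    f = sigmoid(S)                # hidden activations
    return f @ W_out + b_out      # network outputs
```

Training (e.g., with ADAM or SGD) would adjust `W_hidden`, `b_hidden`, `W_out`, and `b_out` to minimize a loss over this forward pass.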

K-Nearest Neighbors (KNN)
The k-nearest neighbors (KNN) method is a non-parametric supervised classification technique that is both straightforward and useful in many contexts. Among classifiers used for pattern recognition, the KNN classifier is recommended due to its simple implementation, high accuracy, and speed of results [25]. It has various applications, including pattern recognition, machine learning, text classification, data mining, and object recognition. The KNN method employs a technique known as "classification by analogy", wherein an unknown data item is compared to its neighbors in the training set. The Euclidean distance is the standard for comparing two samples' degrees of similarity. Normalizing attribute values ensures that attributes with broader ranges are not given more weight than attributes with narrower ranges. Using KNN, the most common category among the neighbors is used to categorize an unknown pattern. The following Euclidean distance equation is used to measure the distance between known and unknown data to determine the best category of the unknown label.

d(x, y) = √(Σ_{i=1}^{n} (x_i − y_i)²) (12)

To classify dataset samples using the KNN approach, the nearest samples are considered to determine the final decision [43]. This approach depends mainly on the value of K, which represents the number of neighbors considered in classifying the dataset samples in terms of the Euclidean distance in Equation (12).
This technique is used in conjunction with the NN in the proposed voting ensemble classifier for boosting the classification accuracy of wheat and weed images.

Support Vector Machine (SVM)
Support vector machine (SVM) is one of the effective machine learning models that can achieve promising performance when combined with deep networks and other machine learning models [44]. The basic formula of SVM is presented in the following.
f(a) = w · a + d (13)

where a is the input variable, w is the weight vector, and d is the model's bias (error) term. The discrepancy between anticipated and actual values can be reduced using SVM. According to the error indicator, SVM predicts the output label using an error reduction strategy based on the following optimization model.

Minimize: (1/2)||w||² + C Σ_{i=1}^{n} (ξ_i + ξ_i*) (14)

Subject to: b_i − (w · a_i + d) ≤ ε + ξ_i (15)

(w · a_i + d) − b_i ≤ ε + ξ_i*, with ξ_i, ξ_i* ≥ 0 (16)

Data violations whose values are larger than ε, the acceptable range around the observable values, are penalized by the coefficient of punishment C, where w denotes the weights of the variables, a_i the input variables, and b_i the target observations. The values of the variables in Equations (15) and (16) are estimated to be used in Equation (14). It is possible to use a kernel function in SVM to describe the high-dimensional feature space of the input data points. The kernels are equipped to deal with a wide range of problems. Sigmoid, linear, polynomial, and radial basis functions (RBF) are among the four well-known SVM kernels employed. Because the RBF kernel has been shown to be capable of generalizing well to varied datasets, it was utilized in this investigation. As a result, Equation (13) may be interpreted as follows:

f(a) = Σ_{i=1}^{n} θ_i · H(a, a_i) + d, with H(a, a_i) = exp(−γ ||a − a_i||²) (17)

where H(a, a_i) is the kernel function and γ is the parameter of this function. Unknown values of SVM parameters such as C and ε are used as decision variables. The optimization procedure must thus include them.
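The behavior of the RBF kernel is easy to verify numerically: it equals 1 for identical inputs and decays with squared Euclidean distance. A minimal sketch (the `gamma` value is illustrative, not the paper's tuned value):

```python
import numpy as np

def rbf_kernel(a, a_i, gamma=0.5):
    """RBF kernel H(a, a_i) = exp(-gamma * ||a - a_i||^2)."""
    a, a_i = np.asarray(a, float), np.asarray(a_i, float)
    return float(np.exp(-gamma * np.sum((a - a_i) ** 2)))
```

In the proposed approach, `gamma` (together with C and ε) is one of the SVM decision variables tuned by the optimizer.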

Ensemble Models
Ensemble approaches aim to combine the outputs of machine learning (ML) classifiers to reduce variance. Several classifiers are built independently, and their results are then averaged (e.g., bagging classifier, random forest, and soft/hard voting techniques). Overfitting is less of an issue with these methods. One of the most often used and effective ensemble techniques for classification and regression is random decision forests (RF). Compared to single classifiers, ensemble models have attracted much attention because of their accuracy and noise tolerance [45]. The average ensemble classifier combines the output predictions using the following formula.

ŷ = (1/N) Σ_{i=1}^{N} ŷ_i (18)

where N is the number of classifiers in the ensemble and ŷ_i is the prediction of classifier i.
Using a set of weak classifiers, the boosting approach of ensemble models creates a robust classifier. To forecast unseen observations accurately, this technique updates a set of classifier weights over the training data at each iteration. Any machine learning approach that accepts weights on the training set can be used as a base estimator. Various classifiers are trained on randomly generated training sets to generate the final result. The test sample is classified by combining the outputs of all models using uniform averaging or voting procedures over the class labels predicted by all classifiers in the ensemble. Because of the randomization in its structural approach, this methodology may be utilized to reduce variance and subsequently be used to form an ensemble. When machine learning estimators are combined and a majority vote is taken over the output of each estimator, the approach is termed hard voting. In soft voting, on the other hand, the class label is returned as the argmax of the sum (i.e., the average) of the predicted probabilities [46]:

ŷ = argmax_k Σ_{j=1}^{N} w_j · p_jk (19)

where p_jk is the probability that classifier j assigns to class k and w_j is the weight of classifier j. For each classifier, the predicted class probabilities are gathered, and a weighted average over the classifiers is calculated. The class with the highest average probability determines the final class label. ML classifiers of equivalent performance can use this strategy to counteract one another's flaws.
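The soft-voting rule can be sketched directly in NumPy: stack the per-classifier probability matrices, average them (optionally with weights), and take the argmax per sample. A minimal sketch, not the paper's tuned ensemble:

```python
import numpy as np

def soft_vote(prob_list, weights=None):
    """Soft voting: average per-classifier class probabilities
    (optionally weighted) and return the argmax class for each sample."""
    avg = np.average(np.stack(prob_list), axis=0, weights=weights)
    return avg.argmax(axis=1)
```

For instance, two classifiers that disagree mildly on a sample are resolved by whichever class has the higher averaged probability.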

The Proposed Methodology
In this section, the proposed methodology is explained. The methodology starts with extracting the features of the input images using the deep network through transfer learning. The extracted features are then processed to select the most relevant features that boost the classification accuracy. The selected features are then used to learn three base models: NN, KNN, and SVM. These models are employed in a voting classifier optimized using the proposed optimization algorithm. The steps of the proposed methodology are depicted in Figure 4. The following subsections discuss the main steps of the proposed methodology.

Transfer Learning
In deep learning applications, transfer learning is widely used [47] and is beneficial when only a limited dataset is available. To learn a new classification task, a pre-trained network such as AlexNet is considered. In this work, we adopted AlexNet, which is trained on the large ImageNet dataset. In the transfer learning process, the three fully connected layers of AlexNet are replaced with the proposed voting classifier. The transfer learning process employed in this work is depicted in Figure 5.

Feature Extraction
Processing raw data to extract additional variables that aid machine learning algorithms is the focus of the feature extraction process. This work adopts AlexNet [48] for feature extraction. Figure 4 shows how AlexNet resizes the input image to a fixed size of 227 × 227 × 3 and applies a 96-filter convolution layer with an 11 × 11 window and a stride of 4, a 256-filter convolution layer with a 5 × 5 window, and then 384-, 384-, and 256-filter convolution layers with 3 × 3 windows for the remaining three layers. After the first, second, and final convolutional layers, the network applies 3 × 3 max-pooling layers with a stride of 2. Following the fifth convolutional layer, there are two fully connected layers with 4096 output neurons each. Afterward, a single fully connected output layer, which originally had 1000 output classes, ends the network. Finally, dropout, ReLU activations, and preprocessing are critical to achieving top results in computer vision applications. This work replaces the last three layers with the proposed optimized voting ensemble model, which classifies only two classes (wheat and weed).
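The spatial dimensions of these layers follow from the standard convolution output-size formula, floor((size + 2·pad − kernel)/stride) + 1 (the formula itself is standard CNN arithmetic, not stated in the paper). A small sketch verifying it for AlexNet's early layers:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a conv/pool layer."""
    return (size + 2 * pad - kernel) // stride + 1

# AlexNet's early layers on a 227x227x3 input:
s = conv_out(227, 11, stride=4)  # conv1: 96 filters, 11x11, stride 4 -> 55
s = conv_out(s, 3, stride=2)     # maxpool: 3x3, stride 2 -> 27
s = conv_out(s, 5, pad=2)        # conv2: 256 filters, 5x5, pad 2 -> 27
```

The same formula explains why the 3 × 3 convolutions with padding 1 preserve the 13 × 13 maps deeper in the network.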

The Proposed Optimization Algorithm
The optimization of the proposed voting ensemble classifier is performed in terms of a new optimization algorithm based on the SCA and GWO optimization algorithms. The proposed optimization algorithm is referred to as the adaptive dynamic sine cosine fitness grey wolf optimization (ADSCFGWO) algorithm, with the steps listed in Algorithm 1.
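The full update equations of ADSCFGWO are given in Algorithm 1. As background, the sketch below combines the standard GWO and SCA position-update rules that the hybrid draws from, applied to a toy sphere function. It simply alternates the two rules with elitism, so it illustrates the ingredients rather than the authors' adaptive algorithm; the alternation scheme and all names are assumptions.

```python
import math
import random

def gwo_step(x, alpha, beta, delta, a):
    """Standard grey wolf position update toward the three leaders."""
    new = []
    for i in range(len(x)):
        moves = []
        for lead in (alpha, beta, delta):
            A = 2 * a * random.random() - a
            C = 2 * random.random()
            moves.append(lead[i] - A * abs(C * lead[i] - x[i]))
        new.append(sum(moves) / 3)
    return new

def sca_step(x, best, r1):
    """Standard sine cosine position update toward the best solution."""
    new = []
    for i in range(len(x)):
        r2 = 2 * math.pi * random.random()
        r3 = 2 * random.random()
        move = r1 * (math.sin(r2) if random.random() < 0.5 else math.cos(r2))
        new.append(x[i] + move * abs(r3 * best[i] - x[i]))
    return new

def sphere(x):
    return sum(v * v for v in x)

random.seed(0)
dim, pop_size, iters = 5, 20, 50
pop = [[random.uniform(-10, 10) for _ in range(dim)] for _ in range(pop_size)]
for t in range(iters):
    pop.sort(key=sphere)
    alpha, beta, delta = pop[0], pop[1], pop[2]
    a = 2 - 2 * t / iters    # GWO control parameter, linearly decreasing
    r1 = 2 - 2 * t / iters   # SCA amplitude, linearly decreasing
    # Keep the leader (elitism) and alternate the two update rules;
    # the real hybrid blends them adaptively.
    pop = [alpha] + [gwo_step(x, alpha, beta, delta, a) if i % 2 == 0
                     else sca_step(x, alpha, r1)
                     for i, x in enumerate(pop[1:])]
best = min(pop, key=sphere)
```

With elitism the best fitness is non-increasing, so the loop steadily refines the leader.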

Exploration/Exploitation Balance
The proposed ADSCFGWO algorithm automatically strikes a balance between exploration and exploitation by dividing the population into two groups: an exploration group and an exploitation group. Initially, the exploration group comprises 70% of the population; this large early exploration group helps to identify new and promising search regions. As overall fitness grows and more individuals attain exploitative fitness values, the exploration group shrinks from 70% to 30% of the population. If a better solution cannot be identified, an elitist strategy maintains convergence by keeping the current leader in subsequent populations. ADSCFGWO can re-expand the exploration group at any point if the fitness of the group's leader has not improved sufficiently over three iterations.
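The shrinking exploration-group schedule described above can be sketched as follows. The linear decay and the exact reset rule are assumptions, since the text only specifies the 70%/30% bounds and the three-iteration stagnation trigger.

```python
def exploration_size(pop_size, t, max_iter, stall_count, high=0.7, low=0.3):
    """Size of the exploration group at iteration t.

    Shrinks linearly from 70% to 30% of the population over the run, and
    re-expands to the high fraction if the leader has not improved for
    three iterations, as described in the text. The linear schedule
    itself is an assumption for illustration.
    """
    if stall_count >= 3:
        frac = high  # re-expand exploration on stagnation
    else:
        frac = high - (high - low) * t / max_iter
    return max(1, round(pop_size * frac))
```

For a population of 20, the group starts at 14 members, ends at 6, and snaps back to 14 whenever stagnation is detected.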

Fitness Function
The following equation is used to assess the quality of the solutions discovered by the optimization algorithms:

Fitness = α · Error(P) + β · (|S| / |A|)  (20)

where P stands for the model's variables, Error(P) is the classification error, |S| is the number of selected features, and |A| is the total number of features. The weights α ∈ [0, 1] and β = 1 − α control the relative importance of the classification error and the size of the selected feature subset.
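A minimal sketch of Equation (20) in Python, assuming lower fitness is better, that |S| counts the selected features and |A| the total features, and a typical weighting such as α = 0.99 (the text does not fix α's value):

```python
def fitness(error, n_selected, n_total, alpha=0.99):
    """Fitness of Equation (20): alpha * Error(P) + beta * |S|/|A|.

    error      -- classification error of the model built on the subset
    n_selected -- |S|, number of selected features (assumed meaning)
    n_total    -- |A|, total number of features
    alpha      -- weight in [0, 1]; beta = 1 - alpha (0.99 is a common
                  choice in the feature-selection literature, assumed here)
    Lower values are better: the function rewards both low error and
    small feature subsets.
    """
    beta = 1 - alpha
    return alpha * error + beta * n_selected / n_total
```

With α close to 1, accuracy dominates and the subset-size term acts as a tie-breaker between equally accurate subsets.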

Feature Selection
Features that meet particular criteria, such as originality, consistency, and meaningfulness, are identified and selected throughout the feature selection process. Two binary values (0 and 1) are utilized in the feature selection procedure to limit the search space. Therefore, optimizers based on continuous values require an update to deal with this problem effectively. This is an essential phase in feature engineering, since it allows optimizers to choose the most relevant features for maximum performance. There are several ways to represent selected features, such as a binary vector in which each feature has an equal chance of being included in the solution or not [49]. Random populations of vectors with random features can be utilized as a starting point for meta-heuristic algorithms, followed by an iterative process of exploration and exploitation to identify the best collection of features [50]. To determine whether a feature is relevant, the search space is confined to the binary values 0 and 1 alone. The proposed binary ADSCFGWO (bADSCFGWO) method transforms the continuous values produced by the ADSCFGWO algorithm into binary {0, 1} values to fit the feature selection procedure. The Sigmoid function used to convert the continuous solution to binary values is given by

x_binary = 1 if Sigmoid(m) ≥ 0.5, and 0 otherwise, where Sigmoid(m) = 1 / (1 + e^(−10(m − 0.5)))  (21)

and m refers to the best solution at iteration t. The steps of the proposed binary optimization algorithm (running the proposed ADSCFGWO algorithm, converting solutions to binary using Equation (21), measuring the fitness function, and updating the algorithm parameters until the best solution P* is returned) are listed in Algorithm 2.
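A minimal sketch of the binary conversion of Equation (21), assuming a sigmoid centred at 0.5 with an assumed steepness of 10 and a deterministic 0.5 threshold (some binary optimizers threshold against a random number instead):

```python
import math

def to_binary(position, steepness=10.0):
    """Map a continuous solution to a {0, 1} feature mask (Equation (21)).

    Each component m is squashed with a sigmoid centred at 0.5 and
    thresholded at 0.5; the steepness value is an assumption. A 1 keeps
    the corresponding feature, a 0 discards it.
    """
    mask = []
    for m in position:
        s = 1.0 / (1.0 + math.exp(-steepness * (m - 0.5)))
        mask.append(1 if s >= 0.5 else 0)
    return mask
```

For example, a continuous solution [0.9, 0.1, 0.5] maps to the feature mask [1, 0, 1], keeping the first and third features.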

Experimental Results
To evaluate the proposed weed detection approach, a set of experiments was conducted to assess the performance of each step of the proposed approach. The assessment covered feature selection, the classification methods, and the proposed optimized voting classifier. The following sections present the details of the achieved results.

Configuration Parameters
The first set of experiments was conducted to determine the best collection of values assigned to the configuration parameters. Table 2 presents the set of values of the parameters of the optimization of the feature selection process. In addition, the configuration parameters of the grey wolf and other optimization algorithms are presented in Table 3. The values of these parameters are employed in the proposed optimization algorithm and the algorithms used in the comparison experiments.

Evaluation Metrics
The metrics used to assess the feature selection approach are presented in Table 4. These criteria include the average fitness size, average error, standard deviation, best fitness, worst fitness, and mean error. In these criteria, M refers to the number of runs of the optimizer, j refers to the run number, g*_j is the best solution at run number j, size(g*_j) is the size of the best solution vector, N is the number of points in the test set, C_i is the output class label for data point i corresponding to the label L_i, and D denotes the total number of features.
On the other hand, the metrics used to assess the proposed voting classifier are presented in Table 5. These metrics include F1-score, specificity, accuracy, sensitivity, Nvalue, and Pvalue. The true positive, true negative, false positive, and false negative measures used in these metrics are denoted by TP, TN, FP, and FN, respectively.

Table 4. Evaluation metrics used in assessing the feature selection approach.

Table 5. Evaluation metrics used in assessing the proposed optimized voting classifier.
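Since the formulas of Table 5 are not reproduced here, the following sketch computes the standard definitions of the classification metrics from the TP, TN, FP, and FN counts; the function name is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard classification metrics from the confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)      # recall / true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    precision = tp / (tp + fp)        # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "f1": f1}
```

These are the textbook formulas; the paper additionally reports Nvalue and Pvalue, which are not reproduced in this sketch.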

Feature Extraction Results
Deep learning is one of the main approaches to extracting effective features from images. In this work, to select the deep network that extracts the best features from the given images, an experiment is conducted to study the effectiveness of the features extracted from three deep networks, namely, VGGNet, ResNet-50, and AlexNet. The results are recorded and presented in Table 6. As presented in this table, the best results are achieved using AlexNet, and thus this network is adopted for the rest of the conducted experiments.

Evaluating the Proposed Feature Selection Method
The feature selection applied to the weed features is performed using the proposed binary ADSCFGWO algorithm. The achieved results are compared with those of other state-of-the-art binary optimization techniques, namely, the grey wolf optimizer (GWO) [51], hybrid GWO and PSO (bGWO_PSO) [52], particle swarm optimization (PSO) [53], the whale optimization algorithm (WOA) [54], the firefly algorithm (FA) [55], and the genetic algorithm (GA) [56]. The evaluation results are presented in Table 7. As presented in the table, the proposed binary ADSCFGWO algorithm achieved the best average error (0.69504) compared to the other optimization algorithms. In addition, the average select size, mean fitness, worst fitness, and standard deviation of the proposed algorithm are superior to those of the other algorithms.

Evaluating The Proposed Optimized Voting Classifier
Three base models, KNN, SVM, and NN, were experimented with and evaluated separately. The recorded results are presented in Table 8. In this table, the accuracy achieved by KNN, SVM, and NN is 89.1%, 92.1%, and 93.5%, respectively. From these results, it can be noted that the NN base model achieves the best performance. The proposed ADSCFGWO algorithm is then used to optimize the parameters of a voting ensemble model composed of the three base models. To prove the superiority of the proposed approach, the results are compared to those of four other optimization algorithms, namely, WOA, GWO, GA, and PSO. Table 9 presents the recorded results. In this table, it can be noted that the results achieved by the proposed optimized voting ensemble are better than those achieved by optimizing the voting ensemble using the other optimization methods.
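The paper does not detail here exactly which parameters of the voting ensemble are tuned; a common choice, assumed for this sketch, is a weighted soft vote over the base models' class probabilities, with the per-model weights set by the optimizer.

```python
def weighted_vote(probabilities, weights):
    """Weighted soft vote over base-model class probabilities.

    probabilities -- one probability vector per base model for a sample,
                     e.g. [[p_wheat, p_weed], ...] for NN, SVM, and KNN
    weights       -- per-model weights, e.g. tuned by the optimizer
                     (hypothetical parameterization, assumed here)
    Returns the index of the winning class.
    """
    total = sum(weights)
    n_classes = len(probabilities[0])
    combined = [sum(w * p[c] for p, w in zip(probabilities, weights)) / total
                for c in range(n_classes)]
    return combined.index(max(combined))
```

With equal weights this reduces to plain soft voting; the optimizer's role would be to upweight the stronger base models such as the NN.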
On the other hand, a set of experiments is conducted to analyze the performance of the proposed approach statistically. Table 10 presents the statistical analysis results. This table shows that, based on 20 random samples, the mean accuracy is 97.74% and the standard deviation is very small (0.0004894), indicating the robustness of the proposed approach. When these results are compared to those of the other optimization algorithms, the superiority of the proposed approach is evident. The significance and stability of the proposed approach are studied in terms of the analysis of variance (ANOVA) and Wilcoxon signed-rank tests. The results are shown in Tables 11 and 12. The measured p-value of the ANOVA and Wilcoxon tests is (p < 0.0001), which indicates the significance of the proposed approach. A visual representation of the results achieved by the proposed approach is shown in Figure 6. This figure shows the residual plot with residual error in the range of (−0.015 to 0.010). The homoscedasticity and QQ plots show a robust prediction of the class labels. On the other hand, the heatmap indicates a promising performance using the proposed ADSCFGWO algorithm, which is better than the other optimization algorithms. The receiver operating characteristic (ROC) plot depicted in Figure 7 shows robust detection results. In addition, this figure's accuracy and histogram plots show a promising performance that outperforms the results of the other optimization algorithms.

Sensitivity Analysis of the Proposed Approach
One-at-a-time (OAT) analysis was used to study the sensitivity of the proposed approach. OAT is one of the most straightforward sensitivity analysis methods: to test the algorithm's performance, one parameter at a time is changed while the other parameters remain fixed. As the values of the various parameters were varied, the convergence time and fitness values of ADSCFGWO changed accordingly (as presented in Tables 13-16). When evaluating each parameter, 20 values are selected from that parameter's interval by stepping through it in increments of 5% of the interval's length. The algorithm was run ten times for each value; the results are shown in the tables below. In total, ADSCFGWO was run 100 times for each parameter. One-way analysis of variance (ANOVA) is performed to assess the significance of the observed differences. While modifying the settings of ADSCFGWO, two ANOVA tests are applied, one to the convergence time and one to the fitness values. Table 17 shows the ANOVA test results for ADSCFGWO's convergence time and lowest fitness. Table 18 shows that the p-values are less than (0.05) and F is larger than the F-critical value. Therefore, there is a statistically significant difference between the means of each parameter's five groups of convergence time. Likewise, when each parameter's value is changed, a statistically significant difference can be seen between the means of all five minimal fitness groups. ANOVA does not tell which groups differ significantly; as a result, a post hoc analysis is carried out between each pair of groups using a one-tailed t-test with a significance threshold of (0.05). Table 19 presents the t-test results for the algorithm's parameters based on the convergence time and minimum fitness of ADSCFGWO. There is a statistically significant difference between groups, with p-values less than (0.05), according to the table.
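The OAT procedure described above can be sketched as a generic loop; the function names are hypothetical, and in the paper the objective would be a full ADSCFGWO run recording convergence time or fitness.

```python
def oat_sensitivity(objective, defaults, intervals, n_values=20, n_runs=10):
    """One-at-a-time sensitivity analysis skeleton.

    Varies one parameter at a time over its interval in 5% steps while
    the others stay at their defaults, averaging the objective over
    repeated runs. `objective` maps a parameter dict to a score.
    """
    results = {}
    for name, (lo, hi) in intervals.items():
        step = (hi - lo) * 0.05          # 5% of the interval's length
        rows = []
        for k in range(n_values):
            params = dict(defaults)
            params[name] = lo + k * step  # vary only this parameter
            scores = [objective(params) for _ in range(n_runs)]
            rows.append((params[name], sum(scores) / n_runs))
        results[name] = rows
    return results
```

Each entry of the returned table pairs a tested parameter value with the mean score over the repeated runs, which is the raw material for the ANOVA and t-tests described above.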
As for convergence time, the t-test between the exploration percentage and the mutation rate is not statistically significant at the (0.05) level, which suggests that no statistically significant difference exists between their impacts on the convergence time. The number of iterations and the mutation rate do not affect the minimal fitness. A visual representation of the sensitivity of the algorithm's parameters is given by the plots in Figure 8. In this figure, the residual and homoscedasticity plots show the stability of the parameters, while the QQ plot and heatmap show the robustness of the optimized parameters.

Figure 8. Residual, homoscedasticity, and QQ plots and heatmap of ADSCFGWO's parameters (r1, r2, r3, r4, A1, A2, A3, C1, and C2) based on convergence time.
The histogram depicted in Figure 9 shows the convergence time of the parameters of the proposed algorithm. In this figure, it can be noted that some parameters converge faster than others; for example, r3, r2, A2, and C1 converge faster than the other parameters. However, all the parameters converge within 12.4 s. In addition, the histogram of the convergence time of the proposed ADSCFGWO is depicted in Figure 10. A study of the sensitivity of the fitness of the proposed approach is also conducted, and the results are recorded in Tables 20-22 in terms of the ANOVA test, the Wilcoxon test, and the statistical analysis of the achieved results. It can be noted from these tables that the parameters of the proposed algorithm are significant, as the p-value < 0.0001. These results show the effectiveness of the analyzed parameters in solving the optimization problem.
Further investigation of the effectiveness of the parameters of the proposed approach is performed using the set of plots depicted in Figures 11-13. These figures show the significance of the parameters in the optimization problem and the convergence of the fitness.

Figure 11. Convergence fitness of ADSCFGWO's parameters (r1, r2, r3, r4, A1, A2, A3, C1, and C2).

Figure 12. Residual, homoscedasticity, and QQ plots and heatmap of ADSCFGWO's parameters (r1, r2, r3, r4, A1, A2, A3, C1, and C2) based on convergence fitness.

Figure 13. Histogram of convergence fitness of ADSCFGWO's parameters (r1, r2, r3, r4, A1, A2, A3, C1, and C2).

Discussion and Ranking of Parameters
The parameters of the proposed ADSCFGWO algorithm can be ordered according to their effect on the fitness values as follows: C2, A1, r4, r1, C1, A3, A2, r2, and r3. It is also possible to rank the parameters in order of their impact on the convergence time: r1, A2, C2, r3, r4, A1, C1, and r2. According to this ranking, r2 has the least impact on the algorithm's convergence time, while r1, A2, and C2 influence it most strongly. The convergence time of the ADSCFGWO algorithm is also sensitive to exploration percentages larger than 25%. In terms of fitness, C2 and A1 have the most significant impact on the algorithm's performance.

Discussion
A set of experimental setups is used to evaluate the effectiveness of the proposed methodology in identifying wheat/weed images. First, the positive results indicate the effectiveness of the features derived from the AlexNet architecture through transfer learning. The features retrieved from AlexNet are then used in a feature selection scenario. In this second scenario, the proposed ADSCFGWO algorithm proves both stable and dependable in identifying the best possible collection of features in a reasonable time. In addition, the Wilcoxon signed-rank test highlights the relevance of the proposed ADSCFGWO algorithm by demonstrating its statistical significance. On the other hand, further experiments demonstrate that the proposed optimized voting classifier outperforms the competing methods when classifying the input crop images, with a mean accuracy of (97.75%). Finally, a sensitivity analysis is carried out to ensure the proposed method is reliable. Testing and results show that the proposed technique is highly effective in classifying wheat/weed images.

Conclusions
This paper proposes a new approach to classifying wheat and weed in drone-captured images based on metaheuristic optimization and machine learning. The proposed approach is based on a new optimized voting classifier that can efficiently classify the features extracted using AlexNet. To boost the classification accuracy, the extracted features are filtered to select the significant features using a new binary optimization algorithm. The optimization of the voting classifier and the binary optimization algorithm developed for feature selection are based on the GWO and SCA optimization algorithms, combined in a new hybrid optimization algorithm referred to as the ADSCFGWO algorithm. The proposed voting classifier comprises three machine-learning models: NN, SVM, and KNN. The contribution of these classifiers to the final results is optimized using the proposed optimization algorithm. The efficiency of the proposed approach was evaluated using various metrics, including accuracy, precision, recall, false positive rate, and kappa coefficient. In addition, the ANOVA and Wilcoxon signed-rank tests are used to assess the reliability and validity of the proposed methodology. Moreover, a sensitivity analysis is carried out to investigate the impact of varying the parameters of the proposed approach on the observed outcomes. The proposed methodology was superior to existing optimization strategies in a series of experiments, with a detection accuracy of 97.70%, an F-score of 98.60%, a specificity of 95.20%, and a sensitivity of 98.40%. From the statistical analysis, the ANOVA and Wilcoxon signed-rank tests yielded a p-value of (p < 0.005), indicating the statistical significance of the proposed approach.