Image-to-Image Translation-Based Data Augmentation for Improving Crop/Weed Classification Models for Precision Agriculture Applications

Applications of deep-learning models in machine vision for crop/weed identification have remarkably improved the reliability of precise weed management. However, large amounts of suitable data are required to obtain the desired result from this highly data-driven operation. This study aims to curtail the effort needed to prepare very large image datasets by creating artificial images of maize (Zea mays) and four common weeds (i


Introduction
The potential of deep-learning algorithms has been demonstrated in almost all stages of agricultural activities, paving the way for efficient handling and non-destructive evaluation [1][2][3][4][5][6][7]. One of the agricultural domains that could benefit from these algorithms is weed management. It is well known that efficient weed control is an essential contributor to sustainable agriculture, as it can improve plant growth, yield, and quality while minimizing the need for weedicides. However, manual and traditional weed removal methods are labor-intensive and inefficient. In this regard, scholars have developed numerous deep-learning models based on convolutional neural networks (CNNs) to classify various crops and weed species [8][9][10][11]. Moreover, machinery based on machine vision has been developed to provide practical solutions for weed management [12][13][14][15]. Although deep-learning networks have enhanced the reliability of automated crop/weed classification algorithms, the technique depends on large amounts of data collected under various geographic conditions. Furthermore, a majority of in-field weed identification tasks require pixel-level annotations [16][17][18]. Overall, acquiring huge amounts of data and preparing the ground truth is a tedious task, especially for precision agriculture applications [19].
Though many open-source agriculture datasets have become available in recent years, the quality and amount of data do not meet the requirements of researchers [19,20]. In addition, models trained with such data fail to generalize and are not robust enough to be used in diverse practical environments [21]. One way to overcome these difficulties is to adopt geometric- and intensity-based image data augmentation [22]. In addition, when CNNs are employed for machine vision tasks, transfer learning is preferred [23,24], where a pre-trained deep-learning model is fine-tuned with an available dataset for a particular task [25]. This approach has been widely used for in-field weed identification [26][27][28]. For instance, Espejo-Garcia et al. developed a solution based on feature extraction from deep layers of various transfer-learned CNN models for automated crop and weed identification [26]. Chen et al. performed a similar study based on transfer learning for identifying weeds in cotton production systems [27]. Both of the above studies recorded classification accuracies greater than 95%. However, such traditional image augmentation techniques and transfer learning provide highly correlated images and only little additional information to the deep-learning model. This not only reduces the ability of the model to generalize but also leads to over-fitting problems.
In recent years, another advancement in deep learning, in the form of generative adversarial networks (GANs), has proven to be very efficient for data augmentation and image enhancement [29]. GANs can generate realistic artificial images using existing image data. The combination of these artificial and original images could enhance the development of subsequent models. GANs have been effectively applied to various tasks, such as human identification [30], organ segmentation [31], and emotion classification [32]. These models have also been used for machine-vision applications in agriculture, such as generating images of specific plants [33,34], plant disease recognition [35], grain quality analysis [4], and synthesizing images of plant seedlings [36]. A few studies have also utilized GANs to assist in deep-learning-based operations in precision weed management (Table 1). With numerous GAN architectures available, a performance comparison study was performed on different combinations of a GAN model and a CNN-based classification model for designing a crop/weed classification pipeline, tested on images of tomato crops and black nightshade [37]. The authors obtained the highest accuracy of 99.07% and concluded that GANs improve the classification performance of CNNs. A few other studies used GANs to generate multi-spectral images of crops and weeds [38]. In all the discussed works, GANs were used to synthesize the entire crop/weed/agricultural field image without any attention to the location and shape of the desired object in the image. It was observed that the generalizability of such networks over the texture- and morphology-based features of the target classes was sub-optimal.
As an improvement, in this study, we performed image generation using a conditional GAN (cGAN) based on the image-to-image translation concept [40]. The primary objective was to synthesize the images while preserving (conditioning on) the original footprint of the objects in the real image, such as the shape of the plants. The real images of a particular class, along with their pixel-wise labels, were combined and fed into the GAN model to train it and, eventually, to obtain the artificial images of the respective classes. The image synthesis network exploited here is similar to the pix2pix conditional adversarial network, a very commonly used model for image translation tasks [40]. Secondly, the validity of a classification task using the newly derived dataset was assessed for two commonly adopted techniques, i.e., transfer learning [26,28] and the feature extraction method [41][42][43][44]. For the above tasks, a pre-defined, state-of-the-art CNN architecture, AlexNet [45], was employed. In the feature extraction technique, features from deep layers of AlexNet were extracted to develop machine learning models using the support vector machine (SVM) and linear discriminant analysis (LDA) classifiers. Hence, the major objectives of this work are (i) the implementation of a cGAN as a data augmentation approach to synthesize realistic plant images, together with an analysis of cGAN performance, and (ii) the study of the combination of cGANs and classification algorithms for improving crop/weed species identification.

Table 1 (entries as recovered, one study per line).
Crop: Tomato; GAN type: Conventional GANs; Result: An F1-score of 0.86 was obtained when GAN-based augmentation was performed, compared to 0.84 without the artificial dataset; Reference: [37].
Objective: Generation of multi-spectral images of agricultural fields for semantic segmentation of crop/weeds; Crop: Sugarbeet; GAN type: Conditional GAN (cGAN); Result: The mean intersection over union (mIoU) value was improved to 0.98 from 0.94 for the background class and to 0.89 from 0.76 for vegetation; Reference: [38].
Objective: Artificial data were generated using UAV-acquired images for supporting crop/weed species identification at an early stage.

Dataset and Pre-Processing
The dataset consisted of five classes, including maize (Zea mays) and four weed species commonly identified in maize production systems, namely, Charlock (Sinapis arvensis), Fat Hen (Chenopodium album), Shepherd's purse (Capsella bursa-pastoris), and Small-flowered Cranesbill (Geranium pusillum). The dataset was derived from Kaggle's image data of crop and weed seedlings at different growth stages, a public image dataset offered by Giselsson et al. [46]. Each class contained 200 RGB images at various growth stages (5-8 weeks) and illumination effects. These images were manually segmented into binary masks at the pixel level using the Image Segmenter app of MATLAB R2020a and the Image Processing toolbox to distinguish the vegetation from the background. These semantic-segmented images were arranged in class-wise folders such that the sequence of images matched the corresponding real image folder. This allowed for the easier pairing of real and segmented images, which was necessary during the cGAN training process. Figure 1 shows some sample images from the dataset along with their binary-segmented counterparts. The images were resized to 256 × 256 pixels.
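The folder-based pairing of real and segmented images described above can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB workflow used in the study; the function name `pair_images`, the folder layout (one sub-folder per class under a real-image root and a mask root), and the accepted file suffixes are assumptions made for the example.

```python
from pathlib import Path


def pair_images(real_dir, mask_dir, suffixes=(".png", ".jpg", ".jpeg")):
    """Pair each real image with its binary mask by sorted order per class.

    Assumes (hypothetically) that both roots contain identically named
    class sub-folders and that sorting the files yields matching sequences.
    """
    real_dir, mask_dir = Path(real_dir), Path(mask_dir)
    pairs = []
    for class_dir in sorted(p for p in real_dir.iterdir() if p.is_dir()):
        reals = sorted(f for f in class_dir.iterdir()
                       if f.suffix.lower() in suffixes)
        masks = sorted(f for f in (mask_dir / class_dir.name).iterdir()
                       if f.suffix.lower() in suffixes)
        # A count mismatch means the sequences cannot be matched one-to-one.
        if len(reals) != len(masks):
            raise ValueError(f"Unpaired images in class {class_dir.name!r}")
        pairs.extend((class_dir.name, r, m) for r, m in zip(reals, masks))
    return pairs
```

Raising an error on a count mismatch is a deliberate choice here: silently truncating with `zip` would shift every later pair and corrupt the conditioning masks.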

Image Synthesis through GAN
The size of the dataset used here is small compared to the ones generally employed in learning-based machine-vision tasks. Hence, augmentation through adversarial networks was performed to increase the size of the dataset. Typically, a GAN architecture comprises a generator network that generates artificial images and a discriminator that aims to differentiate these artificial images from the real images [29]. Both components are trained simultaneously in an adversarial manner, in which the generator aims to fool the discriminator with its artificial images. The first proposed GAN models did not have control over any auxiliary information on the data being synthesized. Later, researchers introduced a conditional variable into the network's objective functions that constrained the network to a particular attribute in order to synthesize images with the desired features [47]. For instance, GANs were conditioned on text descriptions for text-to-image synthesis and on class labels to generate MNIST dataset digits [48]. The image-conditional GAN was first studied by Isola et al. [40] for image-to-image translations.
In cGANs, the generator and discriminator networks are conditioned on the class label y, i.e., the mapping to y is learned from the input image (or source image) x and the random vector z. The objective function can be given as [40]:

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))] \tag{1}$$

The cGAN architecture employed here is very similar to the model proposed in the original work on image-to-image translation, called the pix2pix GAN [40]. The model is trained with paired images, i.e., the real image and its binary analog, in order to learn to map the features of these images. The attributes of the output image are conditioned by the source images (here, the binary images act as the source images). Suppose $T \in \mathbb{R}^{w \times h}$ is the binary mask of an image with width w and height h pixels; the network's goal is to learn a mapping function that converts T into a photo-realistic image. Figure 2 shows the image generation workflow. The generator follows the U-Net framework [49], and the discriminator classifier is based on the PatchGAN [50]. The U-Net is an encoder-decoder network where the input is first down-sampled to a bottleneck layer and then up-sampled from this point. Moreover, skip connections (which concatenate the channels of the two layers) are added between the i-th and (n − i)-th layers (n is the total number of layers). The PatchGAN discriminator classifies every patch in the image as real or artificial, and the final output is determined by the average response. Overall, the generator model used here is a set of convolutional down-sampling layers and transposed convolutional up-sampling layers that are joined through a bottleneck layer. The discriminator consists of six convolutional layers, such that an 8 × 8 pixel patch is obtained at the end. From this patch, the binary classification result (real image or generated image) is acquired.
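To make the adversarial objective and the PatchGAN averaging concrete, the following is a minimal sketch in plain Python. It operates on an already computed 8 × 8 grid of per-patch "real" probabilities and does not implement the networks themselves. The function names are hypothetical, and the non-saturating generator loss (generated patches scored against the label "real") is an assumption based on common practice, not a detail stated in the text.

```python
import math


def patch_decision(patch_probs):
    """Average the per-patch 'real' probabilities of a PatchGAN output grid."""
    flat = [p for row in patch_probs for p in row]
    return sum(flat) / len(flat)


def bce(prob, target, eps=1e-12):
    """Binary cross-entropy for one probability, clamped for stability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def discriminator_loss(real_patches, fake_patches):
    """Mean BCE with real pairs labelled 1 and generated pairs labelled 0."""
    real = [bce(p, 1.0) for row in real_patches for p in row]
    fake = [bce(p, 0.0) for row in fake_patches for p in row]
    return sum(real) / len(real) + sum(fake) / len(fake)


def generator_adversarial_loss(fake_patches):
    """Assumed non-saturating form: generated patches scored against label 1."""
    fake = [bce(p, 1.0) for row in fake_patches for p in row]
    return sum(fake) / len(fake)


# Toy grids: a confident discriminator on an 8 x 8 patch output.
real_grid = [[0.9] * 8 for _ in range(8)]
fake_grid = [[0.1] * 8 for _ in range(8)]
d_loss = discriminator_loss(real_grid, fake_grid)
g_loss = generator_adversarial_loss(fake_grid)
```

With these toy grids the discriminator loss is small and the generator loss is large, which is the situation early in training when generated images are easy to tell apart from real ones.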
To monitor the fidelity of the generated images after each iteration, the t-distributed stochastic neighbor embedding (t-SNE) visualization is used. The t-SNE algorithm presents the similarities between the samples by iteratively comparing the probability distributions of the different data points in high- and low-dimensional spaces [51]. By applying t-SNE to the real and generated images, the similarities and variances of the images can be further analyzed. Once the training is complete, new images are generated and collected for analysis through the classifiers (see Sections 2.3 and 2.4). An Acer Nitro 5 Intel Core i5 9th Generation Laptop (32 GB/1 TB HDD/Windows 10 Home/GTX 1650 Graphics) was used to run the MATLAB application.

Classification through Transfer Learning
In this study, we focus on a popular CNN architecture, AlexNet [45], which was designed in the context of the "Large Scale Visual Recognition Challenge" (ILSVRC) [52] for the ImageNet dataset [53]. AlexNet comprises five convolution layers, three fully connected (FC) layers, and a Softmax layer. The first, second, and fifth convolution layers are followed by a max-pooling layer with a pool size of 3 × 3 and strides of 2 × 2. The convolution layers were furnished with half-padding and ReLU activation function layers. The details on the number of filters and the layer-wise operations are presented in Figure 3. To implement transfer learning, the last three layers of the network (an FC layer configured for 1000 classes, a Softmax layer, and the final classification layer) were replaced with an FC layer for 5 classes, followed by a Softmax layer and a classification layer, with their weights initialized through the Glorot normal method. In order to fit AlexNet's input size, the images were resized to 227 × 227 pixels. The evaluation was performed in two steps: first, the model was trained only with the real images, and then the real and artificial images were used together for training. Additional augmentations, such as image rotations, translations, and reflections along the x- and y-axes, were specified for both cases. Regarding the training options, stochastic gradient descent with momentum (SGDM) was chosen as the optimizer, with an initial learning rate of 0.001, a momentum of 0.9, and a weight decay factor of 0.0001. The training was limited to a maximum of 1000 epochs, with a mini-batch size of 32. The results of this transfer-learning model on the training and test sets are presented in Section 3.2.
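The role of the three optimizer settings quoted above (learning rate 0.001, momentum 0.9, weight decay 0.0001) can be illustrated with a textbook momentum update. This is a sketch under assumptions: the function name `sgdm_step` is hypothetical, weight decay is modelled as an L2 term added to the gradient, and the exact update implemented by the MATLAB optimizer may be written differently.

```python
def sgdm_step(weights, grads, velocity,
              lr=0.001, momentum=0.9, weight_decay=0.0001):
    """One SGD-with-momentum update over flat lists of parameters.

    Assumed textbook form:
        v <- momentum * v - lr * (grad + weight_decay * w)
        w <- w + v
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g_total = g + weight_decay * w          # gradient plus L2 penalty term
        v_next = momentum * v - lr * g_total    # decayed history plus new step
        new_velocity.append(v_next)
        new_weights.append(w + v_next)
    return new_weights, new_velocity


# Two consecutive steps on a single toy parameter with a constant gradient.
w, v = [1.0], [0.0]
w, v = sgdm_step(w, [0.5], v)
w, v = sgdm_step(w, [0.5], v)
```

With a constant gradient, the second step is larger than the first because the velocity accumulates, which is the effect the momentum term is meant to provide.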

Classification through Feature Extraction Technique
The convolutional layers in a CNN summarize the features associated with each class through a set of filters, carrying the aspects of the input image to the subsequent layers [54]. In the feature extraction method, the features were derived from the deep layers of a CNN, and a machine learning-based model was developed based on these features [55]. An activation map was derived from the first convolution layer of the CNN and is represented in Figure 4. In this study, the features from the last pooling layer of AlexNet (pool5 layer) were extracted, which provided a vector of 9216 features. Because this feature map is very high-dimensional, principal component analysis was applied to select only the components that explained 97% of the total variance. The entire workflow is depicted in Figure 5. After deriving these features, two classifiers, namely, SVM and LDA, were adopted for classification purposes. These classifiers were chosen due to their strong performance on many agricultural datasets compared with other machine learning algorithms [42,56]. The performance of the developed models was analyzed using precision, recall, and F1-score metrics, given by:

$$\text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}} \tag{2}$$

$$\text{Recall} = \frac{\text{True positives}}{\text{Actual number of samples}} \tag{3}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
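The three metrics can be computed per class from a confusion matrix, as in the sketch below. The function name `per_class_metrics` and the two-class toy matrix are illustrative assumptions; the matrix is not taken from the study's results.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, F1) per class.

    confusion[i][j] is the number of samples of true class i
    that were predicted as class j.
    """
    n = len(confusion)
    results = []
    for c in range(n):
        tp = confusion[c][c]
        actual = sum(confusion[c])                           # samples truly in c
        predicted = sum(confusion[r][c] for r in range(n))   # samples predicted c
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


# Hypothetical two-class example: rows are true classes, columns predictions.
toy = [[8, 2],
       [1, 9]]
metrics = per_class_metrics(toy)
```

Averaging the per-class values gives macro-averaged scores; whether the reported figures are macro- or micro-averaged is not stated in the text.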
Algorithms 2022, 15, x FOR PEER REVIEW 7 of 18

Support Vector Classification
Support vector machines (SVMs) have been widely used as classifiers for weed identification. Wu and Wen [57] performed crop/weed classification on a dataset of images of maize crops and four weed species using an SVM on image color and texture features. Later, they also included shape features in the SVM model and tested its performance using three different kernel functions (polynomial, sigmoid, and RBF) [58]. According to Wong et al. [59], multi-class classification using SVMs generates the best probabilistic output. They trained an SVM model to differentiate the monocotyledon weeds, Ageratum conyzoides, and Amaranthus palmeri weeds from other weeds for selective spraying. Many other studies have also utilized different versions of SVMs and discussed their advantages [60,61].
In SVM, the classification is performed by identifying a hyper-plane that separates the classes well. The algorithm aims to maximize the minimum distance between a point and the discriminating hyper-plane [57]. In this study, the radial basis function (RBF) was used to transform the feature space. This function computes the element (i, j) of the Gram matrix G as:

$$G(x_i, x_j) = \exp\left(-\lVert x_i - x_j \rVert^2\right) \tag{5}$$

where x_i and x_j are the i-th and j-th observations of the training set.
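A Gram matrix for the RBF kernel can be built directly from its definition, as sketched below in plain Python. The scale parameter `gamma` is an assumption added for generality (the common form exp(−γ‖x_i − x_j‖²)); with `gamma=1.0` each entry depends only on the squared Euclidean distance between two observations. The function name `rbf_gram` and the sample points are hypothetical.

```python
import math


def rbf_gram(X, gamma=1.0):
    """Gram matrix with G[i][j] = exp(-gamma * ||X[i] - X[j]||^2)."""
    n = len(X)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between observations i and j.
            sq_dist = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            G[i][j] = math.exp(-gamma * sq_dist)
    return G


# Three toy observations with two features each.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
G = rbf_gram(points)
```

The resulting matrix is symmetric with ones on the diagonal, and entries shrink toward zero as observations move apart, which is why the kernel acts as a similarity measure.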

Linear Discriminant Analysis
Discriminant analysis is based on the principle that different classes generate data according to different Gaussian (multivariate normal) distributions. Being a supervised technique, it collects information from all the variables and draws a new decision boundary so that the classification outcome is at its best. In LDA, the attributes are assumed to follow a Gaussian mixture distribution with different means but a common covariance matrix. To recall, this matrix contains the variances of the data along the diagonal and the covariances in the corresponding off-diagonal elements. The center of each distribution is determined by its mean, and the shape is determined by the covariance matrix. Once the distributions are fitted, the boundaries are estimated by determining the points between them where the class probabilities are equal.
Assuming there are C classes (all having a multivariate normal distribution), let Σ be the common covariance matrix and µ_c (c = 1, 2, ..., C) the mean vector of the distribution of the samples in class c. If x_{i,c} is the i-th sample in class c, the objective of LDA is to assign this observation to the class ĉ that minimizes a discriminant function h. The mathematics and computations behind discriminant analysis and its regularized version can be further explored in [62,63]. These classifiers have also been extensively used for classification tasks in precision agriculture applications [56,64,65].
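For orientation, a standard textbook form of the LDA discriminant under a shared covariance matrix is sketched below. It is an assumption that the function h used in this work takes this form; in particular, the class prior π_c is a symbol introduced here and does not appear in the text.

```latex
% Assumed textbook form: squared Mahalanobis distance to the class mean,
% corrected by the (assumed) class prior \pi_c.
h_c(x) = (x - \mu_c)^{\top} \Sigma^{-1} (x - \mu_c) - 2 \ln \pi_c ,
\qquad
\hat{c} = \arg\min_{c \in \{1, \dots, C\}} h_c(x_{i,c})
```

When all classes are equally likely, the prior term is constant and the rule reduces to assigning the sample to the class whose mean is nearest in Mahalanobis distance.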

Evaluation of Generated Images
Before evaluating the results of the classification task, we assessed the fidelity of the generated images. The real and generated images for each class are shown in Figure 6. One can see that after around 60 iterations, the model started producing plausible artificial images. To give a fair impression of the similarity between the images, t-SNE visualization was adopted for 100 real and generated images of each class (see Figure 7). This dimension-reduction technique was used to plot the data points in a two-dimensional plot. Some outliers were identified in the t-SNE plot for Charlock. However, the synthetic Charlock images closely exhibit the shape and color features of the original images. For the other classes, a similar distribution of the points corresponding to the actual and artificial images indicated that pertinent features were adequately learned and reproduced by the GAN. The artificial images preserved the key features of the real images and widened the coverage of the training dataset. Hence, realistic images were generated with the help of the GAN, which could augment the existing crop/weed dataset. The advantages of GAN-based augmentation include a reduced annotation workload, since the generated images can be associated with the same segmentation masks created earlier. Moreover, the classification model can generalize better when trained with a dataset comprising GAN-generated images, especially on shape-based features. Apart from this, GANs can also be used to enhance image clarity, which was observed in the case of some real images, especially of the maize crop. The GAN-synthesized images have the potential to replace erroneous or poor-quality real data. In addition, some real images contained irrelevant objects in the background (such as the labels; see the image of Fat Hen in Figure 6), and the model was successful in replacing them with the ground appearance, thus exhibiting its potential to create a variety of environmental and background conditions.
One drawback of this GAN model is its inability to learn and reproduce textural features, though it performs very well in acquiring the shape and color attributes. Notably, in the images of Shepherd's purse, where the textural appearance of the weed was quite important, the model could not reproduce it in the artificial images. This might pose a problem for classification when the crops and weeds have a similar physical appearance. For further operations, 200 images were generated for each class through the developed GAN to boost the training dataset.

Performance Analysis of Transfer Learning Method
To compare the transfer learning approach with and without GAN-based data augmentation, the results of the AlexNet model trained using only the real images were compiled first. The dataset was geometrically augmented by random rotations, translations, and reflections. Later, the model was trained again from the initial condition with both the real and generated images to analyze the effectiveness of image data augmentation through the GAN. To this end, 200 new images were generated for each class to support the training set. Briefly, the combined dataset utilized for the final model had 2000 images in total (400 images per class), out of which 75 real images from each class were reserved for testing purposes. Table 2 summarizes the image distribution for training and testing. After training with the original (real) dataset, the CNN model produced a classification F1-score of 0.970. After adding artificial images, the F1-score of the CNN (denoted as GAN-TL) improved to 0.986. The classification results on the test set are recorded in Table 3. In addition, the accuracy on the test data improved to 98.40% from the previously attained 97.07% (without GAN augmentation). A notable increase in the performance metrics was observed for Shepherd's Purse and Fat Hen, while the results remained unchanged for the maize and Cranesbill classes (Table 3). Overall, image augmentation with the help of a conditional GAN resulted in an improved classification result through the transfer learning method.

The feature extraction-based classification models were developed using the activations derived from the last pooling layer (pool5) of the CNN. Again, the models were developed in two stages, first on the dataset of real images and then on the combined dataset. Since the feature vector obtained from AlexNet had 9216 activations, it offered a wide range of features for classification. The main reason for choosing AlexNet over other state-of-the-art models was its small convolution kernel sizes and network architecture, which supported the extraction of fine-grained details in the images. The performance of the models trained through the SVM and LDA classifiers was compared.
The classification results on the test data are recorded in Tables 4 and 5, which contain the mean precision, recall, and F1 scores for the five independent runs. The overall accuracy registered by the LDA (GAN-LDA) and SVM (GAN-SVM) classifiers was 96.0%. In the training data, LDA performed slightly better than SVM (94.3% and 92.4%, respectively). As anticipated, the synthetic images enhanced the performance of both classifiers. The F1-score of the SVM model increased from 0.935 to 0.960, and that of the LDA model increased from 0.943 to 0.959. Tables 4 and 5 demonstrate that the GAN-based augmentation method can boost the performance of different classifiers, especially when they are developed using a limited dataset. Furthermore, Figure 8 presents the best testing confusion matrices obtained with the original and GAN-augmented images. In the case of LDA, the performance of certain classes, such as maize and Charlock, did not change much on applying GAN-based augmentation. However, the results of classes such as Cranesbill and Fat Hen improved significantly. This is because Cranesbill and Fat Hen are relatively more complicated in shape, so the network requires more data to learn their features. In contrast, the features of maize and Charlock are simple and distinct; hence, these classes are easier for the classifiers to separate. From the F1 scores of all the classes, it can be observed that the GAN-based image augmentation provided more information and enhanced the performance of both the transfer learning and the feature extraction techniques for crop/weed classification.
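For readers who want to reproduce this kind of summary, the following minimal sketch shows one common way to obtain per-class precision, recall, and F1 scores from a confusion matrix and to average them over several independent runs. It is not the authors' implementation: the function names are hypothetical, the confusion matrices are made-up placeholders, and the use of an unweighted (macro) average across classes is an assumption, since the text does not state how the class-wise scores were combined.

```python
def per_class_metrics(cm):
    """Return (precision, recall, f1) per class.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted as another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


def macro_average(metrics):
    """Unweighted mean of precision, recall, and F1 across classes."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics) / n for i in range(3))


def mean_over_runs(confusion_matrices):
    """Average the macro scores over independent runs."""
    per_run = [macro_average(per_class_metrics(cm)) for cm in confusion_matrices]
    return tuple(sum(run[i] for run in per_run) / len(per_run) for i in range(3))


if __name__ == "__main__":
    # Placeholder 3-class matrices for two runs; these are NOT results from the study.
    runs = [
        [[9, 1, 0], [0, 10, 0], [1, 0, 9]],
        [[10, 0, 0], [1, 9, 0], [0, 1, 9]],
    ]
    precision, recall, f1 = mean_over_runs(runs)
    print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Reporting the mean over runs, as done in Tables 4 and 5, reduces the influence of a single favorable or unfavorable random split on the comparison between the original and GAN-augmented models.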
In previous works, classification accuracies greater than 90% have been achieved using SVM and LDA classifiers for crop/weed classification. Accuracies between 92 and 95% were achieved using SVM on color and texture features for identifying four common weed seedlings in maize production systems [57]. When morphological features were added to the feature space, an improved accuracy of 96.5% was obtained using RBF-SVM [58]. In another study, local binary pattern-based texture features yielded a 98.5% accuracy with RBF as the kernel function [66]. Siddiqi et al. used stepwise LDA to classify weeds into three classes: broad weed, narrow weed, and other weed species [67]. Their method achieved 98.1% overall accuracy on a database of 1200 images. In the case of deep-learning-based classification, most studies used a transfer learning approach rather than training the CNN from scratch. For identifying weeds in cotton and tomato fields, the performances of seven state-of-the-art CNNs were evaluated [26]. All the models registered classification F1-scores greater than 88%.
Moreover, the fine-tuning method was compared with a feature-extraction approach for all the adopted CNNs [26]. The authors observed that most of the networks gave better results through the feature-extraction approach, which is a similar inference to the one drawn in this study. Similarly, the AlexNet CNN architecture was transfer-learned with potato and sugar beet plant image datasets for binary classification [68]. The model's accuracy was 98.0%, with an average prediction time of less than 0.1 s, supporting real-time applications. In a larger evaluation, Chen et al. assessed 35 CNN architectures for classifying 15 weed species in cotton production systems, ten of which achieved an F1 score greater than 95% [27]. These results show that the classification models in this study have provided performances comparable to those previously developed.
Overall, the results indicate that data augmentation through GANs can expand the training resources available to classifiers, enabling researchers to develop better imaging-based predictors. The authors believe that the proposed methodology can revolutionize intelligent crop/weed classifiers. An interesting topic for future work could be to examine the capability of the proposed approach in other machine vision-based applications, such as fruit maturity detection [69,70], fruit grading [71], agri-food product microstructural evaluation [65,72,73], crop disease identification [74], and crop growth and yield monitoring [75][76][77].

Conclusions
This study explored the potential of cGAN-based data augmentation techniques for improving imaging-based crop/weed classification. Using cGAN, artificial images were generated to double the training data of the available classes. The t-SNE method was used for the fidelity inspection of the new images, and the t-SNE plots showed high similarities between the feature distributions of real and artificial images. The performance of crop/weed classification with and without the artificial images was examined via two approaches, viz. transfer learning and feature extraction. The obtained results confirmed the capability of the cGAN-based technique to improve the performance of crop/weed classifiers. Overall, this study opens a new pathway for implementing GANs, not only for crop/weed classification but also for the development of other machine vision-based precision agriculture systems.

Figure 2. Training procedure for image generator through L1 and GAN loss functions.

Figure 4. Visualization of activations of the first Conv-layer of AlexNet.

Figure 5. Workflow for the artificial image synthesis through adversarial network and crop/weeds classification.

Figure 6. Sample ground truth images and generated images at different epochs during GAN training. Column-wise from left to right: Charlock, Fat Hen, Shepherd's Purse, Small-flowered Cranesbill, Maize.

Table 1. Summary of previous studies on the application of GANs for crop/weed identification tasks.

Table 2. Summary of data distribution for each class.

Table 3. Analysis of the classification results based on the transfer learning method. TL and GAN-TL refer to the models trained with the real image data and with the combined real and artificial datasets, respectively.

Table 4. Analysis of classification results of SVM on deep features of AlexNet. SVM and GAN-SVM refer to the SVM models trained with the real image data and with the combined real and artificial datasets, respectively.

Table 5. Analysis of classification results of LDA on deep features of AlexNet. LDA and GAN-LDA refer to the LDA models trained with the real image data and with the combined real and artificial datasets, respectively.