Vision Transformers For Weeds and Crops Classification Of High Resolution UAV Images

Crop and weed monitoring is an important challenge for agriculture and food production nowadays. Thanks to recent advances in data acquisition and computation technologies, agriculture is evolving towards smarter, more precise farming to meet the demand for high-yield and high-quality crop production. Classification and recognition in Unmanned Aerial Vehicle (UAV) images are important phases for crop monitoring. Advances in deep learning models relying on Convolutional Neural Networks (CNNs) have achieved high performance in image classification in the agricultural domain. Despite the success of this architecture, CNNs still face many challenges, such as high computational cost and the need for large labelled datasets. The transformer architecture from natural language processing can be an alternative approach to deal with CNNs' limitations. Making use of the self-attention paradigm, Vision Transformer (ViT) models can achieve competitive or better results without applying any convolution operations. In this paper, we adopt the self-attention mechanism via ViT models for the classification of weeds and crops: red beet, off-type beet (green leaves), parsley and spinach. Our experiments show that, with a small set of labelled training data, ViT models perform better than the state-of-the-art CNN-based models EfficientNet and ResNet, with a top accuracy of 99.8\% achieved by the ViT model.


I. INTRODUCTION
Agriculture is at the heart of scientific evolution and innovation, facing major challenges to achieve high-yield production while protecting plant growth and quality to meet the anticipated demands of the market [1]. However, a major problem arising in modern agriculture is the excessive use of chemicals to boost the production yield and to get rid of unwanted plants such as weeds [2]. Weeds are generally considered harmful to agricultural production [3]: they compete directly with crop plants for water, nutrients and sunlight [4]. Herbicides are often used in large quantities, sprayed all over agricultural fields, which has raised various concerns such as air, water and soil pollution, and promotes weed resistance to such chemicals [2]. If the rate of herbicide usage remains the same, weeds will in the near future become fully resistant to these products and eventually destroy the harvest [5]. This is why weed and crop control management is becoming an essential field of research [6]. An automated crop monitoring system is a practical solution that can be beneficial both economically and environmentally. Such a system can reduce labour costs by making use of robots to remove weeds, hence minimising the use of herbicides [7]. The foremost step towards an automatic weed control system is the detection and mapping of weeds on the field, which can be challenging as weeds and crop plants often have similar colours, textures and shapes [4]. The use of Unmanned Aerial Vehicles (UAVs) has yielded significant results for mapping weed density across a field by collecting RGB images ([8], [9], [10], [11], [12]) or multispectral images ([13], [14], [15], [16], [17]) covering the whole field.
As UAVs fly over the field at an elevated altitude, the captured images cover a large ground surface area; these large images can be split into smaller tiles to facilitate their processing ([18], [19], [20]) before feeding them to learning algorithms that identify and distinguish weeds from crop plants.
In the agricultural domain, the main approach to plant detection is to first extract vegetation from the image background using segmentation and then distinguish crops from weeds [21]. Common segmentation approaches use multispectral information to separate the vegetation from the background (soil and residuals) [22]. However, weeds and crops are difficult to distinguish from one another, even using spectral information, because of their strong similarities [23]. This point has also been highlighted in [6], in which the authors reported the importance of using both spectral and spatial features to identify weeds in crops. Recently, deep learning (DL) has become an essential approach to image classification, object detection and recognition [24], [25], notably in the agricultural domain [26]. DL models with CNN-like architectures, which have dominated computer vision tasks so far, have been the standard in image classification and object detection [27], [28], [29]. A CNN applies convolutional filters to an image to extract the features needed to understand the object of interest, with convolution operations providing key properties such as local connectivity, parameter (weight) sharing and translation equivariance [30], [24]. Most papers covering weed detection or classification make use of CNN-based model structures [31], [32], [33] such as AlexNet [28], VGG-19, VGG-16 [34], GoogLeNet [35], ResNet-50, ResNet-101 [29] and Inception-v3 [36].
On the other hand, the attention mechanism has seen rapid development, particularly in natural language processing (NLP) [37], and has shown spectacular performance gains compared to the previous generation of state-of-the-art models [38]. In vision applications, the use of the attention mechanism has been much more limited, due to the high computational cost: the number of pixels in an image is much larger than the number of word units in NLP applications, making it prohibitively expensive to apply standard attention models directly to pixels. A recent survey of applications of transformer networks in computer vision can be found in [39]. The recently proposed Vision Transformer (ViT) appears to be a major step towards adopting transformer-attention models for computer vision tasks [40]. Considering image patches, rather than individual pixels, as units of information for training is a groundbreaking departure from CNN-based models. ViT embeds image patches into a shared space and learns the relations between these patches using self-attention modules. Given massive amounts of training data and computational resources, ViT was shown to surpass CNNs in image classification accuracy [40]. To our knowledge, vision transformer models have not yet been used for weed and crop classification of high-resolution UAV images.
In this paper, we adopt the self-attention paradigm via vision transformers for the classification of images of weeds and different crops: red leaves beet, green leaves beet, parsley and spinach, taken by a high resolution digital camera mounted on a UAV. We make use of the self-attention mechanism on small, labelled plant images using the convolution-free ViT model, showing that it outperforms the current state-of-the-art CNN-based models ResNet and EfficientNet. Furthermore, we show that when training with fewer labelled images, the ViT model performs better than the CNN-based models. The rest of the paper is organised as follows. Section 2 presents the materials and methods used, as well as a brief description of the self-attention mechanism and the vision transformer model architecture. Section 3 describes the experimental protocol. The experimental results and analysis are presented in Section 4, and Section 5 presents the conclusion and work perspectives.

II. MATERIALS AND METHODS
This part outlines the acquisition, preparation and manual labelling of the dataset acquired from a high resolution camera mounted on a UAV. It also presents a brief description of the self-attention paradigm and the vision transformer model architecture.

A. Image collection and annotation
The study area comprises agricultural fields of beet, parsley and spinach located in the Centre-Val de Loire region of France. The region offers many pedo-climatic advantages: limited rainfall and clay-limestone soils with good filtering capacity. Irrigation is also available on 95% of the plots.
To survey the study areas, a "Starfury" UAV from Pilgrim Technologies was equipped with a Sony ILCE-7R 36-megapixel camera, as shown in Figure 1. The camera is mounted on a 3-axis stabilised brushless gimbal in order to keep the camera axis stable even during strong winds. The drone was flown at an altitude of 30 m over the beet field in a light morning fog, and at an altitude of 20 m in sunny weather over the parsley and spinach fields. The drone followed a specific flight plan and the camera captured RGB images at regular intervals, as shown in Figure 2. The captured images have a minimum longitudinal overlap of 70% and a lateral overlap of 50-60%, depending on the field's vegetation coverage and homogeneity, ensuring complete coverage of the whole 4 ha field (40 000 m²) and improving the accuracy of the field orthophoto generation. The data was then manually labelled using bounding boxes with the tool LabelImg (https://github.com/tzutalin/labelImg) and divided into 5 classes, as shown in Figure 4 and Table I below. The labelled dataset was organised into 5 folders, one per class label, each containing the number of images presented in Table I: 16.9% off-type beet plants (obtained by data augmentation, flips and rotations, from 765 labelled images) and 20.8% for each of the four other classes, for a total of 19 265 images of size 64x64 pixels. The dataset was then divided into training, validation and testing sets as shown in Figure 11.

B. Image preprocessing
Due to the high labour cost of supervised training (manual labelling), artificial data augmentation was used in the experiment to generate additional images and increase the amount of data, addressing the problem of an insufficient agricultural image dataset of weeds and crops. Image data augmentation is used not only to expand the training dataset, attempting to cover real-world scenarios, but also to improve the performance and generalisation ability of the model on agricultural images, as these can vary considerably depending on the soil, environment, season and climate conditions.
As a data preprocessing method, data augmentation plays an important role in deep learning [41]. In general, effective data augmentation improves the robustness of the model and yields stronger generalisation capabilities. We thus employed data augmentation strategies that are widely used in practice to enrich the dataset. After normalising each image, the following steps were applied: random rotation, random resized crop, random horizontal flip, colour jitter and RandAugment (Cubuk et al.). These transformations are implemented using the keras ImageDataGenerator, generating augmented images on the fly while the model is training.
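For illustration, a minimal numpy stand-in for the flip/rotation part of such a pipeline is sketched below. The helper name augment, the probabilities and the random tile are illustrative assumptions, not the paper's exact keras ImageDataGenerator configuration.

```python
import numpy as np

def augment(img, rng):
    """Apply a random flip and a random 90-degree rotation to one image.

    A minimal numpy stand-in for part of a keras ImageDataGenerator pipeline.
    """
    if rng.random() < 0.5:
        img = np.fliplr(img)                    # random horizontal flip
    if rng.random() < 0.5:
        img = np.flipud(img)                    # random vertical flip
    img = np.rot90(img, k=rng.integers(0, 4))   # rotate by 0/90/180/270 degrees
    return img

rng = np.random.default_rng(42)
tile = rng.random((64, 64, 3))                  # one 64x64 RGB plant tile
# generate a batch of augmented variants on the fly, as during training
batch = np.stack([augment(tile, rng) for _ in range(8)])
print(batch.shape)  # (8, 64, 64, 3)
```

Flips and right-angle rotations only permute pixels, so every augmented tile keeps the original pixel statistics while presenting the plant in a new orientation.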

C. Self attention for weeds detection
The attention mechanism is becoming a key concept in the deep learning field [43]. Attention was inspired by the human perception process, in which humans tend to focus on parts of the available information while ignoring other perceptible parts at the same time. The attention mechanism has had a profound impact on the field of natural language processing, where the goal was to focus on a subset of important words. The self-attention paradigm emerged from the concept of attention, showing improvements in the performance of deep networks [38].
Let us denote a sequence of n entities (x_1, x_2, ..., x_n) by X ∈ R^(n×d), where d is the embedding dimension used to represent each entity. The goal of self-attention is to capture the interaction amongst all n entities by encoding each entity in terms of the global contextual information. This is done by defining three learnable weight matrices W^Q ∈ R^(d×d_q), W^K ∈ R^(d×d_k) and W^V ∈ R^(d×d_v), which project the input X onto queries Q = XW^Q, keys K = XW^K and values V = XW^V. The attention matrix of scores between the n queries Q and the transposed keys K^T indicates which part of the input sequence to focus on:

A = σ(QK^T / √d_k)  (1)

where σ is an activation function, usually softmax(), and d_k is the dimension of the input queries. To capture the relations among the input sequence, the values V are weighted by the scores from Equation 1, resulting in the output in R^(n×d_v) [40]:

Attention(Q, K, V) = softmax(QK^T / √d_k) V  (2)
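The two equations above can be sketched in a few lines of numpy; the dimensions and random weight matrices below are illustrative, not learned parameters.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention (Equations 1 and 2)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # project entities to Q, K, V
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # Equation 1: n x n score matrix
    return A @ V, A                               # Equation 2: score-weighted values

rng = np.random.default_rng(0)
n, d, dk, dv = 4, 8, 8, 8                         # 4 entities, embedding dim 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, s)) for s in (dk, dk, dv))
out, A = self_attention(X, Wq, Wk, Wv)
print(out.shape, A.shape)  # (4, 8) (4, 4)
```

Each row of A sums to 1 (softmax over the keys), so each output row is a convex combination of the value vectors of all entities, which is exactly the "global context" property discussed above.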
If each pixel in a feature map is regarded as a random variable and the pairwise covariances are calculated, the value of each predicted pixel can be enhanced or weakened based on its similarity to other pixels in the image. This mechanism of employing similar pixels in training and prediction while ignoring dissimilar ones is called the self-attention mechanism. It helps to relate different positions of a single sequence of image patches in order to obtain a more vivid representation of the whole image [44].
The transformer network is an extension of the attention mechanism from Equation 2 based on the Multi-Head Attention operation. It runs h self-attention operations, called "heads", in parallel and projects their concatenated outputs [38]. This helps the transformer jointly attend to the different information derived from each head. The output matrix is obtained by concatenating the attention heads and multiplying by the weight matrix W^O, generating the output of the multi-head attention layer. The overall operation is summarised by the equations below [38]:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O  (3)

head_i = Attention(QW^Q_i, KW^K_i, VW^V_i)  (4)

where W^Q_i, W^K_i, W^V_i are the weight matrices for the queries, keys and values of head i respectively, and W^O ∈ R^(hd_v×d_model). By using the self-attention mechanism, a global receptive field can be realised during the training and prediction of models. This helps to considerably reduce the training time needed to achieve high accuracy [40]. The self-attention mechanism is an integral component of transformers, which explicitly models the interactions between all entities of a sequence for structured prediction tasks. Basically, a self-attention layer updates each component of a sequence by aggregating global information from the complete input sequence. While a convolution layer's receptive field is a fixed K × K neighbourhood grid, the self-attention receptive field is the full image. Self-attention thus increases the receptive field compared to the CNN without the computational cost associated with very large kernel sizes [45]. Furthermore, self-attention is invariant to permutations and to changes in the number of input points. As a result, it can easily operate on irregular inputs, as opposed to standard convolution, which requires a grid structure [39]. The attention weights can be visualised by averaging over all heads, both across layers and within the same layer; each head produces one attention pattern (or attention matrix). When a patch of a weed image is passed through the transformer, self-attention computes how much attention that patch should pay to the others (patch 2, patch 3, ...). Each head has its own attention pattern, as shown in Figure 6, and the patterns of all heads are summed. We can observe that the model tries to identify the object (weed) in the image and focuses its attention on it, as it stands out from the background.
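A minimal numpy sketch of multi-head attention follows; the head count, dimensions and random weight matrices are illustrative stand-ins for learned parameters.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, Wo):
    """Run each head's scaled dot-product attention, concatenate, project with W^O."""
    outs = []
    for Wq, Wk, Wv in heads:                        # one (W^Q_i, W^K_i, W^V_i) per head
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outs.append(A @ V)                          # head_i
    return np.concatenate(outs, axis=-1) @ Wo       # Concat(head_1, ..., head_h) W^O

rng = np.random.default_rng(1)
n, d_model, h, dv = 4, 16, 4, 4                     # d_v = d_model / h, as is conventional
heads = [tuple(rng.normal(size=(d_model, dv)) for _ in range(3)) for _ in range(h)]
Wo = rng.normal(size=(h * dv, d_model))             # W^O maps (h * d_v) back to d_model
out = multi_head_attention(rng.normal(size=(n, d_model)), heads, Wo)
print(out.shape)  # (4, 16)
```

Because each head has its own projections, the heads can attend to different relations between patches before their outputs are merged by W^O.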
An attention mechanism is applied to selectively give more importance to some locations of the image compared to others, as originally proposed for generating captions corresponding to an image. Consequently, this helps to focus on the main differences between weeds and crops in an image and improves the model's ability to identify the contrasts between these plants. This mechanism also helps the model to learn features faster, eventually decreasing the training cost [40].

D. Vision Transformers
Transformer models were a major headway in natural language processing (NLP). They became the standard for modern NLP tasks and brought spectacular performance gains compared to the previous generation of state-of-the-art models [38]. Recently, the architecture was introduced to computer vision and image classification, aiming to show that the reliance on CNNs is no longer necessary for object detection or image classification and that a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks [40]. For an input image x ∈ R^(H×W×C) and patch size P, N = HW/P² image patches are created, where N is the sequence length (number of tokens), similar to the words of a sentence, (H, W) is the resolution of the original image and C is the number of channels [40].
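The patch construction just described, together with the linear projection, [class] token and position embeddings that complete the ViT input pipeline, can be sketched in numpy. All dimensions here (P = 16, a toy embedding size) and the random matrices standing in for learned parameters are illustrative.

```python
import numpy as np

def patchify(img, P):
    """Split an (H, W, C) image into N = H*W/P^2 flattened patches of size P*P*C."""
    H, W, C = img.shape
    blocks = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(-1, P * P * C)            # (N, P^2 * C)

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))                       # one 64x64 RGB tile, as in the dataset
P, d_model = 16, 48                                 # toy embedding dimension
x_p = patchify(img, P)                              # N = 64*64 / 16^2 = 16 patches
E = rng.normal(size=(P * P * 3, d_model))           # linear projection E (random here)
cls = np.zeros((1, d_model))                        # learnable [class] token (zeros here)
E_pos = rng.normal(size=(x_p.shape[0] + 1, d_model))  # position embeddings (random here)
z0 = np.concatenate([cls, x_p @ E]) + E_pos         # transformer input sequence
print(x_p.shape, z0.shape)  # (16, 768) (17, 48)
```

The resulting sequence z0 of N + 1 token embeddings is what the transformer encoder consumes; after encoding, the output at the [class] position is fed to the MLP head for classification.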
Afterwards, each flattened patch is fed into a linear projection layer that produces the "patch embedding". A single matrix, denoted E (embedding), is used for the linear projection. A patch is first unrolled into a linear vector, as shown in Figure 8; this vector is then multiplied with the embedding matrix E and the result is fed to the transformer, along with the positional embedding. In the fourth phase, the position embeddings are added to the sequence of patch embeddings so that the patches retain their positional information: they inject information about the relative or absolute position of the image patches in the sequence. The next step is to prepend an extra learnable (class) embedding to the sequence. This class embedding is used to predict the class of the input image after being updated by self-attention. Finally, the classification is performed by stacking a multilayer perceptron (MLP) head on top of the transformer, at the position of the extra learnable class embedding.

III. EXPERIMENTAL SETUP

All models were trained using the same parameters in order to obtain an unbiased and reliable comparison of their performance. The initial learning rate was set to 0.0001 with a reducing factor of 0.2. The batch size was set to 8 and the models were trained for 100 epochs, with early stopping after 10 epochs without improvement. The models used, ViT-B16, ViT-B32, EfficientNet B0, EfficientNet B1 and ResNet 50, were loaded from the keras library with pre-trained "ImageNet" weights.

A. Cross Validation
The experiments were carried out using a cross-validation technique to ensure the integrity and accuracy of the models. We used stratified K-fold with a varying proportion of training and validation folds: the dataset was divided into folds, a certain percentage of the data being used as a validation set and the rest as a training set. Stratification ensures that each fold contains the same proportion of each label; stratified K-fold is thus one of the best approaches for ensuring that the data is well balanced and shuffled in each fold before splitting into validation and training sets. The data was first shuffled randomly and divided equally into 5 folds, each containing an equal number of samples per class, and the performance of the tested models was evaluated using stratified five-fold cross validation leaving k folds out as validation set (where 1 ≤ k ≤ 4). Figure 10 shows how the data was split into 5 folds, each containing an equal number of classes. Using Equation 7 (with n = 5 and k = 2), this results in 10 training models. Increasing the value of k (the number of validation folds) decreases the number of training folds and thus forces the model to train on a smaller dataset. This helps to evaluate how well the models perform on reduced training datasets and their capacity to extract features from few images. The number of combinations (splits) of the train-validation partition is:

C(n, k) = n! / (k!(n - k)!)  (7)

where n is the number of folds and k is the number of validation folds. The procedure is illustrated in Figure 9 and Figure 10.
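As a sanity check on the number of splits given by Equation 7, the combinations can be enumerated with the standard library alone (assuming n = 5 folds, as in our setup):

```python
from itertools import combinations
from math import comb

n = 5  # total number of folds
# number of distinct train/validation splits when k folds are held out for validation
counts = {k: len(list(combinations(range(n), k))) for k in range(1, 5)}
for k, c in counts.items():
    # Equation 7: C(n, k) = n! / (k! (n - k)!)
    assert c == comb(n, k)
print(counts)  # {1: 5, 2: 10, 3: 10, 4: 5}
```

With k = 2 this yields the 10 trained models mentioned above; with k = 1 or k = 4 it yields 5 models each.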

B. Evaluation metrics
In the collected dataset, each image has been manually classified into one of the categories weeds, off-type beet (green leaves beet), beet (red leaves), parsley or spinach, constituting the ground-truth data. By running the classifiers on a test set, we obtained a label for each testing image, resulting in the predicted classes. The classification performance is measured by evaluating the agreement between the ground-truth labels and the predicted ones, resulting in counts of true positives (TP), false positives (FP) and false negatives (FN). We then calculate recall, representing how well a model correctly predicts all the ground-truth classes, and precision, representing the ratio of correct positive predictions to all positive predictions.
The metrics used in the evaluation procedure were precision, recall and F1-score, the latter being the harmonic mean of precision and recall, hence taking both false positives and false negatives into account when evaluating the performance of the model.
Since we used cross-validation to evaluate the performance of each model, we calculated the mean (μ) and standard deviation (σ) of the model's F1-scores in order to obtain an average overview of its performance:

μ = (1/N) Σ_{i=1}^{N} F1_i  (5)

σ = √( (1/N) Σ_{i=1}^{N} (F1_i − μ)² )  (6)

where N is the number of splits generated by the cross-validation procedure. For instance, leaving one fold out generates five splits (N = 5) using Equation 7, as shown in Figure 9.
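These metrics can be sketched with toy, hypothetical numbers; the TP/FP/FN counts and per-fold F1-scores below are illustrative, not results from the paper.

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)  # correct positives / all positive predictions
    recall = tp / (tp + fn)     # correct positives / all ground-truth positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# hypothetical F1-scores of the N = 5 models from a leave-one-fold-out run
scores = np.array([0.998, 0.997, 0.996, 0.998, 0.996])
mu = scores.mean()    # mean over the N splits
sigma = scores.std()  # standard deviation over the N splits
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=10)
print(round(f1, 3), round(mu, 4))  # 0.9 0.997
```

With equal FP and FN counts, precision and recall coincide and the harmonic mean equals both, which makes the example easy to verify by hand.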
As for the loss metric, we used the cross-entropy loss function between the true and predicted classes.
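A minimal numpy version of this loss, with hypothetical one-hot labels and predicted probabilities over the 5 classes:

```python
import numpy as np

def cross_entropy(y_true_onehot, y_pred_probs, eps=1e-12):
    """Mean categorical cross-entropy between one-hot labels and predicted probabilities."""
    p = np.clip(y_pred_probs, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))

y = np.eye(5)[[0, 2, 4]]              # 3 samples, one-hot over 5 classes
p = np.full((3, 5), 0.025)            # residual mass spread over wrong classes
p[np.arange(3), [0, 2, 4]] = 0.9      # confident, correct predictions
print(round(cross_entropy(y, p), 4))  # 0.1054, i.e. -ln(0.9)
```

The loss goes to 0 as the predicted probability of the true class approaches 1 and grows without bound as it approaches 0, which is what drives training toward confident, correct predictions.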

IV. RESULTS AND DISCUSSION
The state-of-the-art CNN-based architectures ResNet and EfficientNet were trained alongside ViT-B16 and ViT-B32 in order to compare their performance on our custom dataset comprising 5 classes (weeds, beet, off-type beet, parsley and spinach). All models were trained using the five-fold cross validation leave-one-out technique. The accuracies and losses of the models tend to plateau after the 30th epoch. The average F1-scores and losses obtained with 3211 testing images (original, unprocessed images, except for the off-type beet images) are reported in Table II.

A. Influence of the training set size
In the next stage, we tried to answer the question of which network family yields the best performance with a smaller training dataset. We did so by carrying out five-fold cross validation leaving k folds out, where k varies from 1 to 4, while keeping the testing set at 3211 images to evaluate the performance of the models.
Varying the number of training images has a direct influence on the performance of the trained ViT model, as shown in Table III. The results obtained with five-fold cross validation leaving two folds out as validation set (k=2) are promising, with a mean F1-score of 99.7% and a standard deviation of 0.1%, showing a very small difference between the scores of the 10 generated models. We notice a very small decrease in the performance of the ViT B-16 model when reducing the number of training images: a decrease of only 0.1% in accuracy when training with only 2/5 of the dataset (6422 images, k=3) and validating on the remaining 3/5. With k=4, the ViT B-16 model was trained on a smaller dataset of 3211 images (a 75% reduction), and its performance decreased as expected, but by a small margin of only 0.44%, for an overall accuracy of 99.63%. These experimental results show how well vision transformer models perform with small datasets: the ViT B-16 model makes use of the self-attention mechanism to learn recognisable patterns from few images in order to achieve such high accuracy. Furthermore, we compared the performance of the models with a varying number of testing images while using five-fold leave-one-fold-out cross validation. The F1-scores are reported in Table IV. As shown in Figure 13, there is a notable decrease in the F1-scores of the four models when testing with 9633 and 13596 images, i.e. training with only 50% and 30% of the labelled dataset. In the third set of experiments, the models were trained on only 4535 images and validated on 1134 images, which explains the decrease in their performance. Even though all models show a decrease in their F1-scores with an increasing number of testing images, the ViT B-16 model still achieves higher performance than the state-of-the-art CNN-based models EfficientNetB0, EfficientNetB1 and ResNet50.
The ViT B-16 model had the smallest decrease in performance from 99.80% (for 3211 testing images) as the number of testing images increased.

V. CONCLUSION
In this study, we used the self-attention paradigm via vision transformer models to learn and classify custom crop and weed images acquired by UAV over beet, parsley and spinach fields. The results achieved with these datasets indicate a promising direction for the use of vision transformers in agricultural problems. Outperforming current state-of-the-art CNN-based models like ResNet and EfficientNet, the base ViT model is to be preferred over the other models for its high accuracy and low computational cost. Furthermore, the ViT B-16 model proved better thanks to its high performance, especially with small training datasets where the other models failed to achieve such high accuracy. This shows how well the convolution-free ViT model interprets an image as a sequence of patches and processes it with a standard transformer encoder, using the self-attention mechanism, to learn discriminative patterns between weed and crop images. In this respect, we come to the conclusion that the application of vision transformers could change the way vision tasks are tackled in agricultural image classification, bypassing classic CNN-based models. In future work, we plan to use a vision transformer classifier as a backbone in an object detection architecture to locate and identify weeds and plants on UAV orthophotos.