Article

Cognitive Computing Advancements: Improving Precision Crop Protection through UAV Imagery for Targeted Weed Monitoring

by Gustavo A. Mesías-Ruiz 1,2, José M. Peña 1, Ana I. de Castro 3, Irene Borra-Serrano 1 and José Dorado 1,*

1 Institute of Agricultural Sciences, Spanish National Research Council (ICA-CSIC), Serrano 115b, 28006 Madrid, Spain
2 School of Agricultural, Food and Biosystems Engineering (ETSIAAB), Polytechnic University of Madrid (UPM), Av. Puerta de Hierro 2, 28040 Madrid, Spain
3 Environment and Agronomy Department, National Agricultural and Food Research and Technology Institute (INIA-CSIC), Ctra. Coruña km 7.5, 28008 Madrid, Spain
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(16), 3026; https://doi.org/10.3390/rs16163026
Submission received: 28 June 2024 / Revised: 12 August 2024 / Accepted: 16 August 2024 / Published: 18 August 2024

Abstract

Early detection of weeds is crucial for effective weed management, decision support and the prevention of potential crop losses. This research presents an innovative approach to developing a specialized cognitive system for classifying and detecting early-stage weeds at the species level. The primary objective was to create an automated multiclass discrimination system using cognitive computing, regardless of the weed growth stage. Initially, the model was trained and tested on a dataset of 31,002 UAV images captured at 11 m above ground level, including ten weed species manually identified by experts at the early phenological stages of maize (BBCH14) and tomato (BBCH501). On this dataset, the vision transformer Swin-T model achieved a classification accuracy exceeding 99.1%. Subsequently, generative modeling was employed for data augmentation, resulting in new classification models based on the Swin-T architecture. These models were evaluated on an unbalanced dataset of 36,556 UAV images captured at later phenological stages (maize BBCH17 and tomato BBCH509), achieving a weighted average F1-score ranging from 94.8% to 95.3%. This performance highlights the system’s adaptability to morphological variations and its robustness in diverse crop scenarios, suggesting that the system can be effectively implemented in real agricultural scenarios, significantly reducing the time and resources required for weed identification. The proposed data augmentation technique also proved effective in implementing the detection transformer architecture, significantly improving the generalization capability and enabling accurate detection of weeds at different growth stages. The research represents a significant advancement in weed monitoring across phenological stages, with potential applications in precision agriculture and sustainable crop management. Furthermore, the methodology showcases the versatility of the latest generation of models for application in other knowledge domains, facilitating time-efficient model development. Future research could investigate the applicability of the model in different geographical regions and with different crop types, as well as real-time implementation for continuous field monitoring.

1. Introduction

Weeds pose a significant challenge to crop protection by engaging in intense competition with crops for essential resources. This competition leads to a notable reduction in crop yields [1], further complicating agricultural endeavors. Accurate monitoring of weed populations is crucial [2]. Traditional methods of weed identification often rely on manual observation, which can be both time-consuming and error-prone, frequently constrained by human experience [3]. Given the similarities in physical traits among numerous weed species, the potential for erroneous identification and subsequent treatment decisions can lead to diminished control-measure effectiveness. Furthermore, owing to the morphological variations among weed species and their growing stages [4], there is a critical demand for a system capable of accurately recognizing and distinguishing these changes.
In the third era of information technology, characterized by the adoption of cognitive computing (CC) systems, a synergistic interaction between humans and technology emerges, aiming at expanding our knowledge base, enhancing the efficient utilization of natural resources and optimizing production processes. CC is based on the idea that computers can learn and simulate human cognitive functions, such as perception, memory and reasoning, using advanced algorithms [5]. Additionally, CC can accelerate the analysis of extensive datasets [6], enabling the detection of patterns and trends that may not be obvious to human perception. Therefore, CC seeks to understand the relationships between different types of data and real-world occurrences, employing cutting-edge technologies such as artificial intelligence (AI), pattern recognition and machine learning (ML) [7].
An inherent challenge in CC is its adaptation to new environments and datasets without extensive training data, thereby minimizing risks associated with bias and overfitting. According to Lytras and Visvizi [8], computer vision possesses the potential to tackle environmental challenges and promote sustainable practices by efficiently processing vast amounts of data and emulating human cognitive capabilities. ML, a foundational component of cognitive systems [9], allows computers to learn autonomously without the need for explicit programming, discerning patterns through statistical methods. Deep learning (DL), inspired by the workings of the human brain, employs artificial neural networks to perform high-level tasks [10]. These neural networks, akin to the brain, enhance their performance by adjusting their connections based on patterns within the data.
Convolutional neural networks (CNNs) are widely used in DL models and provide a fundamental role in several research areas. These architectures are especially relevant in fields such as image processing [11], natural language processing and emotion recognition [12], where they have proven to be instrumental in improving the accuracy and efficiency of automated systems. Using CNN models, data-driven decisions can be made and effective predictions can be generated from new data due to their learning capabilities. The advent of the transformer model [13] represents a significant innovation in natural language processing and ML, revolutionizing our approach to tasks such as machine translation, text generation and computer vision [14]. This architecture works similarly to the human brain, using prior knowledge to make decisions in new situations, facilitated by its dynamic attention mechanism. The emergence of the vision transformer (ViT) model [15] has yielded models comparable to CNNs in the domain of computer vision tasks [16,17].
Through the application of ML, CC systems can enhance their ability to recognize patterns, understand context and manage the complexities of food production systems by integrating data from multiple sources, thus supporting decision-making [18]. CC is revolutionizing modern agriculture through the integration of soft computing techniques such as fuzzy logic, neural networks and fuzzy cognitive maps. These technologies enable more efficient and sustainable management of resources, optimizing crop yields and reducing environmental impact. In addition, these techniques demonstrate a high potential to improve decision-making in real time, adapting to changing conditions and the specific challenges of the agricultural environment [19,20,21].
Technological advancements in remote sensing have popularized the use of unmanned aerial vehicles (UAVs) in agriculture, particularly for crop protection [22], enabling high-resolution data acquisition on crop conditions [23]. Nevertheless, the limited availability of data can pose challenges when training CC models [24]. To address this constraint, data augmentation techniques manipulate existing images to generate new ones representing a broader spectrum of real-world diversity. This strategic approach mitigates overfitting and improves model performance and generalization ability by inferring new data [25]. Among the array of data augmentation techniques, generative adversarial neural networks (GANs) stand out as promising tools for the generation of synthetic data.
The integration of CC in weed detection has significantly improved accuracy, efficiency and cost-effectiveness. Advancements in computer vision and DL models provide robust tools for precision agriculture, ultimately enhancing crop yield and sustainability. Therefore, building upon the progress in CC, with a particular focus on the application of DL models such as GANs and ViTs, the main objective of this research was to develop a robust and automated multiclass discrimination system capable of accurately identifying various weed species, irrespective of their growth stage. The specific objectives of this study are defined as follows:
  • Develop a comprehensive dataset including weed species in their early growth stages, intended for the training, validation and testing of the ViT model’s classification performance on both maize and tomato crops.
  • Assess the performance of the previously tested ViT model on the dataset featuring the same weed species but in later growth stages. The working hypothesis was that performance could decrease significantly compared to the initial growth-stage findings.
  • Implement GAN-based data augmentation techniques to enhance the spatial resolution of the early growth-stage dataset. Then, create new classification models and evaluate the performance of weed species classification using ViT architecture on the subsequent growth-stage dataset.
  • Evaluate and quantify the impact of increased training data on the performance of an object detection model for weed detection by training three different models using incremental datasets.

2. Materials and Methods

This study employed the principles of CC to develop a methodology that enables the creation of a system capable of analyzing complex data, extracting knowledge and utilizing that knowledge for informed decision-making. Our approach involved training a ViT-based model using UAV imagery, enabling the identification of multiple weed species during their early growth stages. Furthermore, we assessed the model’s capacity to transfer knowledge by applying it to a dataset depicting a subsequent growth stage (Figure 1).

2.1. Dataset Generation

The data collection process involved various geographical locations, each characterized by different crop types and agricultural conditions. The selected sites included an experimental maize farm in Arganda del Rey (Madrid, Spain), affiliated with the Spanish National Research Council (CSIC). The maize planting area covered approximately 7400 m2, and data were collected at two phenological growth stages (Figure 2): early crop growth BBCH14 (4 leaves unfolded) and subsequent crop growth BBCH17 (7 leaves unfolded) [26]. In addition, we included two commercial tomato crops in Santa Amalia (Badajoz, Spain), with planting areas of approximately 12,000 m2 and 14,000 m2. Data collection for the tomato crops also involved an early growth stage BBCH501 (first flower bud visible) and a subsequent growth stage BBCH509 (ninth flower bud visible) [26]. The selection of these growth stages was not arbitrary; they correspond to the optimal window for applying effective weed control strategies. Natural weed infestations were observed in the study fields during the data collection process.
A Sony ILCE-6300L RGB visible light camera with an effective resolution of 24.2 megapixels was used, mounted on a Microdrones MD4-1000 UAV. Flying the UAV at an altitude of 11 m resulted in a ground sample distance (GSD) of 0.17 cm per pixel. To ensure comprehensive coverage, a 70% overlap ratio was maintained both laterally and frontally during image capture. Each acquired image had dimensions of 6000 × 3376 pixels. Maize and tomato crop data were acquired in May 2020 and May 2021, respectively. The UAV images were taken around solar noon (approximately 13:00 CET), under cloudless skies and constant lighting conditions. Collecting the images at this time is crucial to minimize the presence of plant shadows, thus improving the quality of the image data and the reliability of the subsequent analysis.
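For reference, the reported GSD can be cross-checked with the standard photogrammetric relation GSD = (flight altitude × sensor width)/(focal length × image width). The focal length is not stated in the text, so the lens value in the sketch below is purely illustrative; with a nominal ~25 mm lens and the APS-C sensor width of the ILCE-6300, an 11 m flight altitude yields roughly 0.17 cm per pixel.

```python
# Illustrative GSD check for the flight setup described above. The sensor width and
# focal length are NOT stated in the text; the values below are nominal assumptions
# for a Sony ILCE-6300 (APS-C, ~23.5 mm sensor width) with an assumed ~25 mm lens.

def gsd_cm_per_px(altitude_m: float, sensor_width_mm: float,
                  focal_length_mm: float, image_width_px: int) -> float:
    """GSD (cm/pixel) = (altitude [cm] * sensor width) / (focal length * image width)."""
    return (altitude_m * 100.0 * sensor_width_mm) / (focal_length_mm * image_width_px)

print(round(gsd_cm_per_px(11.0, 23.5, 25.0, 6000), 2))  # ~0.17 cm/pixel
```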
A total of 568 and 565 images for the BBCH14 and BBCH17 growth stages in maize, respectively, and 895 and 950 images for the BBCH501 and BBCH509 growth stages in tomato, respectively, were used to create the orthomosaics of the entire fields used in the study. Subsequently, fragments of 1000 × 1000 pixels were extracted from these orthomosaics to facilitate the species identification and labeling task carried out by agronomy experts with rectangular boxes using the graphical tool LabelImg [27]. Only whole plants were labeled, which means that the plants divided by the image cuts were not considered. Through this process, the following species were identified: Atriplex patula, Chenopodium album, Convolvulus arvensis, Cyperus rotundus, Datura ferox, Lolium rigidum, Portulaca oleracea, Salsola kali, Solanum nigrum and Sorghum halepense. The number of labels per species and growth stage is shown in Table 1. As a starting point for our study, we allocated 100 labels per species for testing in the early growth stage. Subsequently, the remaining labels were divided into 80% for training and 20% for validation (Table 1). The dataset generated and used for this research can be found in [28].
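A minimal sketch of the split described above (100 test labels per species, then an 80/20 train/validation split of the remainder) is given below; the CSV file and the "species" column name are hypothetical placeholders, not the authors' actual data layout.

```python
# Sketch of the label split: 100 labels per species reserved for testing, the
# remainder split 80/20 into training and validation. File and column names are
# hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

labels = pd.read_csv("labels.csv")  # one row per labeled plant (hypothetical file)

test = labels.groupby("species", group_keys=False).sample(n=100, random_state=0)
remaining = labels.drop(test.index)
train, val = train_test_split(remaining, test_size=0.20,
                              stratify=remaining["species"], random_state=0)
```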

2.2. Vision Transformer Neural Network for Weed and Crop Classification

ViTs represent an innovative extension of transformer models that apply the concept of attention to learn relationships between different parts of an image. ViTs achieve this by breaking down an image into uniformly sized patches and then converting each patch into a vector using an embedding layer. These patch vectors subsequently undergo a relationship learning process using transformers, which exploit attention mechanisms to understand the interconnections among these vectors. This modular structure allows for greater flexibility and efficiency compared to traditional convolutional architectures [15].
The classification stage in our study used the Swin-T model [29], which was selected for its ability to attain high accuracy in image classification tasks while demanding fewer computational resources than other commonly employed ViT and CNN models. This model was specifically designed to maintain feature map resolutions akin to traditional convolutional networks such as VGG and ResNet. However, instead of conventional convolutional layers, Swin-T uses a hierarchical architecture of transformer blocks with shifted windows for processing image fragments and feature extraction [29]. This approach enables multiscale modeling while preserving linear computational complexity with respect to image size. The architecture of Swin-T consists of four components (Figure 3e): patch partitioning, linear embedding, Swin transformer blocks and patch merging. Initially, the input image is divided into non-overlapping patches by patch partitioning. Subsequently, the patch merging process combines these patches based on their adjacency in a 2 × 2 arrangement. Finally, the data stream is repeatedly subjected to patch merging and Swin transformer block operations to effectively process the information.
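A minimal sketch of how such a 12-class Swin-T classifier (ten weed species plus maize and tomato) can be instantiated is given below; it assumes the torchvision implementation and ImageNet-pretrained weights, since the text does not specify which Swin-T implementation was used.

```python
# Sketch of building a Swin-T classifier with a 12-class head (10 weed species +
# 2 crops), assuming the torchvision implementation of Swin-T.
import torch.nn as nn
from torchvision import models

model = models.swin_t(weights=models.Swin_T_Weights.IMAGENET1K_V1)  # ImageNet-pretrained backbone
model.head = nn.Linear(model.head.in_features, 12)                  # replace the 1000-class head
```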
To explore the interpretability of the Swin-T model, the gradient-weighted class activation mapping (Grad-CAM) technique was used [30]. This technique facilitated an in-depth understanding of the mechanisms of self-attention and window shifting, and how these contribute to model decision-making. Grad-CAM not only improves interpretability by providing detailed, class-specific activation maps, but also maintains fidelity to the original model. This allows granular inspection of Swin-T model responses without the need to modify its architecture.

2.3. Model Training and Inference Details

Model training and inference involved the following steps (a configuration sketch is given after the metric equations below):
  • Preprocessing: The PyTorch library was used to resize the labeled image patches to 224 × 224 pixels. Additionally, the library enabled various transformations, including random horizontal image flipping with a 50% probability and normalization based on two sets of values: mean and standard deviation. These values, (0.485, 0.456, 0.406) and (0.229, 0.224, 0.225), correspond to the per-channel means and standard deviations (red, green, blue) computed on a reference dataset such as ImageNet (Figure 3b).
  • Custom sampler: Due to class imbalance within our dataset, we developed a custom sampler that relies on a calculation involving the number of samples per class in the training dataset and inverse class weights. These weights were assigned inversely proportional to the number of samples in each class, meaning that classes with fewer samples received higher weights, while those with more samples received lower weights. This sampler was used during training to select batches of data. This approach facilitated the model in giving greater focus to underrepresented classes, mitigating any bias toward the majority classes in the trained model.
  • Hyperparameter optimization: In the cross-validation process, we utilized the AdamW algorithm [31] with a learning rate of 6 × 10⁻⁴ and a weight decay factor of 1 × 10⁻⁴. The chosen loss function was sparse categorical cross-entropy, with batches of size 32 and a total of 50 training epochs.
  • Performance metrics: To evaluate the performance of the classification model, several metrics were used. Accuracy (ACC) determines the overall accuracy by calculating the percentage of correctly predicted images in relation to the total number of images. However, it is crucial to acknowledge that accuracy is reliable only when the class distribution is balanced. Precision (P) represents the fraction of images correctly labeled as positive by the model. The Recall (R) metric measures the proportion of actual positive cases that the model correctly identifies. F1-score combines precision and recall into a single value, providing a comprehensive measure of classification performance. It proves particularly useful when managing class imbalances in the dataset.
ACC = \frac{\sum_{i=1}^{n} (TP_i + TN_i)}{\sum_{i=1}^{n} (TP_i + FP_i + FN_i + TN_i)}
P = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FP_i)}
R = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FN_i)}
F1\text{-}score = \frac{2 \cdot P \cdot R}{P + R}
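The preprocessing, inverse-frequency sampling and optimizer settings listed above can be combined as in the following sketch. It assumes a torchvision ImageFolder layout and uses PyTorch's WeightedRandomSampler as one possible realization of the custom sampler; paths and dataset organization are illustrative, not taken from the paper.

```python
# Minimal sketch of the preprocessing, inverse-frequency sampler and optimizer
# settings described above; dataset paths and folder layout are assumptions.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import datasets, transforms

train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
train_set = datasets.ImageFolder("dataset/train", transform=train_tf)

# Inverse-frequency weights: under-represented classes are sampled more often.
counts = torch.bincount(torch.tensor(train_set.targets))
sample_weights = (1.0 / counts.float())[train_set.targets]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(train_set))

loader = DataLoader(train_set, batch_size=32, sampler=sampler)
# `model` is the Swin-T classifier from the previous sketch.
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss()  # categorical cross-entropy on integer labels
```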

2.4. Generative Adversarial Neural Network for Image Augmentation

Data augmentation is a technique that effectively enhances both the quantity and diversity of images within a training dataset [25]. A recent advancement in this domain involves the use of GANs to augment datasets with a wide array of contrasting images [32]. In our work, we employed the pretrained generative facial prior GAN (GFPGAN) model, adapting it to align with the specific characteristics of our dataset. The GFPGAN is primarily a facial restoration model designed to achieve a balance between realism and fidelity in super-resolution images [33]. The model consists of two core components (Figure 3c): a U-Net module for mitigating degradation and a StyleGAN2 module, which serves as a pretrained facial GAN used as a prior generator. These components are interconnected through a latent code mapping and multiple layers of specialized split-channel feature transforms. The model was trained using the Adam optimizer [34] for both the generator and the discriminator with a learning rate of 2 × 10⁻³ for both modules, and the training process spanned a total of 800,000 iterations. Additionally, the model employs several loss functions to train the neural network and enhance the quality of facial restoration.
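As an illustration of how the three upscaling factors can be produced, the sketch below runs the stock GFPGANer inference wrapper from the public gfpgan package over a single labeled patch. The paper states that the pretrained model was adapted to the dataset, so this is only a starting point; the checkpoint name and file handling are assumptions.

```python
# Sketch of generating the GFPGAN ×1/×2/×3 augmented copies with the public `gfpgan`
# package (GFPGANer wrapper); checkpoint path and I/O are illustrative only.
import cv2
from gfpgan import GFPGANer

for upscale in (1, 2, 3):
    restorer = GFPGANer(model_path="GFPGANv1.3.pth", upscale=upscale,
                        arch="clean", channel_multiplier=2, bg_upsampler=None)
    img = cv2.imread("patch.png")  # one labeled plant patch (BGR)
    _, _, restored = restorer.enhance(img, has_aligned=False,
                                      only_center_face=False, paste_back=True)
    cv2.imwrite(f"patch_gfpgan_x{upscale}.png", restored)
```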
Figure 3. Flowchart of the CC system developed for weed monitoring at different phenological stages. (a) System input allows RGB images (12 classes) corresponding to BBCH14 and BBCH501 stages. (b) Preprocessing of the dataset. (c) GFPGAN framework. (d) Data augmentation. (e) Swin-T classification architecture. (f) Inference for BBCH17 and BBCH509 stages. Figure (c) adapted from [33]. Figure (e) adapted from [29]. (g) Classification.
To prevent data leakage, data augmentation, as recommended by LeCun et al. [14], was exclusively applied to the training images. Using the GFPGAN framework, images were generated with upscaling factors ×1 (GFPGAN×1), ×2 (GFPGAN×2) and ×3 (GFPGAN×3). These new datasets were used to augment the training data, resulting in the creation of various classification models (Figure 3d): Original + GFPGAN×1 (Ori + GFPGAN×1), Original + GFPGAN×2 (Ori + GFPGAN×2), Original + GFPGAN×3 (Ori + GFPGAN×3), Original + GFPGAN×1 + GFPGAN×2 (Ori + GFPGAN×1 + ×2), Original + GFPGAN×1 + GFPGAN×2 + GFPGAN×3 (Ori + GFPGAN×1 + ×2 + ×3).
Full-reference quality metrics were employed to directly compare the target image (generated by GFPGAN) with the reference (original) image. The mean squared error (MSE) quantifies the average squared difference between actual and ideal pixel values. While straightforward to compute, MSE may not accurately reflect human perception of quality. In contrast, the structural similarity index (SSIM) [35] integrates local image structure, luminance and contrast into a single local quality score. Here, structures refer to patterns of pixel intensities, particularly between adjacent pixels, normalized for luminance and contrast. Because the human visual system is highly attuned to structures, the SSIM metric is better suited for subjective quality assessment.
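A minimal sketch of computing both full-reference metrics with scikit-image is shown below; the file names are placeholders and the library choice is an assumption rather than the authors' implementation.

```python
# Full-reference quality check between an original patch and its GFPGAN output,
# using scikit-image; file names are placeholders.
from skimage import io
from skimage.metrics import mean_squared_error, structural_similarity

ref = io.imread("patch.png")
gen = io.imread("patch_gfpgan_x1.png")  # same size as the reference for the ×1 factor

mse = mean_squared_error(ref, gen)
ssim, ssim_map = structural_similarity(ref, gen, channel_axis=-1, full=True)
print(f"MSE={mse:.2f}  SSIM={ssim:.3f}")  # ssim_map is a local SSIM map as in Figure 6
```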

2.5. Vision Transformer Neural Network for Weed Detection

To validate our dataset and align our research with real-world conditions, we implemented the detection transformer (DETR) object detection model, based on ViTs, to identify and locate weed species in agricultural areas. Using DETR for this task offers significant advantages due to its ability to simplify the detection pipeline and enhance overall image reasoning. DETR eliminates the need for hand-designed components such as anchor generation and non-maximum suppression, which are common in other object detection models. Instead, it employs an encoder–decoder transformer architecture and a set-based global loss that enforces unique predictions through bipartite matching [36]. This approach enables DETR to consider the relationships between objects and the global image context, producing the final set of predictions directly and in parallel. This feature is particularly beneficial for identifying multiple weed species in a single image, improving both the accuracy and efficiency of the detection process.
Three models were created using different combinations of datasets: Original, Ori + GFPGAN×1 and Ori + GFPGAN×1 + ×2. The PyTorch DL framework was used to train the DETR model. The hyperparameter settings included a learning rate of 1 × 10⁻⁴ and a weight decay of 1 × 10⁻⁴. Batch sizes were set to 2 for the training data loader and 1 for the validation data loader. Training was conducted for a total of 20 epochs. In addition, gradient clipping was set to 0.1, gradient accumulation to 8 and the log aggregation frequency to 5. This configuration was chosen to balance training stability and efficiency, after testing several configurations.
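The following sketch reproduces this fine-tuning configuration (learning rate 1 × 10⁻⁴, weight decay 1 × 10⁻⁴, gradient clipping 0.1) under the assumption that a pretrained DETR checkpoint from the Hugging Face transformers library is used; the paper does not state which DETR implementation was employed.

```python
# Assumed DETR fine-tuning setup: Hugging Face checkpoint with the classification
# head replaced for 12 classes; hyperparameters follow the text above.
import torch
from transformers import DetrForObjectDetection, DetrImageProcessor

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")  # resizing/normalization
model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50",
    num_labels=12,                 # 10 weed species + 2 crops
    ignore_mismatched_sizes=True,  # discard the COCO classification head
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

# Inside the training loop (batch size 2, 20 epochs), gradients are clipped at 0.1:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
```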
To evaluate the performance of the DETR model during the training stage, several metrics were used. The intersection over union (IoU) metric evaluates the accuracy of model predictions by determining how well a predicted region overlaps with the actual object. IoU is calculated as the intersection area divided by the union area of the ground truth box and the box predicted by the model. A high IoU value indicates a strong correspondence between predictions and actual annotations, essential for evaluating the spatial accuracy of the model. The mean average precision (mAP) assesses the model’s accuracy and completeness in object detection. The mAP is calculated by averaging precision over classes and across multiple recall levels, considering multiple IoU thresholds.
IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}
mAP = \frac{1}{\lvert \text{classes} \rvert} \sum_{c \in \text{classes}} P_c
To comprehensively understand the model’s performance in various real-world scenarios, several metrics were used. The mAP@[IoU = 0.50] and mAP@[IoU = 0.75] evaluate accuracy at the specific IoU thresholds 0.5 and 0.75, allowing a more detailed understanding of model performance at different levels of overlap. The mAP@[area = small/medium/large] metrics provide information about the model’s effectiveness in detecting objects of different sizes. On the other hand, the recall metrics evaluate the model’s ability to identify all relevant objects. Recall@[maxDets = 1/10/100] measures the recall considering only the top 1, 10 or 100 detections per image, respectively, which is useful to evaluate the performance at different levels of object density. Finally, Recall@[area = small/medium/large] complements the mAP metrics by evaluating the completeness of detections for objects of different sizes.
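These COCO-style metrics (mAP at fixed IoU thresholds, by object area, and recall at 1/10/100 detections) can be computed, for example, with the torchmetrics library, as sketched below; the evaluation library actually used is not stated in the text, so this is only an assumed setup with toy boxes.

```python
# Assumed evaluation setup using torchmetrics' COCO-style detection metric;
# the boxes, scores and labels below are toy values for illustration only.
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(iou_type="bbox")
preds = [{"boxes": torch.tensor([[10.0, 10.0, 50.0, 60.0]]),
          "scores": torch.tensor([0.91]),
          "labels": torch.tensor([3])}]
target = [{"boxes": torch.tensor([[12.0, 11.0, 48.0, 58.0]]),
           "labels": torch.tensor([3])}]
metric.update(preds, target)
print(metric.compute())  # includes map, map_50, map_75, map_small/medium/large, mar_1/10/100
```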

3. Results

3.1. Classification Inference Using the Original Datasets in Both Early and Subsequent Growth Stages

The results of inferring the Swin-T classification model on the dataset corresponding to the early growth stage are shown in Table 2. The evaluation highlights the outstanding performance of the multiclass classification in species identification, with all species achieving evaluation metrics above 97%. This indicates the model’s high accuracy in multiclass classification, showcasing its significant ability to effectively identify various species. Notably, certain species, such as P. oleracea, S. nigrum, S. halepense, maize and tomato, achieved 100% accuracy and recall, highlighting the model’s absolute reliability in identifying these particular species.
The results of the Swin-T classification model’s inference on the dataset corresponding to the subsequent growth stage are shown in Table 3. These inference results revealed varying performance in classifying different species. Notably, species such as maize and S. halepense exhibited a high level of precision and recall, with F1-scores close to 99%. This suggests that the model excels in identifying these species. In contrast, A. patula had a low recall of 53.3%, attributable to a high percentage of A. patula instances misclassified as C. arvensis (22%). This indicates that the model had difficulty differentiating between these two species, leading to a higher omission rate of A. patula when it was actually present. The low precision of 55.6% for C. arvensis was influenced by a significant percentage of C. album instances misclassified as C. arvensis (20%), meaning the model confused a considerable number of C. album plants with C. arvensis. The low recall value of 52.2% for C. rotundus was explained by its frequent confusion with other species, specifically 16% with maize and 14% with S. kali. Confusion with S. kali was due to similar morphological characteristics. In the case of L. rigidum, its low precision of 37.1% was due to remarkable confusion with S. kali (53%), indicating that the model tended to misclassify instances of S. kali as L. rigidum. The confusion with S. kali was of particular concern and suggested considerable similarity in the data characteristics used by the model to differentiate these species.
Figure 4 illustrates the original input images and their corresponding Grad-CAM activation maps. These maps highlight the areas within the input images that the model considers most relevant to its prediction. More intensely colored regions (usually red) indicate a greater contribution to the model’s decision, while fainter regions (usually blue or uncolored) indicate less influence. Subimages (a), (d), (e), (j), (k) and (l) show the location of discriminative features that align with intuitive human features. In contrast, subimages (c), (f) and (g) show activations in the background rather than the main object, which may indicate model failure and bias.

3.2. Data Augmentation GFPGAN on the Early Growth-Stage Dataset

Figure 5 illustrates the contrasting results of synthetic images from our dataset, generated using the GFPGAN model’s capabilities for color restoration and enhancement, employing upscaling factors of ×1, ×2 and ×3, and highlighting an example with D. ferox.
The results in this section demonstrate substantial advancements in image enhancement through generative AI (Figure 6). They highlight the GFPGAN model’s capability to enhance the texture and color of various elements within each image. Importantly, the images generated from our dataset exhibit a high degree of realism, effectively identifying and emphasizing important morphological features, such as leaf arrangement and the texture of both plants and the soil surface. SSIM values confirm these findings, since this quality metric closely correlates with the subjective perception of a human observer. SSIM values generally exceed 0.9; in the local SSIM map, dark pixels depict sparse regions of low similarity (e.g., in Figure 6), where the processed image differs significantly from the reference image. Conversely, light pixels indicate areas of high local SSIM, typically coinciding with uniform regions in the reference image where changes, such as blurring, have minimal impact on perceived image quality.
The image generation runtimes using the early growth-stage training dataset consisting of 23,842 labels showed minor variations based on different scaling factors. Processing at a factor of ×1 exhibited the shortest runtime, completing in 722.1 s, closely followed by the ×2 factor with a runtime of 724.2 s. In contrast, factor ×3 showed the longest execution time, taking 813.4 s to complete. It is important to emphasize that these results are specific to the dataset and hardware configuration used in this study.

3.3. Classification Inference after Data Augmentation on the Subsequent Growth-Stage Dataset

The results of the classification models developed with data augmentation on the dataset related to the subsequent growth stage showed varying performance across different species (Table 4). In general, the Ori + GFPGAN×1 model enhanced the F1-score compared to the original Swin-T model, particularly for species like S. nigrum, highlighting its exceptional classification performance in that specific category. However, both the Ori + GFPGAN×2 and Ori + GFPGAN×3 models underperformed for most species when compared to the Ori + GFPGAN×1 model. This indicates that the Ori + GFPGAN×1 model excelled in terms of overall performance across all species, regardless of species imbalance. Furthermore, the Ori + GFPGAN×1 + ×2 model showed a significant enhancement in terms of the unweighted (macro-averaged) accuracy. This suggests that the Ori + GFPGAN×1 + ×2 model is better suited for scenarios where species balance holds significance in the application.

3.4. Detection of Weeds

3.4.1. Model Training

The training performance results of the three models indicate a higher mAP in the model trained with the Ori + GFPGAN×1 dataset compared to the model trained only with the original dataset (Table 5). Furthermore, the model trained with Ori + GFPGAN×1 + ×2 outperformed the model trained with Ori + GFPGAN×1, indicating continued improvement with the incorporation of additional data. Similarly, IoU also improved with increasing training data. The model trained with Ori + GFPGAN×1 + ×2 achieved the highest IoU, followed by the model trained with Ori + GFPGAN×1, and lastly, the model trained with the original dataset. This consistent pattern suggests that the inclusion of additional data sets (GFPGAN×1 and GFPGAN×2) has a significantly positive impact on both metrics.

3.4.2. Model Inference

Figure 7 illustrates the inference results of the weed detection models. Detections are labeled with weed names and confidence values, demonstrating the models’ effectiveness in identifying and locating the species present. The detected species include C. arvensis, C. album, A. patula and D. ferox, each marked with a bounding box and confidence values ranging from 0.18 to 0.93. Comparisons among the three models show consistency in detecting C. arvensis and C. album, with minor variations in confidence scores. The model trained with the Ori + GFPGAN×1 + ×2 dataset achieved the highest number of detections, demonstrating its ability to correctly identify weeds at the growth stages recommended to apply an efficient weed control treatment.
Inference was performed on a set of images corresponding to partitioned orthomosaics, obtaining 1699 and 2010 images for the early growth phase in maize BBCH14 and tomato BBCH501, respectively. For the later growth phase, 2390 and 3669 images were obtained in maize BBCH17 and tomato BBCH509, respectively. The computational performance of each model and crop growth stage, in terms of the number of trained parameters, disk size and inference times, is shown in Table 6.

4. Discussion

The present study is based on a dataset that stands out for its considerable size and diversity, covering a wider range of species compared to datasets used in similar research. Indeed, previous studies, such as dos Santos Ferreira et al. [37] with 1191 images of broadleaf weeds and 3520 images of grasses, Valente et al. [38] with 631 images of Rumex obtusifolius plants, Petrich et al. [39] with 8100 images of Colchicum autumnale, and Zhang et al. [40] with a collection of 2000 images including species such as Cirsium setosum, Descurainia sophia, Euphorbia helioscopia, Veronica didyma and Avena fatua, have worked with substantially smaller datasets. In contrast, our dataset comprised 31,002 (early growth stage) + 36,556 (subsequent growth stage) images, offering a robust foundation for comprehensive scientific investigations, such as weed detection, classification and segmentation. This is especially important for effective weed identification using supervised learning in the context of DL, as a sufficient amount of labeled data is required.
The Swin-T model achieved an outstanding accuracy of 99.1% on the early growth-stage dataset, exceeding the performance of previous CNN-based approaches. For instance, the study conducted by Huang et al. [41] employed the ResNet-101 model to discriminate between rice crops and two weed species, Cyperus iria and Leptochloa chinensis, achieving a noteworthy accuracy of 94.0%. Valente et al. [38] and Lam et al. [42] also tackled the challenge of Rumex obtusifolius identification using CNN models such as AlexNet and VGG16, respectively. Valente et al. [38] achieved an accuracy exceeding 90.0%, while Lam et al. [42] reached an accuracy of 92.1%. Furthermore, the efficacy of ViT-based classifiers for weed identification and discrimination is underscored by the work of Espejo-Garcia et al. [43]. In their study, they applied the Swin-v2 model, yielding top-1 accuracy scores ranging from 98.5% to 98.6% on the DeepWeeds dataset.
The F1-scores obtained in the inference at the advanced growth stages (BBCH17 and BBCH509), both on a per-species basis and in terms of the weighted average F1-score for each model, were lower compared to those in the study by Reedha et al. [44]. Reedha et al. implemented the ViT B-16 model and achieved an F1-score of 99.4%. Notably, their research focused on the classification of five classes (weeds, beet, off-type beet, parsley and spinach), whereas our study involved the classification of ten weed species and two crops.
The results related to the execution time of the GFPGAN framework for generating new images indicate that processing at the ×1 factor is the most time-efficient option. This makes it an attractive choice for applications that demand rapid real-time data processing. The GFPGAN×2 model, although slightly slower than the GFPGAN×1 model, still provides competitive performance and could be preferable in scenarios where a balance between accuracy and processing speed is required. Conversely, the GFPGAN×3 model, despite its slower processing speed, could be suitable for tasks where accuracy takes precedence, and ultra-rapid processing is not a critical requirement.
Inference conducted using the Ori + GFPGAN×1 model at the later phenological stages BBCH17 and BBCH509 demonstrated that the integration of synthetic images generated by the GFPGAN model, along with real images, is an effective strategy for enhancing the quality of datasets in weed monitoring. Consequently, our findings suggest that the application of GFPGAN in characterizing plant images holds great potential for practical applications in real-world scenarios, particularly in the context of weed management in the framework of precision crop protection.
Another distinctive aspect of our article is the development of a model using UAV images captured from an altitude of 11 m. In the existing literature, studies utilizing RGB images acquired by UAVs at a lower altitude of 2 m can be found, such as the one proposed by Khan et al. [45]. In their study, they introduced a semi-supervised GAN system (SGAN) for classifying weeds, including Eleusine indica, amidst pea and strawberry crops.
The implementation of an object detector has underscored the critical role of both the quantity and quality of training data. The improvement observed in mAP and IoU metrics demonstrates that increasing relevant data significantly enhances the model’s capacity to generalize and accurately detect weeds. This highlights how models trained on larger datasets effectively identify and classify weeds across various growth stages, essential for real-world scenarios where images exhibit high complexity and variability. While performance has improved with increased data, it is important to consider the law of diminishing returns. Beyond a certain point, additional data may only result in marginal improvements while increasing computational and storage costs. Moreover, our study identifies that DETR faces specific challenges, particularly in accurately detecting small objects and requiring extensive training durations, akin to issues noted in previous research [46].
Gallo et al. [47] evaluated the efficacy of the YOLOv7 model for weed detection in chicory crops, achieving an accuracy of 61.3% and a recall of 62.1%. While the YOLOv7 models are highly competitive, they could significantly benefit from data augmentation techniques to enhance accuracy across various IoU thresholds. This is shown in Table 5, where applying these techniques resulted in a notable mAP improvement from 0.060 to 0.234. This is particularly relevant given that the mAP@[IoU = 0.50] in Table 5 reaches a value of 0.513, indicating improved predictive capability through fine-tuning of the model and use of additional image processing techniques. Khan et al. [45] proposed a semi-supervised GAN-based framework for crop and weed classification in UAV imagery, achieving 90% accuracy using only 20% labeled imagery, and reaching 94.17% with 80% labeled imagery. These results demonstrate exceptional performance, especially when compared to the overall mAP shown in Table 5. This suggests that the semi-supervised approach is highly efficient, even with a limited amount of labeled data, which could reduce the need for extensive data augmentation techniques. The work of Gašparović et al. [48] on weed detection in oat fields using UAV imagery showed that automatic object-based classification achieved an overall accuracy of 89.0%. This result reflects a high accuracy in weed segmentation, favorably comparable with the performance of the detection models presented in Table 5, when considering the mAP for large (0.393) and medium (0.259) areas. These findings highlight the robustness of the object-based approach, particularly in segmenting larger areas. Future investigations should focus on mitigating these constraints, potentially through optimizing model architectures or adopting more efficient training techniques.
Vision transformers, such as Swin-T, employ attention mechanisms that weight the importance of different regions of an image, dividing it into patches and generating hierarchical representations that integrate both global and local features. In this context, the ability of Swin-T to identify parts of weeds and crop plants was highly dependent on the variability and quality of the training dataset. The presence of images with occlusions, deformations and variations in scale improved the generalization of the model, allowing it to recognize parts of weeds and crops under different conditions. In addition, detailed annotations that included information about the parts of weeds and crops were critical for the model to learn discriminative representations. Swin-T’s self-attention mechanisms facilitated focusing on the most relevant areas of the image, enhancing the identification of parts even in the presence of partial occlusions. Thus, although Swin-T was not specifically designed in this work for the identification of parts of weeds and crops, its inherent attention and hierarchical representation capabilities enabled effective identification at the boundaries of the image cutouts made to the orthophotos.
In this study, we chose not to use automated hyperparameter optimization techniques, which has several key advantages from a scientific and practical perspective. The significant reduction in computational costs and run time allowed us to focus resources on faster and more accurate experimental iterations. This simplified approach facilitated efficient implementation and provided rigorous control over the hyperparameters, allowing specific adjustments based on accumulated expertise and qualitative observations of model behavior during training. In addition, by avoiding variations introduced by automated methods, the reproducibility of the experiments was improved, ensuring that the results obtained can be replicated more reliably in future studies. This strategy also promoted a deeper understanding of the interactions between hyperparameters and model performance, providing valuable insights for interpretation and continuous model improvement. In the context of exploratory or resource-constrained research, manual hyperparameter configuration proved to be an effective methodology, aligned with operational efficiency and scientific integrity.
The growth stages studied in this research largely coincide with the timely moment for weed treatment, for two reasons. First, weed control is particularly effective when applied at early stages, minimizing adverse competition between weeds and crops. Second, overlap between crop plants is usually low in this critical treatment window, reducing the screening effect of the crop canopy on herbicide application and thus increasing its efficacy. From a practical point of view, our results can be helpful for applying species- and site-specific weed control measures at the right time, but further work should be carried out to deploy these CC models in more complex systems that combine detection, decision-making and actuation tasks for weed treatment following precision agriculture strategies.

5. Conclusions

This research relied on a large UAV imagery dataset, collected at two phenological stages, to classify ten weed species and two crops using a cognitive computing-based system. By adapting a generative model originally designed for facial super-resolution, we enhanced our dataset to create a robust classification model capable of adapting to morphological variations in the studied species. Despite its initial design for facial enhancement, this study demonstrates that the GFPGAN framework can be effectively repurposed for enhancing plant images, indicating significant potential for future weed management strategies. Our results highlight the Swin-T model’s outstanding performance in multiclass classification of weed species from UAV imagery, showing remarkable adaptability across different phenological stages. This underscores its effectiveness in discerning image variations at various growth stages, establishing it as a robust tool for weed monitoring. By applying data augmentation and transfer learning techniques, we achieved significant improvements in result robustness, reducing the risks of overfitting and dataset unrepresentativeness. Models generated with the data augmentation showed enhanced F1-scores for most weed species within the unbalanced dataset associated with the later growth stages, demonstrating the methodology’s adaptability over time. The proposed multiclass discrimination system, combining the classification and the DETR model for detecting multiple weed species, proved effective in accommodating variations across weed species’ developmental stages, holding potential to improve weed management strategies in precision crop protection. This could lead to reduced herbicide usage, resistance prevention, optimized control costs, efficient resource allocation, and ultimately, improved crop quality.

Author Contributions

Conceptualization, G.A.M.-R., J.M.P., A.I.d.C., I.B.-S. and J.D.; methodology, G.A.M.-R., J.M.P. and J.D.; software, G.A.M.-R.; formal analysis, G.A.M.-R., J.M.P., A.I.d.C., I.B.-S. and J.D.; investigation, G.A.M.-R., J.M.P., A.I.d.C., I.B.-S. and J.D.; data curation, G.A.M.-R.; writing—original draft preparation, G.A.M.-R. and J.D.; writing—review and editing, G.A.M.-R., J.M.P., A.I.d.C., I.B.-S. and J.D.; supervision, J.M.P. and J.D.; funding acquisition, J.M.P. and J.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the SPANISH RESEARCH STATE AGENCY (AEI) through the project PID2020-113229RB-C41/AEI/10.13039/501100011033. The lead author, G.A. Mesías-Ruiz has been a beneficiary of an FPI fellowship by the SPANISH MINISTRY OF EDUCATION AND PROFESSIONAL TRAINING, grant number PRE2018-083227. The research of I. Borra-Serrano was financed by the grant FJC2021-047687-1 funded by MCIN/AEI/10.13039/501100011033 and EUROPEAN UNION NEXTGENERATIONEU/PRTR.

Data Availability Statement

Datasets generated and used for this research concerning the early phenological stages BBCH14 (maize) and BBCH501 (tomato) are available at the link https://doi.org/10.20350/digitalCSIC/16131 (accessed on 15 August 2024).

Acknowledgments

The authors thank the technicians David Campos and José Manuel Martín, members of the Tech4Agro research group, for their support during field sampling and image labeling.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Horvath, D.P.; Clay, S.A.; Swanton, C.J.; Anderson, J.V.; Chao, W.S. Weed-induced crop yield loss: A new paradigm and new challenges. Trends Plant Sci. 2023, 28, 567–582. [Google Scholar] [CrossRef] [PubMed]
  2. Fernández-Quintanilla, C.; Dorado, J.; Andújar, D.; Peña, J.M. Site-Specific Based Models. In Decision Support Systems for Weed Management; Chantre, G.R., González-Andújar, J.L., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 143–157. ISBN 978-3-030-44402-0. [Google Scholar]
  3. Andújar, D.; Ribeiro, A.; Carmona, R.; Fernández-Quintanilla, C.; Dorado, J. An assessment of the accuracy and consistency of human perception of weed cover. Weed Res. 2010, 50, 638–647. [Google Scholar] [CrossRef]
  4. Wang, A.; Zhang, W.; Wei, X. A review on weed detection using ground-based machine vision and image processing techniques. Comput. Electron. Agric. 2019, 158, 226–240. [Google Scholar] [CrossRef]
  5. Aghav-Palwe, S.; Gunjal, A. Chapter 1-Introduction to cognitive computing and its various applications. In Cognitive Computing for Human-Robot Interaction; Mittal, M., Shah, R.R., Roy, S., Eds.; Cognitive Data Science in Sustainable Computing; Academic Press: Cambridge, MA, USA, 2021; pp. 1–18. ISBN 978-0-323-85769-7. [Google Scholar]
  6. Sreedevi, A.G.; Nitya Harshitha, T.; Sugumaran, V.; Shankar, P. Application of cognitive computing in healthcare, cybersecurity, big data and IoT: A literature review. Inf. Process. Manag. 2022, 59, 102888. [Google Scholar] [CrossRef]
  7. Dong, Y.; Hou, J.; Zhang, N.; Zhang, M. Research on How Human Intelligence, Consciousness, and Cognitive Computing Affect the Development of Artificial Intelligence. Complexity 2020, 2020, e1680845. [Google Scholar] [CrossRef]
  8. Lytras, M.D.; Visvizi, A. Artificial Intelligence and Cognitive Computing: Methods, Technologies, Systems, Applications and Policy Making. Sustainability 2021, 13, 3598. [Google Scholar] [CrossRef]
  9. Remya, R.; Sumithra, M.G.; Krishnamoorthi, M. Foundation of cognitive computing. In Deep Learning for Cognitive Computing Systems: Technological Advancements and Applications; Sumithra, M., Kumar Dhanaraj, R., Iwendi, C., Manoharan, A., Eds.; De Gruyter: Berlin, Germany; Boston, MA, USA, 2023; pp. 19–33. [Google Scholar] [CrossRef]
  10. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  11. Zaidi, S.S.A.; Ansari, M.S.; Aslam, A.; Kanwal, N.; Asghar, M.; Lee, B. A survey of modern deep learning based object detection models. Digit. Signal Process. 2022, 126, 103514. [Google Scholar] [CrossRef]
  12. Singh, V.; Prasad, S. Speech emotion recognition system using gender dependent convolution neural network. Procedia Comput. Sci. 2023, 218, 2533–2540. [Google Scholar] [CrossRef]
  13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
  14. Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A survey of transformers. AI Open 2022, 3, 111–132. [Google Scholar] [CrossRef]
  15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
  16. Yang, Y.; Jiao, L.; Liu, X.; Liu, F.; Yang, S.; Feng, Z.; Tang, X. Transformers Meet Visual Learning Understanding: A Comprehensive Review. arXiv 2022, arXiv:2203.12944. [Google Scholar] [CrossRef]
  17. Liu, Y.; Zhang, Y.; Wang, Y.; Hou, F.; Yuan, J.; Tian, J.; Zhang, Y.; Shi, Z.; Fan, J.; He, Z. A Survey of Visual Transformers. IEEE Trans. Neural Netw. Learn. Syst. 2023, 1–21. [Google Scholar] [CrossRef] [PubMed]
  18. Lonij, V.P.A.; Fiot, J.-B. Chapter 8-Cognitive Systems for the Food–Water–Energy Nexus. In Handbook of Statistics; Gudivada, V.N., Raghavan, V.V., Govindaraju, V., Rao, C.R., Eds.; Cognitive Computing: Theory and Applications; Elsevier: Amsterdam, The Netherlands, 2016; Volume 35, pp. 255–282. [Google Scholar]
  19. Huang, Y.; Lan, Y.; Thomson, S.J.; Fang, A.; Hoffmann, W.C.; Lacey, R.E. Development of soft computing and applications in agricultural and biological engineering. Comput. Electron. Agric. 2010, 71, 107–127. [Google Scholar] [CrossRef]
  20. Mourhir, A.; Papageorgiou, E.I.; Kokkinos, K.; Rachidi, T. Exploring Precision Farming Scenarios Using Fuzzy Cognitive Maps. Sustainability 2017, 9, 1241. [Google Scholar] [CrossRef]
  21. Munteanu, S.; Sudacevschi, V.; Ababii, V.; Branishte, R.; Turcan, A.; Leashcenco, V. Cognitive Distributed Computing System for Intelligent Agriculture. Int. J. Progress. Sci. Technol. (IJPSAT) 2021, 24, 334–342. [Google Scholar]
  22. Maes, W.H.; Steppe, K. Perspectives for Remote Sensing with Unmanned Aerial Vehicles in Precision Agriculture. Trends Plant Sci. 2019, 24, 152–164. [Google Scholar] [CrossRef]
  23. Weiss, M.; Jacob, F.; Duveiller, G. Remote sensing for agricultural applications: A meta-review. Remote Sens. Environ. 2020, 236, 111402. [Google Scholar] [CrossRef]
  24. Rejeb, A.; Abdollahi, A.; Rejeb, K.; Treiblmaier, H. Drones in agriculture: A review and bibliometric analysis. Comput. Electron. Agric. 2022, 198, 107017. [Google Scholar] [CrossRef]
  25. Mumuni, A.; Mumuni, F. Data augmentation: A comprehensive survey of modern approaches. Array 2022, 16, 100258. [Google Scholar] [CrossRef]
  26. Meier, U. Growth Stages of Mono- and Dicotyledonous Plants: BBCH Monograph. Open Agrar Repositorium: Quedlinburg, Germany, 2018; ISBN 978-3-95547-071-5. [Google Scholar]
  27. Tzutalin, LabelImg. 2015. Available online: https://github.com/tzutalin/labelImg (accessed on 15 August 2024).
  28. Mesías-Ruiz, G.A.; Borra-Serrano, I.; Peña Barragán, J.M.; de Castro, A.I.; Fernández-Quintanilla, C.; Dorado, J. Unmanned Aerial Vehicle Imagery for Early Stage Weed Classification and Detection in Maize and Tomato Crops. DIGITAL.CSIC. 2024. Available online: https://digital.csic.es/handle/10261/347533 (accessed on 15 August 2024). [CrossRef]
  29. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  30. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  31. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2019, arXiv:1711.05101. [Google Scholar] [CrossRef]
  32. Olaniyi, E.; Chen, D.; Lu, Y.; Huang, Y. Generative Adversarial Networks for Image Augmentation in Agriculture: A Systematic Review. arXiv 2022, arXiv:2204.04707. [Google Scholar] [CrossRef]
  33. Wang, X.; Li, Y.; Zhang, H.; Shan, Y. Towards Real-World Blind Face Restoration with Generative Facial Prior. arXiv 2021, arXiv:2101.04061. [Google Scholar] [CrossRef]
  34. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar] [CrossRef]
  35. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  36. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv 2020, arXiv:2005.12872. [Google Scholar] [CrossRef]
  37. dos Santos Ferreira, A.; Matte Freitas, D.; Gonçalves da Silva, G.; Pistori, H.; Theophilo Folhes, M. Weed detection in soybean crops using ConvNets. Comput. Electron. Agric. 2017, 143, 314–324. [Google Scholar] [CrossRef]
  38. Valente, J.; Doldersum, M.; Roers, C.; Kooistra, L. Detecting Rumex obtusifolius weed plants in grasslands from UAV RGB imagery using deep learning. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2019, IV-2/W5, 179–185. [Google Scholar] [CrossRef]
  39. Petrich, L.; Lohrmann, G.; Neumann, M.; Martin, F.; Frey, A.; Stoll, A.; Schmidt, V. Detection of Colchicum autumnale in drone images, using a machine-learning approach. Precis. Agric. 2020, 21, 1291–1303. [Google Scholar] [CrossRef]
  40. Zhang, R.; Wang, C.; Hu, X.; Liu, Y.; Chen, S.; Su, B. Weed location and recognition based on UAV imaging and deep learning. Int. J. Precis. Agric. Aviat. 2020, 3, 23–29. [Google Scholar] [CrossRef]
  41. Huang, H.; Deng, J.; Lan, Y.; Yang, A.; Deng, X.; Wen, S.; Zhang, H.; Zhang, Y. Accurate Weed Mapping and Prescription Map Generation Based on Fully Convolutional Networks Using UAV Imagery. Sensors 2018, 18, 3299. [Google Scholar] [CrossRef]
  42. Lam, O.H.Y.; Dogotari, M.; Prüm, M.; Vithlani, H.N.; Roers, C.; Melville, B.; Zimmer, F.; Becker, R. An open source workflow for weed mapping in native grassland using unmanned aerial vehicle: Using Rumex obtusifolius as a case study. Eur. J. Remote Sens. 2021, 54, 71–88. [Google Scholar] [CrossRef]
  43. Espejo-Garcia, B.; Panoutsopoulos, H.; Anastasiou, E.; Rodríguez-Rigueiro, F.J.; Fountas, S. Top-tuning on transformers and data augmentation transferring for boosting the performance of weed identification. Comput. Electron. Agric. 2023, 211, 108055. [Google Scholar] [CrossRef]
  44. Reedha, R.; Dericquebourg, E.; Canals, R.; Hafiane, A. Transformer Neural Network for Weed and Crop Classification of High Resolution UAV Images. Remote Sens. 2022, 14, 592. [Google Scholar] [CrossRef]
  45. Khan, S.; Tufail, M.; Khan, M.T.; Khan, Z.A.; Iqbal, J.; Alam, M. A novel semi-supervised framework for UAV based crop/weed classification. PLoS ONE 2021, 16, e0251008. [Google Scholar] [CrossRef] [PubMed]
  46. Shahi, T.B.; Dahal, S.; Sitaula, C.; Neupane, A.; Guo, W. Deep Learning-Based Weed Detection Using UAV Images: A Comparative Study. Drones 2023, 7, 624. [Google Scholar] [CrossRef]
  47. Gallo, I.; Rehman, A.U.; Dehkordi, R.H.; Landro, N.; La Grassa, R.; Boschetti, M. Deep Object Detection of Crop Weeds: Performance of YOLOv7 on a Real Case Dataset from UAV Images. Remote Sens. 2023, 15, 539. [Google Scholar] [CrossRef]
  48. Gašparović, M.; Zrinjski, M.; Barković, Đ.; Radočaj, D. An automatic method for weed mapping in oat fields based on UAV imagery. Comput. Electron. Agric. 2020, 173, 105385. [Google Scholar] [CrossRef]
Figure 1. Flowchart of the research development process, comprising the following steps: (a) programmed UAV flights at an altitude of 11 m above the two crops; (b) orthomosaic building; (c) partitioning of the orthomosaics, labeling and categorization of the identified species, and model building; (d) generation of synthetic images using GANs; (e) implementation of ViT classifiers using the dataset from the initial flights; and (f) assessment of the dataset from the subsequent crop growth stage for comparative analysis.
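Step (c) of the workflow partitions each orthomosaic into small image tiles before labeling and model building. As an illustration only, the sketch below shows one way such a partition could be produced with Pillow; the 1000 × 1000 pixel tile size is taken from the detection examples in Figure 7, and the file names are hypothetical.

```python
# Minimal sketch: split a large orthomosaic into fixed-size tiles for labeling.
# File names and the 1000 x 1000 px tile size are illustrative assumptions.
from pathlib import Path
from PIL import Image

Image.MAX_IMAGE_PIXELS = None  # orthomosaics can exceed Pillow's default safety limit

def tile_orthomosaic(src_path, out_dir, tile=1000):
    """Crop an orthomosaic into non-overlapping tile x tile patches."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    mosaic = Image.open(src_path)
    width, height = mosaic.size
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            patch = mosaic.crop((left, top, left + tile, top + tile))
            patch.save(out / f"tile_{top}_{left}.png")

# Example call (hypothetical paths):
# tile_orthomosaic("maize_bbch14_orthomosaic.tif", "tiles/maize_bbch14")
```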
Figure 2. Ground-level (terrestrial) images of the maize crop at the two sampling times: (a) early growth stage BBCH14 (4 leaves unfolded) and (b) subsequent growth stage BBCH17 (7 leaves unfolded).
Figure 4. Visualization of Grad-CAM activation maps in the interpretation of the Swin-T classification model for the species (a) Atriplex patula, (b) Chenopodium album, (c) Convolvulus arvensis, (d) Cyperus rotundus, (e) Datura ferox, (f) Lolium rigidum, (g) Portulaca oleracea, (h) Salsola kali, (i) Solanum nigrum, (j) Sorghum halepense, (k) maize and (l) tomato.
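Figure 4 visualizes Grad-CAM activation maps for the Swin-T classifier. The authors' exact visualization code is not reproduced here; the sketch below shows one common way to obtain comparable maps with the open-source pytorch-grad-cam package and a timm Swin-T backbone. The input file, class index, target layer and the pretrained 12-class head standing in for the fine-tuned model are all assumptions, and input normalization is omitted for brevity.

```python
# Sketch only: Grad-CAM heatmaps for a Swin-T classifier (pytorch-grad-cam + timm).
# The checkpoint, input tile, number of classes and target class are assumptions.
import numpy as np
import timm
import torch
from PIL import Image
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True, num_classes=12)
model.eval()

def reshape_transform(tensor, height=7, width=7):
    # Depending on the timm version, Swin blocks return (B, H*W, C) or (B, H, W, C);
    # either way, reshape to (B, C, H, W) so the CAM can be computed as a 2D map.
    if tensor.ndim == 3:
        tensor = tensor.reshape(tensor.size(0), height, width, tensor.size(2))
    return tensor.permute(0, 3, 1, 2)

img = np.array(Image.open("datura_ferox_tile.png").convert("RGB").resize((224, 224))) / 255.0
inp = torch.tensor(img, dtype=torch.float32).permute(2, 0, 1).unsqueeze(0)

# One common choice of target layer for Swin models; other layers are possible.
cam = GradCAM(model=model, target_layers=[model.layers[-1].blocks[-1].norm1],
              reshape_transform=reshape_transform)
heatmap = cam(input_tensor=inp, targets=[ClassifierOutputTarget(4)])[0]  # class 4: hypothetical index
overlay = show_cam_on_image(img.astype(np.float32), heatmap, use_rgb=True)
Image.fromarray(overlay).save("gradcam_datura_ferox.png")
```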
Figure 5. Photorealistic images produced using the GFPGAN framework: (a) original image, (b) upscaled ×1, (c) upscaled ×2 and (d) upscaled ×3. The images represent the species D. ferox.
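The GFPGAN framework used for Figure 5 is distributed as an open-source package. The sketch below only illustrates how its stock GFPGANer helper is typically invoked to restore and upscale an image; the weight file, input path and scale factor are assumptions, and the study's own adaptation of GFPGAN to plant imagery may differ from this generic call.

```python
# Sketch only: generating an upscaled, GAN-restored version of a UAV tile with GFPGAN.
# The pretrained weight file, paths and upscale factor are illustrative assumptions.
import cv2
from gfpgan import GFPGANer

restorer = GFPGANer(
    model_path="GFPGANv1.4.pth",  # pretrained weights, downloaded separately
    upscale=2,                    # e.g., an "x2" output as in Figure 5c
    arch="clean",
    channel_multiplier=2,
    bg_upsampler=None,
)

tile = cv2.imread("datura_ferox_tile.png", cv2.IMREAD_COLOR)
# enhance() returns cropped faces, restored faces and the fully restored image;
# for plant tiles only the restored full image is of interest here.
_, _, restored = restorer.enhance(tile, has_aligned=False, only_center_face=False, paste_back=True)
cv2.imwrite("datura_ferox_tile_x2.png", restored)
```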
Figure 6. Illustrative examples of the original UAV images, alongside their corresponding GFPGAN-processed images scaled at ×1, and local SSIM maps, depicting ten weed species and two crop species during their early growth stages: (a) Atriplex patula, (b) Chenopodium album, (c) Convolvulus arvensis, (d) Cyperus rotundus, (e) Datura ferox, (f) Lolium rigidum, (g) Portulaca oleracea, (h) Salsola kali, (i) Solanum nigrum, (j) Sorghum halepense, (k) maize and (l) tomato. Additionally, MSE and SSIM values are provided.
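The MSE and SSIM values and the local SSIM maps shown in Figure 6 can, in principle, be reproduced with standard image-quality tooling. The sketch below uses scikit-image and assumes the original tile and its GFPGAN ×1 counterpart are available as same-size files with hypothetical names.

```python
# Sketch only: MSE, global SSIM and a local SSIM map between an original UAV tile
# and its GFPGAN-processed (x1) counterpart. File names are assumptions.
import numpy as np
from PIL import Image
from skimage.metrics import mean_squared_error, structural_similarity

original = np.array(Image.open("tile_original.png").convert("RGB"))
restored = np.array(Image.open("tile_gfpgan_x1.png").convert("RGB"))

mse = mean_squared_error(original, restored)
ssim, ssim_map = structural_similarity(original, restored, channel_axis=-1, full=True, data_range=255)

print(f"MSE:  {mse:.2f}")
print(f"SSIM: {ssim:.4f}")

# Local SSIM map (averaged over channels), rescaled to 0-255 for visualization as in Figure 6.
local_map = (ssim_map.mean(axis=-1) * 255).clip(0, 255).astype(np.uint8)
Image.fromarray(local_map).save("tile_local_ssim_map.png")
```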
Figure 7. Examples of DETR model inference on 1000 × 1000 pixel images of maize crops: original image (a) and model inference (b) at phenological stage BBCH14; original image (c) and model inference (d) at phenological stage BBCH17; model inference of Ori + GFPGAN×1 (e) and Ori + GFPGAN×1 + ×2 (f) at phenological stage BBCH17. The images depict partitions of the orthomosaic.
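Figure 7 shows inferences from the DETR detection models on orthomosaic tiles. The sketch below illustrates generic DETR-style inference on a 1000 × 1000 pixel tile using the Hugging Face Transformers implementation; the COCO-pretrained checkpoint, file name and confidence threshold are stand-ins, not the fine-tuned weed detector trained in this study.

```python
# Sketch only: DETR inference on a 1000 x 1000 px orthomosaic tile.
# A COCO-pretrained checkpoint is used as a placeholder for the fine-tuned weed model.
import torch
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
model.eval()

tile = Image.open("maize_bbch17_tile.png").convert("RGB")
inputs = processor(images=tile, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes to (label, score, box) triplets at the tile's resolution.
target_sizes = torch.tensor([tile.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(outputs, threshold=0.7, target_sizes=target_sizes)[0]

for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(f"{model.config.id2label[label.item()]}: {score:.2f} at {[round(v, 1) for v in box.tolist()]}")
```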
Table 1. Distribution of labeled images for each weed species and crop in the training, validation and testing sets of the model developed during the early growth stage, along with the number of labeled images from the subsequent growth stage.

| Species | Labels (early) | Train | Validation | Test | Labels (subsequent) |
|---|---|---|---|---|---|
| Atriplex patula | 1000 | 720 | 180 | 100 | 1459 |
| Chenopodium album | 1200 | 880 | 220 | 100 | 2175 |
| Convolvulus arvensis | 1200 | 880 | 220 | 100 | 1102 |
| Cyperus rotundus | 3090 | 2392 | 598 | 100 | 134 |
| Datura ferox | 683 | 466 | 117 | 100 | 589 |
| Lolium rigidum | 1000 | 720 | 180 | 100 | 80 |
| Portulaca oleracea | 1875 | 1420 | 355 | 100 | 177 |
| Salsola kali | 1200 | 880 | 220 | 100 | 1216 |
| Solanum nigrum | 1900 | 1440 | 360 | 100 | 2175 |
| Sorghum halepense | 1600 | 1200 | 300 | 100 | 103 |
| Maize | 12,364 | 9811 | 2453 | 100 | 24,614 |
| Tomato | 3890 | 3032 | 758 | 100 | 2732 |
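In Table 1, every class keeps a fixed test set of 100 early-stage images, and the remaining early-stage images are divided roughly 80/20 between training and validation (e.g., 720/180 of 900 for Atriplex patula, 466/117 of 583 for Datura ferox). A minimal per-class split following that pattern could look like the sketch below; the directory layout and random seed are assumptions.

```python
# Sketch only: per-class split with a fixed 100-image test set and an 80/20
# train/validation split of the remainder, mirroring the proportions in Table 1.
# Directory layout and seed are illustrative assumptions.
import random
from pathlib import Path

def split_class(image_dir, n_test=100, val_fraction=0.2, seed=42):
    images = sorted(Path(image_dir).glob("*.png"))
    random.Random(seed).shuffle(images)
    test = images[:n_test]
    remainder = images[n_test:]
    n_val = round(len(remainder) * val_fraction)
    return {"train": remainder[n_val:], "validation": remainder[:n_val], "test": test}

splits = split_class("labels/early/Datura_ferox")  # hypothetical folder of labeled tiles
print({k: len(v) for k, v in splits.items()})      # e.g. {'train': 466, 'validation': 117, 'test': 100}
```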
Table 2. Performance metrics for the Swin-T classification model applied to weed and crop species during the early growth stage (maize BBCH14 and tomato BBCH501).

| Species | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | Support |
|---|---|---|---|---|---|
| Atriplex patula | 98.0 | 97.0 | 98.0 | 97.5 | 100 |
| Chenopodium album | 98.0 | 99.0 | 98.0 | 98.5 | 100 |
| Convolvulus arvensis | 99.0 | 100.0 | 99.0 | 99.5 | 100 |
| Cyperus rotundus | 99.0 | 100.0 | 99.0 | 99.5 | 100 |
| Datura ferox | 99.0 | 98.0 | 99.0 | 98.5 | 100 |
| Lolium rigidum | 98.0 | 100.0 | 98.0 | 99.0 | 100 |
| Portulaca oleracea | 100.0 | 99.0 | 100.0 | 99.5 | 100 |
| Salsola kali | 98.0 | 98.0 | 98.0 | 98.0 | 100 |
| Solanum nigrum | 100.0 | 99.0 | 100.0 | 99.5 | 100 |
| Sorghum halepense | 100.0 | 100.0 | 100.0 | 100.0 | 100 |
| Maize | 100.0 | 99.0 | 100.0 | 99.5 | 100 |
| Tomato | 100.0 | 100.0 | 100.0 | 100.0 | 100 |
| Overall accuracy | 99.1 | | | | |
| Macro average | 99.1 | 99.1 | 99.1 | 99.1 | |
| Weighted average | 99.1 | 99.1 | 99.1 | 99.1 | |
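Per-class precision, recall and F1-scores of the kind reported in Tables 2 and 3, together with the macro and weighted averages, correspond to a standard multi-class classification report. The sketch below shows how such a report could be computed with scikit-learn from a model's predictions; the label arrays shown are placeholders, not the study's actual outputs.

```python
# Sketch only: per-class precision/recall/F1 plus macro and weighted averages
# (as in Tables 2 and 3) from ground-truth and predicted class indices.
from sklearn.metrics import classification_report

species = ["Atriplex patula", "Chenopodium album", "Convolvulus arvensis", "Cyperus rotundus",
           "Datura ferox", "Lolium rigidum", "Portulaca oleracea", "Salsola kali",
           "Solanum nigrum", "Sorghum halepense", "Maize", "Tomato"]

# y_true and y_pred would hold one integer class index per test image,
# e.g. produced by arg-maxing the Swin-T softmax outputs. Placeholder values:
y_true = [0, 0, 1, 5, 10, 11]
y_pred = [0, 1, 1, 5, 10, 11]

print(classification_report(y_true, y_pred, labels=list(range(len(species))),
                            target_names=species, digits=3, zero_division=0))
```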
Table 3. Performance metrics for the Swin-T classification model applied to weed and crop species during the subsequent growth stage (maize BBCH17 and tomato BBCH509).

| Species | Precision (%) | Recall (%) | F1-Score (%) | Support |
|---|---|---|---|---|
| Atriplex patula | 96.3 | 53.3 | 68.6 | 1459 |
| Chenopodium album | 91.9 | 82.0 | 86.7 | 2175 |
| Convolvulus arvensis | 55.6 | 97.3 | 70.7 | 1102 |
| Cyperus rotundus | 65.4 | 52.2 | 58.1 | 134 |
| Datura ferox | 88.7 | 91.7 | 90.2 | 589 |
| Lolium rigidum | 37.1 | 90.0 | 52.6 | 80 |
| Portulaca oleracea | 85.0 | 89.8 | 87.4 | 177 |
| Salsola kali | 98.5 | 88.3 | 93.2 | 1216 |
| Solanum nigrum | 98.2 | 100.0 | 99.1 | 2175 |
| Sorghum halepense | 67.8 | 100.0 | 80.8 | 103 |
| Maize | 99.0 | 98.0 | 98.5 | 24,614 |
| Tomato | 91.0 | 98.6 | 94.7 | 2732 |
| Overall accuracy | | | 94.7 | 36,556 |
| Macro average | 81.2 | 86.8 | 81.7 | |
| Weighted average | 95.9 | 94.7 | 94.8 | |
Table 4. F1-score (%) by species at the subsequent growth stage for various classification models generated through data augmentation.

| Species | Original (Ori) | Ori + GFPGAN×1 | Ori + GFPGAN×2 | Ori + GFPGAN×3 | Ori + GFPGAN×1 + ×2 | Ori + GFPGAN×1 + ×2 + ×3 |
|---|---|---|---|---|---|---|
| Atriplex patula | 68.6 | 71.5 | 65.0 | 62.1 | 72.1 | 70.6 |
| Chenopodium album | 86.7 | 86.4 | 81.8 | 79.5 | 86.5 | 79.3 |
| Convolvulus arvensis | 70.7 | 70.7 | 72.4 | 64.7 | 74.6 | 72.18 |
| Cyperus rotundus | 58.1 | 60.8 | 57.4 | 51.4 | 58.5 | 63.8 |
| Datura ferox | 90.2 | 87.5 | 71.2 | 85.2 | 79.3 | 85.5 |
| Lolium rigidum | 52.6 | 67.0 | 53.0 | 69.1 | 69.5 | 63.1 |
| Portulaca oleracea | 95.4 | 88.6 | 73.0 | 81.0 | 84.8 | 83.5 |
| Salsola kali | 87.4 | 96.3 | 92.9 | 96.0 | 96.6 | 94.2 |
| Solanum nigrum | 99.1 | 99.3 | 98.9 | 98.0 | 97.9 | 97.1 |
| Sorghum halepense | 80.8 | 89.2 | 96.7 | 95.4 | 97.6 | 95.4 |
| Maize | 98.5 | 98.5 | 97.2 | 97.8 | 97.9 | 97.8 |
| Tomato | 94.7 | 96.8 | 95.2 | 95.6 | 97.0 | 96.7 |
| Macro average | 81.7 | 84.4 | 79.6 | 81.3 | 84.4 | 83.3 |
| Weighted average | 94.8 | 95.3 | 93.3 | 93.5 | 94.8 | 94.1 |
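The augmented training sets in Table 4 combine the original tiles with one or more GFPGAN-upscaled variants. One straightforward way to assemble such a combination in PyTorch is to concatenate per-variant image folders, as sketched below; the directory names and preprocessing are assumptions.

```python
# Sketch only: building the "Ori + GFPGANx1 + x2" training set of Table 4 by
# concatenating image folders of the original and augmented tiles.
# Folder names and preprocessing are illustrative assumptions.
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # Swin-T input size
    transforms.ToTensor(),
])

variants = ["data/train_original", "data/train_gfpgan_x1", "data/train_gfpgan_x2"]
train_set = ConcatDataset([ImageFolder(root, transform=preprocess) for root in variants])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)
```

Each variant folder is assumed to contain the same per-species subdirectories so that class indices remain consistent across the concatenated datasets.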
Table 5. Training performance (mAP and IoU metrics) for various detection models using datasets generated with data augmentation.

| Metric | Original | Ori + GFPGAN×1 | Ori + GFPGAN×1 + ×2 |
|---|---|---|---|
| mAP | 0.060 | 0.191 | 0.234 |
| mAP@[IoU = 0.50] | 0.158 | 0.444 | 0.513 |
| mAP@[IoU = 0.75] | 0.030 | 0.129 | 0.171 |
| mAP@[area = small] | 0.031 | 0.072 | 0.096 |
| mAP@[area = medium] | 0.069 | 0.206 | 0.259 |
| mAP@[area = large] | 0.185 | 0.237 | 0.393 |
| Recall@[maxDets = 1] | 0.096 | 0.084 | 0.095 |
| Recall@[maxDets = 10] | 0.274 | 0.255 | 0.252 |
| Recall@[maxDets = 100] | 0.336 | 0.340 | 0.339 |
| Recall@[area = small] | 0.215 | 0.162 | 0.149 |
| Recall@[area = medium] | 0.351 | 0.366 | 0.372 |
| Recall@[area = large] | 0.408 | 0.447 | 0.522 |
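The metrics in Table 5 follow the standard COCO evaluation protocol (mAP averaged over IoU thresholds, plus breakdowns by IoU, object area and maximum detections). They can, in principle, be reproduced with pycocotools, as sketched below; the annotation and detection file names are placeholders.

```python
# Sketch only: COCO-style evaluation producing the mAP/recall breakdown of Table 5.
# Ground-truth and detection JSON file names are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations_bbch17_val.json")        # COCO-format ground truth
coco_dt = coco_gt.loadRes("detr_detections.json")    # model detections in COCO results format

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75, AP/AR by area and AR@{1,10,100}
```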
Table 6. Computational performance of object detector inference in terms of multiple metrics used in the evaluation of computer vision models. Inference times (s) are reported per crop and growth stage: early (maize BBCH14, tomato BBCH501) and subsequent (maize BBCH17, tomato BBCH509).

| Model | No. of Trainable Parameters (×10⁶) | Size on Disk (MB) | Maize BBCH14 (s) | Tomato BBCH501 (s) | Maize BBCH17 (s) | Tomato BBCH509 (s) |
|---|---|---|---|---|---|---|
| Original | 41.5 | 166.5 | 80.28 | 103.64 | 110.56 | 165.88 |
| Ori + GFPGAN×1 | 41.5 | 166.5 | – | – | 115.28 | 166.14 |
| Ori + GFPGAN×1 + ×2 | 41.5 | 166.5 | – | – | 117.45 | 169.83 |
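Trainable-parameter counts and inference times of the kind listed in Table 6 can be measured directly in PyTorch. The sketch below shows one simple way to do so for a generic DETR checkpoint over a list of tiles; the model, tile file names and single-image batching are placeholders rather than the study's benchmarking setup.

```python
# Sketch only: measuring trainable parameters and wall-clock inference time,
# analogous to the quantities reported in Table 6. Model and paths are placeholders.
import time
import torch
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50").eval()
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_params / 1e6:.1f} x 10^6")

tiles = ["tile_0001.png", "tile_0002.png"]  # hypothetical orthomosaic tiles
start = time.perf_counter()
with torch.no_grad():
    for path in tiles:
        inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
        model(**inputs)
elapsed = time.perf_counter() - start
print(f"Total inference time for {len(tiles)} tiles: {elapsed:.2f} s")
```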