Article

Drone-Based Marigold Flower Detection Using Convolutional Neural Networks

by Piero Vilcapoma 1, Ingrid Nicole Vásconez 2, Alvaro Javier Prado 3, Viviana Moya 4 and Juan Pablo Vásconez 1,5,*

1 Energy Transformation Center, Faculty of Engineering, Universidad Andres Bello, Santiago 7500971, Chile
2 Centro de Biotecnología Vegetal, Facultad de Ciencias de la Vida, Universidad Andrés Bello, Santiago 8370251, Chile
3 Departamento de Ingeniería de Sistemas y Computación, Universidad Católica del Norte, Antofagasta 1249004, Chile
4 Departamento de Automatización y Control Industrial, Escuela Politécnica Nacional, Quito 170525, Ecuador
5 ANID—Millennium Nucleus in Data Science for Plant Resilience (PhytoLearning), Santiago 8370251, Chile
* Author to whom correspondence should be addressed.
Processes 2025, 13(10), 3169; https://doi.org/10.3390/pr13103169
Submission received: 15 August 2025 / Revised: 29 September 2025 / Accepted: 2 October 2025 / Published: 5 October 2025
(This article belongs to the Section AI-Enabled Process Engineering)

Abstract

Artificial intelligence (AI) is an important tool for improving agricultural tasks. In particular, object detection methods based on convolutional neural networks (CNNs) enable the detection and classification of objects directly in the field. Combined with unmanned aerial vehicles (UAVs, drones), these methods allow efficient crop monitoring. The primary challenge is to develop models that are both accurate and feasible under real-world conditions. This study addresses this challenge by evaluating marigold flower detection using three groups of CNN detectors: canonical models, including YOLOv2, Faster R-CNN, and SSD with their original backbones; modified versions of these detectors using DarkNet-53; and modern architectures, including YOLOv11, YOLOv12, and RT-DETR. The dataset consisted of 392 images from marigold fields, which were manually labeled and augmented to a total of 940 images. The results showed that YOLOv2 with DarkNet-53 achieved the best performance, with 98.8% mean average precision (mAP) and a 97.9% F1-score (F1). SSD and Faster R-CNN also improved with the new backbone, reaching 63.1% and 52.8% mAP, respectively. The modern models also obtained strong results: YOLOv11 and YOLOv12 reached 96–97% mAP, and RT-DETR 93.5%. The modification of YOLOv2 allowed this classical detector to compete directly with, and even surpass, recent models. Precision–recall (PR) curves, F1-scores, and complexity analysis confirmed the trade-offs between accuracy and efficiency. These findings demonstrate that while modern detectors are efficient baselines, classical models with updated backbones can still deliver state-of-the-art results for UAV-based crop monitoring.

1. Introduction

AI has made key contributions to ground and aerial robotic perception systems in recent years. In particular, agricultural processes have benefited significantly from robotic systems equipped with machine learning (ML), deep learning (DL), artificial neural networks (ANNs), and computer vision (CV) algorithms in several applications, such as crop monitoring, disease and pest detection, yield estimation, autonomous harvesting, precision irrigation, and quality assessment of fruits and vegetables [1,2,3,4,5]. These advances increase efficiency and productivity while reducing resource consumption and supporting more sustainable agricultural practices [6,7,8,9]. In crop monitoring applications, the most commonly used technologies to collect database information are multispectral cameras, thermal cameras, red–green–blue (RGB) cameras, LiDAR sensors, and UAVs, among others [10,11,12,13]. These sensors allow for collecting crop information through images, videos, and measurement recordings. The collected data are normalized and managed to build databases that the AI models can use to perform tasks such as image classification, object detection, and segmentation [14,15,16].
In particular, flower detection applications can significantly reduce time and resource requirements in field operations, as they allow monitoring of flower maturity levels and prediction of flower production, which is essential for effective farm management [17,18]. Typical flower detection models include the Region-based Convolutional Neural Network (Faster R-CNN), You Only Look Once (YOLO), and the Single Shot MultiBox Detector (SSD), which generally rely on convolutional neural networks (CNNs) as the backbone for feature extraction, such as the residual network with 50 layers (ResNet-50) for Faster R-CNN and SSD, and DarkNet-19 for YOLO [19,20,21]. However, previous studies have shown that changing the backbone architecture can substantially affect model performance, depending on the distribution and characteristics of the training dataset. For this reason, it is important to test other backbones such as GoogLeNet (Inception v1), a deep CNN designed with inception modules for efficient feature extraction; AlexNet, one of the first deep CNNs successfully applied to large-scale image classification; Inception v3, an improved version of GoogLeNet that incorporates factorized convolutions; Xception, an architecture that replaces standard convolutions with depthwise separable convolutions to increase efficiency; and SqueezeNet, a lightweight CNN that achieves competitive accuracy with a small number of parameters. Exploring these detector–backbone combinations allows a comprehensive evaluation of the detection capabilities of each model on a specific dataset and offers a potential strategy to optimize decision-making in diverse cultivation scenarios [16,19,20,21,22,23]. This work explores alternative backbone architectures for marigold flower detection to determine optimal combinations for agricultural scenarios. Despite advances in AI-based agricultural detection, most studies focus on crops such as tomato, apple, pepper, rice, and kiwi, while research on marigold flower detection remains remarkably sparse [19,20]. Furthermore, most studies rely on a limited set of architectures or a single canonical structure, which limits detection accuracy under real-world conditions. This gap is especially relevant for UAV-based monitoring, where lightweight yet accurate models are essential for real-time operation.
This study presents a comparative analysis of three categories of object detection methods for drone-based marigold flower detection: (a) canonical detectors (YOLOv2, Faster R-CNN, SSD) using their original backbones; (b) modified versions of these detectors employing a unified DarkNet-53 backbone; and (c) contemporary architectures, including YOLOv11, YOLOv12, and the transformer-based RT-DETR. To this end, a database built from images recorded by a DJI Mini 3 Pro drone was used to evaluate the detection accuracy and inference speed of each object detection method. This comprehensive approach highlights the advantages and disadvantages of canonical, modified, and contemporary detectors, offering new perspectives for the development of efficient and reliable flower identification systems for precision agriculture applications.

1.1. State-of-the-Art Review

In the literature, several studies combine CV and AI algorithms for flower detection in agriculture-oriented processes. In the work proposed by Dias et al. [24], the authors developed a method for apple blossom detection using superpixel segmentation based on a CNN, principal component analysis (PCA), and feature classification. The authors classified each superpixel for blossom content using a support vector machine (SVM) classifier applied to HSV (hue, saturation, value) color histograms. The authors highlighted that the method was robust under challenging conditions, including complex backgrounds, variable illumination, and occlusions. The results show that their CNN+SVM method achieves recall and precision above 90% on the main dataset under real-world conditions. In contrast, in Wu et al. [25], the YOLOv4 model was used to perform real-time apple blossom detection. The authors compared their results with those of Faster R-CNN, Tiny-YOLO v2, YOLOv3, SSD 300, and EfficientDet-D0, with the proposed model obtaining accuracy values 5.67% higher than the other models. In the work of Chen et al. [26], the MobileNet and M-Inception networks were used as backbones in combination with the SSD model to perform disease detection in rice plants, achieving up to 99.21% accuracy on a public database and 97.89% on the work's own database. In the work proposed by Cheng and Zhang [27], the authors present a flower detection algorithm using a YOLOv4-based end-to-end anchoring method adapted for mobile devices in smart garden scenarios. The authors used a CSDarknet53 backbone that combines CSPNet and Darknet53, as well as a spatial pyramid pooling (SPP) block and a spatial attention module (SAM) to place emphasis on relevant features. The authors used a database composed of 8189 images and 102 categories of flowers. The proposed model achieved detection confidence levels between 84% and 98% across different flower classes. In the work presented by Patel [20], the author developed a deep learning method to identify and classify marigold flowers at various flowering stages (bud and fully open flower) under varying field conditions. The detection algorithm used was Faster R-CNN with a ResNet50 backbone, optimized using transfer learning based on the Common Objects in Context (COCO) database and boosted through geometric data augmentations (reflection, scaling, rotation, zoom) to increase the robustness and generalization capabilities of the detection algorithm. The dataset comprised 550 images obtained from three marigold fields in India, which were subsequently expanded to 1583 augmented images labeled with bounding boxes for the two classes. The results indicated that the proposed Faster R-CNN model achieved an mAP of 88.71% on the original dataset and 89.47% mAP with augmentation, significantly outperforming a MobileNet SSD baseline (74.30% and 78.12%, respectively). However, Faster R-CNN exhibited the slowest inference (4.31 s/image) compared to SSD (0.64 s/image). In the work conducted by Banerjee et al. [28], the authors addressed the diagnosis of various marigold leaf diseases (Alternaria leaf spot, Cercospora leaf spot, powdery mildew, bacterial blight, leaf curl, rust, anthracnose, leaf miner damage) by creating a hybrid deep learning model that integrates a CNN and an SVM. The dataset consisted of 2700 images (270 per class) and nine disease categories, including a healthy leaf class. The images were preprocessed to 64 × 64 pixels.
The CNN design included six convolutional layers, six max-pooling layers, and two fully connected layers with dropout and L2 regularization, to which an SVM classifier was added to optimize decision boundaries and minimize misclassifications. Hyperparameter optimization (learning rate, batch size, epochs) was performed to improve performance. Experimental results demonstrated an overall accuracy of 92%, with precision, recall, and F1-scores ranging from 50% to 71.79% across classes. In the research conducted by Ma et al. [19], the authors examine the segmentation of harvesting points in marigold flowers. The authors stress how difficult the task is because the backgrounds are complicated and the flowers' postures change. To address this problem, the authors propose the lightweight SCS-YOLO-Seg model, an adapted variant of YOLOv8n-seg that integrates a StarNet backbone, a lightweight feature extraction module called C2f-Star, and a dedicated segmentation head (Seg-Marigold head) that facilitates the generation of masks for the corolla and stem. After identifying the corolla and stem, the harvesting point is located through elliptical fitting of the corolla and skeletonization of the stem. A dataset composed of 1847 field-captured marigold images was used, distributed across training, validation, and testing, with verified manual annotations. The model identifies harvest points with an accuracy of 93.36% and an average inference time of 28.66 milliseconds (ms) per image, significantly reducing the model size to 3.1 megabytes (MB) compared to YOLOv8n-seg, while maintaining high segmentation performance (mAP@0.5 of 88% for the corolla and 78% for the stem). Finally, in the work developed by Fan et al. [29], the authors address the problem of marigold flower corolla detection to improve automated harvesting by mobile robots. To solve this problem, they propose a lighter version of YOLOv7 that removes unnecessary detection layers, replaces parts of the main structure with DSConv (Depthwise Separable Convolution), changes the SPPCSPC module to a simpler SPPF module, prunes the model, and retrains it. They create a new dataset from the Xinjiang region of China that is tailored to real-life marigold growing conditions. The improved model achieves a detection accuracy of 93.9% and an mAP@0.5 of 97.7%, which is better than the original YOLOv7 model. Furthermore, they reduce the computational cost to only 2.2% of that of the original model. The model also reduces the parameters to 15.04 million (41.2% of the original model) and achieves an inference speed of 166.7 frames per second (FPS), which is 26.7% faster than the standard YOLOv7.
Based on the state-of-the-art review, it can be observed that marigold flower detection remains an open research problem. Previous research has mainly focused on flower detection or specific flower elements (e.g., flowering stages, harvest intervals, corolla), employing architectures such as Faster R-CNN, SSD, and YOLO in lightweight or outdated versions. Additionally, a rigorous evaluation of alternative feature extractors (backbones) across various detection frameworks is still needed and has not been sufficiently explored to date. Furthermore, the capabilities of current, advanced models such as YOLOv11 and YOLOv12, along with transformer-based detectors such as RT-DETR, still need to be investigated in marigold detection applications. These gaps are the main motivation of our work, and the main contributions are summarized below.

1.2. Main Contributions

  • A unique marigold flower dataset was built from 392 drone images captured with a DJI Mini 3 Pro, manually annotated, and augmented to represent the final blooming stage under real field conditions.
  • Three groups of detectors were evaluated: canonical models (YOLOv2, Faster R-CNN, SSD with their standard backbones), modified versions using DarkNet-53 as a common backbone, and modern detectors (YOLOv11, YOLOv12, RT-DETR).
  • An experimental work was carried out on the modified detectors by varying optimizers, learning rates, and training epochs, analyzing their influence on model stability, underfitting, and overfitting.
  • Model performance was assessed through mAP@0.5, precision–recall curves, and F1-scores, and complemented with an analysis of model complexity and inference speed to evaluate efficiency in drone-based applications.

2. Materials and Methods

This study evaluates DL object detection architectures for detecting marigold flowers in the final blooming stage, using a database of images collected with a DJI Mini 3 Pro (DJI, Shenzhen, China) drone. The overall methodology is summarized in Figure 1. The process consists of three main stages: (i) dataset acquisition and augmentation; (ii) training and evaluation of object detection models; and (iii) performance analysis and comparison. Figure 1 specifically illustrates the modified configurations, where DarkNet53 was used as a common backbone for YOLOv2, SSD, and Faster R-CNN. In parallel, the canonical versions of these detectors (with their original backbones, such as ResNet and Darknet-19) were also evaluated, and modern architectures, including YOLOv11, YOLOv12, and RT-DETR, were included as reference baselines. Results from training, validation, and testing are reported in terms of mean average precision calculated with an IoU (Intersection over Union) threshold of 0.5 (mAP@0.5), per-class PR curves, F1-scores, inference speed, and model complexity, together with visual examples of detections.
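Since mAP@0.5 is the central metric throughout this work, the sketch below illustrates the underlying IoU criterion: a predicted box counts as a true positive only if its overlap with a ground-truth box reaches 0.5. This is a minimal illustration and not the authors' evaluation code; the box coordinates in the example are arbitrary.

```python
# Minimal sketch: IoU between two axis-aligned boxes in [x1, y1, x2, y2] format.
# A detection contributes to mAP@0.5 only when IoU with a ground-truth box >= 0.5.

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted marigold box vs. a ground-truth box
print(iou([10, 10, 60, 60], [30, 30, 80, 80]))  # ~0.22 -> rejected at IoU 0.5
```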

2.1. Data Acquisition

This stage consisted of constructing and manually labeling a dataset for detecting marigold plants in their last flowering stage, using images captured by a DJI Mini 3 Pro drone in a crop field. The images were collected in September 2025 at Hacienda San Agustín de Callo, Lasso, Ecuador, at an altitude of approximately 2800 m, under controlled lighting conditions (between 2000 and 5000 lux) within a marigold flower greenhouse. The final flowering stage was chosen because the initial, intermediate, and final flowering stages present significant differences in visual characteristics, as illustrated in Figure 2.
In other words, the last flowering stage is the one that best characterizes and represents marigold flowers, given their size, color, and shape, which facilitates more accurate detection by CNNs. In addition, detecting this last stage makes it possible to provide information on the ideal time for harvesting. The dataset was divided into training, validation, and testing subsets, as summarized in Table 1. The training set was used to train the Faster R-CNN, YOLOv2, and SSD models, while the validation set was used to calibrate the hyperparameters. The test set was employed to evaluate the generalization capacity of the models on unseen data, allowing the identification of the optimal algorithm.
The dataset comprised 392 original images of marigold crops from various angles, with an initial resolution of 1280 × 720 pixels. To increase variability and reduce overfitting, data augmentation techniques were applied exclusively to the training set, as detailed in Table 2.
These included scaling, rotation, and adjustments in brightness and saturation. Scaling involved resizing images to 600 × 600 pixels, while rotation adjusted the angle between 0.4° and 9° clockwise. Brightness and saturation adjustments simulated changes in illumination conditions. These augmentations were selected to emulate realistic UAV flight conditions, such as changes in altitude, perspective, and lighting variability. The augmented dataset distribution is shown in Table 2, and examples of the transformations are illustrated in Figure 3.
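As an illustration of the augmentations listed above, the sketch below applies resizing to 600 × 600 pixels, a small clockwise rotation in the 0.4°–9° range, and brightness/saturation jitter. This is not the authors' pipeline: the jitter factors (0.8–1.2) and the file name are assumptions, and the corresponding update of bounding-box coordinates after resizing and rotation is omitted for brevity.

```python
import random
from PIL import Image
import torchvision.transforms.functional as TF

def augment(image: Image.Image) -> Image.Image:
    image = TF.resize(image, [600, 600])                            # scaling to 600x600
    angle = random.uniform(0.4, 9.0)                                # 0.4-9 degrees
    image = TF.rotate(image, -angle)                                # negative angle = clockwise
    image = TF.adjust_brightness(image, random.uniform(0.8, 1.2))   # assumed illumination jitter
    image = TF.adjust_saturation(image, random.uniform(0.8, 1.2))   # assumed saturation jitter
    return image

augmented = augment(Image.open("marigold_0001.jpg"))  # hypothetical file name
```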

2.2. RGB Versus Hyperspectral Imagery

This work was conducted exclusively with RGB images captured by a DJI Mini 3 Pro drone, which represents the most accessible and practical data source for UAV-based agricultural monitoring. RGB imagery is widely used in object detection tasks because of its low cost, high availability, and compatibility with deep learning frameworks. However, hyperspectral imagery provides finer spectral resolution, allowing the discrimination of subtle physiological traits such as water stress, pigment concentration, or nutrient variability, which may increase detection accuracy in agricultural contexts. Previous research has shown that hyperspectral sensors can record fine-grained spectral signatures outside of the visible range, providing supplementary information that RGB images are unable to provide [30,31]. The lack of hyperspectral data is acknowledged as a limitation of this work; integrating hyperspectral UAV imagery could be investigated in future research to increase detection robustness under variable field conditions.

2.3. Dataset Split Strategy

The dataset of 392 original images (expanded through augmentation) was divided into training, validation, and testing subsets. A random split was applied to avoid selection bias, following a proportion of 80% for training, 10% for validation, and 10% for testing, as summarized in Table 1. This strategy allowed most of the data to be allocated for model learning, while still preserving independent subsets for hyperparameter adjustment and unbiased performance evaluation. We considered using cross-validation, but with such a limited dataset, it would have left very few images for training in each fold, and, in addition, the high computational cost of retraining deep learning models multiple times would have made the process unfeasible. For this reason, we chose a simple random split, which ensured that the three subsets remained balanced and representative without reducing the training data even further.
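A minimal sketch of the random 80/10/10 split described above is shown below. It assumes the split is performed at the image level with a fixed seed for reproducibility; the exact procedure and seed used by the authors are not reported, so this is illustrative only.

```python
import random

def split_dataset(image_paths, seed=42):
    """Randomly split image paths into 80% train, 10% validation, 10% test."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)       # fixed seed to avoid selection bias
    n = len(paths)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]
    return train, val, test

# Hypothetical file names for the 392 original images
train, val, test = split_dataset([f"img_{i:04d}.jpg" for i in range(392)])
print(len(train), len(val), len(test))  # 313 39 40
```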

2.4. Marigold Detection Using CNN-Based Object Detection

In this work, three representative object detection models were selected: Faster R-CNN, YOLOv2, and SSD. These detectors were first evaluated in their canonical configurations, i.e., using the backbones originally associated with each architecture (ResNet-50 for Faster R-CNN and SSD, and Darknet-19 for YOLOv2). In addition, a modified configuration was tested, where the three classical detectors were coupled with DarkNet-53 as a common backbone, to analyze the effect of a modern residual architecture under the same training conditions. Finally, for comparison, we also included modern state-of-the-art detectors (YOLOv11, YOLOv12, and RT-DETR), evaluated with their default backbones and training pipelines as provided in the Ultralytics framework. Throughout this paper, the term canonical refers exclusively to the classical detectors (YOLOv2, SSD, and Faster R-CNN) with their original backbones, modern refers to the latest architectures (YOLOv11, YOLOv12, RT-DETR), and modified denotes the canonical detectors when their original backbones were replaced by DarkNet-53. To ensure comparability, all detectors were initialized with pre-trained weights from large-scale datasets (ImageNet for backbones, COCO for Ultralytics models) and then fine-tuned on the marigold dataset.

2.4.1. Canonical CNN Detectors

This subsection describes the canonical object detectors used as baselines in this work, namely YOLOv2, Faster R-CNN, and SSD. Each model is briefly summarized in terms of its detection strategy, backbone, and typical applications in agricultural vision.
YOLOv2
YOLOv2 is a one-stage object detection model that integrates the backbone, bounding box prediction, and class probability estimation into a single pipeline. Unlike two-stage detectors such as Faster R-CNN, YOLOv2 divides the input image into an N × N grid, where each cell predicts multiple bounding boxes using anchors of different scales and aspect ratios. Non-maximum suppression is then applied to remove redundant detections and improve accuracy [32,33,34]. In its canonical configuration, YOLOv2 employs DarkNet-19 as the backbone, which balances performance and computational speed [35]. In this work, we additionally tested a modified configuration in which YOLOv2 was coupled with DarkNet-53 as the backbone. This adjustment was motivated by the strong representational capacity of DarkNet-53, which captures both low-level information (edges, shapes) and high-level semantic features (textures, patterns) [32,36,37]. YOLOv2 belongs to the YOLO family of detectors, which has evolved through multiple versions (YOLOv1–YOLOv12) with progressive improvements in accuracy, speed, and architectural design. While newer variants such as YOLOv5–YOLOv12 achieve state-of-the-art performance, YOLOv2 remains a representative reference point among classical one-stage detectors, and is therefore included in this work both in its canonical form (DarkNet-19) and in a modified version with DarkNet-53. This choice allows for a direct comparison with the modern detectors (YOLOv11, YOLOv12, RT-DETR) evaluated later in this work.
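The non-maximum suppression step mentioned above can be summarized as follows: keep the highest-scoring box and discard any remaining box that overlaps it beyond a threshold. The sketch below is a generic greedy implementation for illustration, not the exact routine used by the detectors in this study.

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression. boxes: list of [x1, y1, x2, y2]; scores: confidences."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                 # highest-scoring remaining box
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep                             # indices of retained detections
```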
Faster R-CNN
The Faster R-CNN model uses a detection technique based on a Region Proposal Network (RPN), while a backbone network is responsible for extracting the features used to classify and refine each detected region. The RPN consists of a CNN that establishes regions where the probability of finding the object of interest is high. It uses anchor boxes to generate multiple proposals from regions of different sizes and scales. In the detection head, the model classifies the object's category using a fully connected layer followed by a softmax layer, while another fully connected layer regresses the coordinates of the bounding box. The backbone CNN is responsible for extracting the most relevant feature maps from the marigold flower images.
In this work, Faster R-CNN was first implemented in its canonical version with ResNet-50 as the backbone, following Matlab’s recommended configuration and common practice in the literature for agricultural detection tasks. In addition, a modified version with DarkNet-53 was also tested to evaluate whether a deeper backbone could improve performance on the marigold dataset. This dual configuration allows the analysis of both the baseline accuracy of the canonical model and the potential benefits of replacing its backbone with DarkNet-53 [16,36,38].
SSD
The SSD model, unlike Faster R-CNN, does not require a region proposal stage, which results in a low inference time. Like YOLOv2, SSD performs its classification and localization processes in a single shot. Traditionally, the SSD algorithm is used with VGG16 or ResNet networks as backbones, and in Matlab's implementation, the canonical version employs ResNet-50. In this work, SSD was therefore first implemented in its canonical configuration with ResNet-50, following common practice in the literature for agricultural detection tasks. In addition, a modified configuration with DarkNet-53 was tested, since this backbone demonstrated strong performance on our marigold dataset. SSD incorporates additional convolutional layers to perform multi-scale detections, calculating predictions from feature maps at different resolutions. Like the other algorithms, SSD also uses anchor boxes with different scales and aspect ratios in each cell of the feature map, which allows the network to handle the geometric variability of objects in the dataset. The main feature of this model is that SSD performs object detection in a single stage, where class predictions and bounding boxes are obtained from multiple feature maps. This allows the network to maintain high detection speed while improving accuracy [39,40].
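To make the multi-scale anchoring idea concrete, the sketch below generates SSD-style default boxes for one feature map: each cell center receives boxes at a given scale and several aspect ratios. The scale, aspect ratios, and feature-map size are arbitrary examples, not the anchor settings used in this study (those are listed in Table 5).

```python
import math

def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return center-format boxes (cx, cy, w, h) in normalized image coordinates."""
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size   # cell center
            for ar in aspect_ratios:
                boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes

print(len(default_boxes(fmap_size=38, scale=0.1)))  # 38*38*3 = 4332 boxes for this map
```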
After describing YOLOv2, Faster R-CNN, and SSD individually, it is useful to summarize their key methodological differences. Table 3 compares the three canonical detectors in terms of detection type, methodology, and backbone. This overview provides a concise reference point before introducing the modified configurations with DarkNet-53.

2.4.2. Modern CNN Detectors

In addition to the canonical baselines, this work considered recent object detection architectures as modern references. Specifically, YOLOv11, YOLOv12, and RT-DETR were included due to their strong performance in general-purpose benchmarks and their relevance as state-of-the-art models. These detectors were evaluated with their default backbones and training pipelines as provided by the Ultralytics framework, ensuring a fair representation of their capabilities without additional modifications.
YOLOv11 and YOLOv12
YOLOv11 and YOLOv12 are the most recent models released by Ultralytics. Both belong to the YOLO family and follow the one-stage detection approach, where predictions for bounding boxes and classes are obtained in a single step. However, compared with earlier versions, they introduce several improvements that make them more accurate and efficient. YOLOv11 focuses on enhancing the backbone and bottleneck layers, which allows better extraction of image features. This results in higher accuracy while keeping the number of parameters low. In fact, YOLOv11m can reach a higher mAP on the COCO dataset with about 22% fewer parameters than YOLOv8m. Another advantage is that it was designed for flexible deployment, since it can run on edge devices, cloud platforms, or systems with NVIDIA GPUs, and it is not limited only to object detection but also supports segmentation, classification, pose estimation, and oriented bounding box (OBB) tasks. YOLOv12 goes one step further and adds new mechanisms to deal with larger receptive fields and attention. It introduces area attention, which divides the feature map into equal regions to reduce the cost of computation, and R-ELAN, a residual aggregation module that improves training stability. It also optimizes the attention layers using FlashAttention and replaces standard positional encodings with a 7 × 7 separable convolution that works as a positional perceptron. These changes make the model lighter and faster, but at the same time able to reach high precision in different vision tasks. In this work, both YOLOv11 and YOLOv12 were used with their default Ultralytics configurations. They serve as modern baselines and allow us to compare how the latest YOLO architectures perform against the canonical detectors and their modified versions with DarkNet-53 [41,42].
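For reference, fine-tuning these models with the Ultralytics Python API typically follows the pattern below. The weight file, dataset YAML name, and image size are illustrative assumptions rather than the authors' exact configuration; only the epoch budgets (15 and 30) come from this study.

```python
from ultralytics import YOLO

# Pretrained COCO weights; "yolo12n.pt" would be the analogous YOLOv12 checkpoint.
model = YOLO("yolo11n.pt")

results = model.train(
    data="marigold.yaml",   # hypothetical dataset config (train/val paths, class names)
    epochs=30,              # the study used 15- and 30-epoch budgets
    imgsz=640,              # assumed input resolution
)
metrics = model.val()       # reports mAP@0.5, among other metrics
```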
RT-DETR
The RT-DETR is a transformer-based object detection model designed to provide accurate results at real-time speed. Unlike CNN detectors, RT-DETR does not require a non-maximum suppression stage, since its architecture directly outputs final bounding boxes and confidence scores. The model processes multi-scale features through a hybrid encoder, which combines intra-scale interactions with cross-scale fusion modules, and initializes its object queries using an IoU-based selection strategy. This structure allows the detector to remain efficient while preserving accuracy. In our experiments, RT-DETR was implemented with ResNet-50 as the backbone following the Ultralytics framework. This version served as a modern baseline to compare against both canonical and modified detectors, showing how transformer-based designs can be adapted to drone-based agricultural monitoring [43].
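An analogous hedged sketch for RT-DETR in the Ultralytics framework is shown below; the weight file and training arguments are again illustrative assumptions. Note that, as described above, no non-maximum suppression step is needed at inference time.

```python
from ultralytics import RTDETR

model = RTDETR("rtdetr-l.pt")   # RT-DETR variant with a ResNet-50-based backbone
model.train(data="marigold.yaml", epochs=30, imgsz=640)  # same assumed dataset config
```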

2.4.3. Backbone Networks

In object detection, the backbone acts as the feature extractor that transforms the input image into representations used by the detector. In this work, two main backbones were used: ResNet-50 and DarkNet-53. ResNet-50 is a widely used option in detection frameworks and appears in the canonical versions of Faster R-CNN, SSD, and RT-DETR. In the case of Faster R-CNN and SSD, this choice follows Matlab's standard implementation, since their original backbones (ZFNet, VGG) are no longer supported in the current toolbox. Its residual blocks also allow deeper architectures to be trained without major degradation problems. DarkNet-53, first applied in YOLOv3, was used here to build the modified versions of YOLOv2, Faster R-CNN, and SSD. The idea was to evaluate whether replacing the original backbones with this more recent residual network could improve performance when training under the same conditions.
ResNet-50 as Backbone
ResNet-50 is a deep convolutional network that applies the idea of residual learning through bottleneck blocks, which makes it possible to train models with greater depth without suffering from vanishing gradients. The architecture has about 25.6 million parameters and usually works with input images of 224 × 224 × 3. It begins with a 7 × 7 convolution and a 3 × 3 max-pooling layer, followed by four stages that stack residual blocks. Each block reduces the dimensionality with a 1 × 1 convolution, processes features with a 3 × 3 convolution, and then restores the dimensionality with another 1 × 1 layer, always combined with batch normalization and rectified linear unit (ReLU) activations. Thanks to the shortcut connections, gradients can propagate directly, which facilitates optimization in very deep networks. After the convolutional stages, the model applies global average pooling and a final classification layer. Because of this design, ResNet-50 provides a good balance between accuracy and efficiency and has become one of the most common backbones for object detection frameworks such as Faster R-CNN, SSD, and YOLO [34,44,45].
DarkNet-53 as Backbone
DarkNet-53 is a convolutional neural network first used as the backbone of YOLOv3, where it showed a good compromise between accuracy and processing speed [46]. The network is built with 53 convolutional layers and residual connections, adding up to more than 40 million parameters. Thanks to its residual blocks, it allows stable gradient propagation and avoids the performance drop that usually appears in very deep models. This backbone has also been used in several agricultural computer vision studies. For instance, the review titled “Fruit sizing using AI: A review of methods and challenges” reports that YOLOv3 with DarkNet-53 is often applied in fruit detection tasks, mainly to handle variations in scale, light conditions, and background complexity [47]. In another work, “Fruits Classification and Detection Application Using Deep Convolutional Neural Network”, the same backbone was tested for fruit recognition in real cultivation environments, showing consistent performance [48]. These examples indicate that DarkNet-53 is not limited to YOLO alone but can also be used as a solid alternative in agricultural applications.
In our experiments, DarkNet-53 was applied only in the modified versions of YOLOv2, Faster R-CNN, and SSD, while the canonical detectors kept their original backbones (DarkNet-19 for YOLOv2 and ResNet-50 for SSD and Faster R-CNN), and modern architectures such as YOLOv11 and RT-DETR were evaluated with their default implementations. The idea was to test whether a more recent residual backbone could improve the classical models when trained with the same dataset and under identical conditions. Using DarkNet-53 in the modified configurations also reduced the variability caused by having different backbones, making the comparison between detectors more consistent. Finally, these modified versions were also used in the hyperparameter analysis, where different optimizers and learning rates were tested, as explained in the next subsection. This way, the evaluation of DarkNet-53 was not only structural but also connected with the training setup.
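The general backbone-swap pattern can be illustrated with the torchvision detection API. The modified detectors in this study were implemented in MATLAB with DarkNet-53, which is not bundled with torchvision, so the sketch below uses MobileNetV2 features purely to show how any feature extractor exposing an out_channels attribute can be plugged into Faster R-CNN; it is not the authors' implementation.

```python
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# Any feature extractor exposing `out_channels` can serve as the backbone;
# a DarkNet-53 module could be substituted here if available.
backbone = torchvision.models.mobilenet_v2(weights="DEFAULT").features
backbone.out_channels = 1280

anchor_generator = AnchorGenerator(
    sizes=((32, 64, 128, 256, 512),),
    aspect_ratios=((0.5, 1.0, 2.0),),
)
roi_pooler = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)

model = FasterRCNN(
    backbone,
    num_classes=2,                        # marigold + background
    rpn_anchor_generator=anchor_generator,
    box_roi_pool=roi_pooler,
)
```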

2.4.4. Canonical and Modern CNNs Configurations

To establish a fair baseline, three classical detectors were first trained and evaluated in their canonical configurations, using the backbones commonly adopted in the literature and recommended in Matlab’s version 2025b implementations. Specifically, YOLOv2 was implemented with DarkNet-19, while both SSD and Faster R-CNN were implemented with ResNet-50 as backbones. These canonical settings have been widely applied in agricultural computer vision tasks and therefore provide an appropriate benchmark for comparison.
Table 4 summarizes the object detection models considered in this work, including the canonical detectors with their default backbones, as well as the modern baselines (YOLOv11, YOLOv12, and RT-DETR) evaluated with their standard configurations. While the modern detectors are not part of the canonical group, they are included here as state-of-the-art references to contextualize the comparative analysis.
In addition, each canonical model was trained using the anchor box sizes defined in the original literature and Matlab implementations. These anchors were not recalculated but adopted as default values to ensure reproducibility and consistency with previous studies. Canonical and modified detectors (Faster R-CNN, SSD, YOLOv2) were implemented and trained in MATLAB, while modern detectors (YOLOv11, YOLOv12, RT-DETR) were implemented using Ultralytics. Table 5 lists the anchor box dimensions employed for each model, including the default anchors for YOLOv2, SSD, and Faster R-CNN, and the dataset-specific or dynamic strategies applied in YOLOv11/12 and RT-DETR.

2.4.5. Hyperparameter Selection

The selection of hyperparameters in this work was based on both previous work and direct testing with the marigold dataset. As shown in Table 6, the canonical models were trained with optimizers and learning rates reported in the original literature and Matlab’s implementations, while the modern detectors (YOLOv11, YOLOv12, and RT-DETR) were also trained using their standard configurations provided in the Ultralytics framework. In both canonical and modified versions, we evaluated two epoch limits (15 and 30) to analyze the stability of training.
The main hyperparameter exploration was carried out on the modified versions of YOLOv2, SSD, and Faster R-CNN, where we tested three optimizers: Stochastic Gradient Descent with Momentum (SGDM), Adaptive Moment Estimation (Adam), and Root Mean Square Propagation (RMSProp), and three learning rates (0.001, 0.0001, 0.0005). The specific experimental configurations are summarized in Table 7. SGDM with momentum (0.9) was included as the baseline, since it is the same optimizer originally applied in Faster R-CNN, SSD, and YOLOv2 [35,49,50]. Adam was also tested because it usually provides stable training when the dataset is small, as it adapts the learning rate for each parameter and often speeds up convergence [51]. RMSProp was considered because it tends to improve stability when the data show changes in object size or illumination [52], which is the case for our marigold images. This setup allowed us to keep the canonical and modern detectors faithful to their standard training conditions, while at the same time performing a controlled sensitivity analysis on the modified versions. In this way, we were able to evaluate the impact of different optimizers and learning rates without altering the reproducibility of the baseline models. Testing multiple hyperparameter configurations in the modified detectors also helped us identify cases of overfitting and underfitting, strengthening the reliability of the reported results [53].
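The dimensions of the exploration described above (three optimizers, three learning rates, two epoch budgets) can be enumerated as in the sketch below. The actual set of runs is defined by Table 7 and was executed with the MATLAB training routines, so the full cross product and the placeholder training call shown here are assumptions used only to make the sweep explicit.

```python
from itertools import product

optimizers = ["sgdm", "adam", "rmsprop"]        # SGDM (momentum 0.9), Adam, RMSProp
learning_rates = [0.001, 0.0001, 0.0005]
epoch_budgets = [15, 30]

configs = list(product(optimizers, learning_rates, epoch_budgets))
print(len(configs))                             # 18 combinations per modified detector

for opt, lr, epochs in configs:
    # train_modified_detector(...) would be the training routine; printed here instead.
    print(f"training with optimizer={opt}, lr={lr}, epochs={epochs}")
```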

3. Results

This section presents the results for marigold flower detection at the final flowering stage using three groups of models: (i) canonical detectors (YOLOv2, Faster R-CNN, and SSD trained with their original backbones); (ii) modern state-of-the-art detectors (YOLOv11, YOLOv12, and RT-DETR evaluated with their default implementations); and (iii) modified versions of YOLOv2, Faster R-CNN, and SSD coupled with DarkNet-53 as backbone. Training, validation, and testing outcomes are analyzed in terms of mAP@0.5, PR curves, F1-score, inference speed, and model complexity. The experiments were conducted under the global hyperparameter settings for canonical and modern detectors summarized in Table 6, while the extended configurations defined in Table 7 were applied only to the modified detectors. Results are organized accordingly: Section 3.1 reports the canonical and modern detectors as baselines, Section 3.2 presents the modified configurations with DarkNet-53, Section 3.3 analyzes model complexity and inference speed, and Section 3.4 provides the comparative analysis between modified and modern detectors.

3.1. Canonical and Modern CNN Detector Results

The canonical and modern detectors were first evaluated under matched training budgets of 15 and 30 epochs, using a base learning rate of 0.001 and SGD as optimizer to ensure comparability. Table 8 reports the corresponding mAP@0.5 values for training, validation, and test sets. Among the canonical models, Faster R-CNN improved with longer training, reaching 52.1% test mAP at 30 epochs. SSD, however, performed better with shorter training (40.3% at 15 epochs vs. 38.2% at 30). YOLOv2 also benefited from extended training, increasing from 39.1% to 62.9% in test mAP. In contrast, modern detectors consistently achieved substantially higher scores: YOLOv11 obtained up to 96.7% at 15 epochs (96.5% at 30), YOLOv12 reached 96.5% at 15 epochs (96.3% at 30), and RT-DETR scaled from 83.1% at 15 epochs to 93.5% at 30.

3.2. Modified CNN Detector Results with DarkNet-53

The modified versions of YOLOv2, Faster R-CNN, and SSD were trained using DarkNet-53 as a shared backbone. The extended hyperparameter settings are summarized in Table 7. Table 9 and Table 10 report the mAP@0.5 values for training, validation, and testing across the experimental runs. The results varied depending on the optimizer, learning rate, and number of epochs, which indicates the sensitivity of these detectors to the training setup.
Figure 4 shows PR curves for the best-performing configurations of the modified detectors: (a) Faster R-CNN (Test 11), (b) YOLOv2 (Test 13), and (c) SSD (Test 2). Each row presents training (left), validation (center), and testing (right) results. The curves illustrate the trade-off between precision and recall across thresholds, with the corresponding mAP@0.5 values indicated in the plot titles, and in parallel, Figure 5 presents qualitative comparisons of the same configurations so that the visual results can be directly related to the numerical findings.
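For readers unfamiliar with how a PR curve is summarized into a single number, the sketch below computes average precision with all-point interpolation, one common convention; the exact protocol used by the authors' MATLAB and Ultralytics tools may differ, and the recall/precision values in the example are arbitrary.

```python
def average_precision(recalls, precisions):
    """AP via all-point interpolation; recalls must be sorted ascending."""
    # Make precision monotonically non-increasing from right to left.
    interp = list(precisions)
    for i in range(len(interp) - 2, -1, -1):
        interp[i] = max(interp[i], interp[i + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, interp):
        ap += (r - prev_r) * p      # area under the interpolated PR curve
        prev_r = r
    return ap

print(average_precision([0.2, 0.4, 0.6, 0.8, 1.0],
                        [1.0, 0.9, 0.8, 0.5, 0.4]))  # 0.72
```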

3.3. Model Complexity and Inference Speed

In addition to detection accuracy, model complexity and inference speed are critical factors for practical deployment in drone-based agricultural monitoring. Table 11 summarizes the number of parameters, model size, and inference time per image, measured on an NVIDIA RTX 4070 Ti GPU with a batch size of 1. The results show that Faster R-CNN has the largest complexity (33 million parameters, 117 MB), with an average inference time above 100 ms per image, making it unsuitable for real-time applications. SSD is considerably lighter (11 million parameters, 40 MB) and faster (21 ms/img), although its detection accuracy was lower compared to the other models. YOLOv2 with DarkNet-19 is even lighter (9.4 million parameters, 33 MB) and achieves the fastest inference speed (8.6 ms/img), although its final performance depends strongly on the chosen backbone.
Among the modern detectors, YOLOv11 and YOLOv12 stand out for their very compact architectures (2.6 million parameters, 5 MB) while still achieving high accuracy. Their inference times, between 15 and 20 ms/img, make them suitable for real-time applications. These results indicate that the latest YOLO versions provide a good balance between accuracy and efficiency, which is particularly relevant for onboard drone deployment. RT-DETR, while accurate (mAP above 90%), has a larger computational footprint (32 million parameters, 63 MB, and 28 ms/img), which may limit its use in lightweight platforms.
It is important to note that Table 11 only reports canonical and modern detectors. The modified versions of YOLOv2, Faster R-CNN, and SSD with DarkNet-53 were not included, since their purpose was to analyze the effect of a common backbone on detection performance rather than computational efficiency. Their implications for real-world feasibility are discussed in Section 4. Reporting the number of parameters and inference speed highlights not only the accuracy of the models but also their realistic deployability in the field.
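The complexity and latency figures above can be reproduced with a simple profiling routine such as the PyTorch sketch below (batch size 1, GPU warm-up, then timed runs). The authors' measurements were made with their own MATLAB/Ultralytics tooling, so the input resolution, warm-up count, and number of timed runs here are assumptions; a CUDA-capable GPU is also assumed.

```python
import time
import torch

def profile(model, input_size=(1, 3, 640, 640), runs=100, device="cuda"):
    """Return (parameter count, average inference time in ms per image)."""
    model = model.to(device).eval()
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(10):                  # warm-up iterations (not timed)
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return n_params, elapsed * 1000 / runs   # parameters, ms per image
```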

3.4. Comparative Analysis of Modified and Modern Detectors

To better understand the contribution of the proposed modifications, this subsection directly compares the best configurations obtained with DarkNet-53 against the modern detectors (YOLOv11, YOLOv12, and RT-DETR). The comparison considers both mAP and F1-scores, as well as qualitative detection examples. These results highlight whether the modified versions can reach levels of accuracy comparable to recent state-of-the-art models, while also clarifying their relative strengths and limitations.
Table 12 compares canonical, modified, and modern detectors in a single view. With this table, it is possible to see first the improvement obtained when using DarkNet-53 as a common backbone, and then how these modified versions stand against the current state-of-the-art. YOLOv2 shows the clearest gain, reaching 98.8% compared to 62.9% with its canonical backbone. SSD also increases notably (63.1% vs. 40.3%), while Faster R-CNN changes only slightly (52.8% vs. 52.1%). This confirms that the backbone modification produced consistent gains, enabling a fair comparison with modern detectors.
Table 13 adds the F1-scores, showing that YOLOv2 combined with DarkNet-53 achieved the best balance between precision and recall (97.9%), while SSD and Faster R-CNN stayed much lower. This table also makes it possible to compare the modified detectors directly with the current state-of-the-art under the same F1 metric, highlighting that the proposed changes allow YOLOv2 to reach values on par with modern models. Overall, these results offer a clearer understanding of model behavior and strengthen the robustness of the reported findings. Table 14 shows the comparison of best detection results and inference speed between this work and other studies.
Finally, Figure 5 and Figure 6 together provide qualitative comparisons between the modified and modern detectors. This visual evidence reinforces the numerical findings, showing that YOLOv2 with DarkNet-53 produced the most accurate detections, while SSD and Faster R-CNN offered more moderate performance. Modern detectors also achieved strong results, though with differences in computational efficiency that are further discussed in Section 4.

4. Discussion

The results demonstrate clear performance differences between canonical, modified, and modern detectors. In their canonical configurations, Faster R-CNN and SSD achieved limited performance (52.1% and 40.3% test mAP, respectively), while YOLOv2 with DarkNet-19 reached 62.9%. These results confirm that classical models with their original backbones are highly sensitive to dataset size and variability, as also reported in previous agricultural detection studies. Replacing the canonical backbones with DarkNet-53 had a decisive impact. SSD improved notably, rising from 40.3% to 63.1% when trained with DarkNet-53, which provided a more stable backbone. Faster R-CNN, in contrast, showed only a slight gain (52.8%), reflecting the inherent limits of its two-stage design and high computational cost. The most significant change was observed in YOLOv2: with DarkNet-53, it reached 98.8% mAP and 97.9% F1-score, surpassing modern detectors. This shows that backbone adaptation can transform a conventional model into one that matches or even outperforms recent architectures in terms of both accuracy and F1-score.
Modern architectures also showed strong and consistent performance. YOLOv11 and YOLOv12 reached 96–97% mAP with F1-scores around 93%, while RT-DETR achieved 93.5% mAP and 89.5% F1-score. Although they did not exceed the absolute accuracy of YOLOv2 with DarkNet-53, they remain more efficient: YOLOv11 and YOLOv12 required less than 20 ms per image and less than 6 MB in size, while Faster R-CNN exceeded 100 ms per image and RT-DETR required 63 MB and 28 ms. This confirms that modern detectors are better suited for deployment on UAV platforms where computational resources are constrained, whereas YOLOv2 with DarkNet-53, despite its high accuracy, has a heavier architecture not represented in the efficiency table. The ablation with optimizers clarified how training stability varied across configurations. Adam provided the most consistent convergence with small datasets, while RMSProp adapted better to changes in illumination and object size. Results from SGDM, the baseline, were balanced but less adaptable. These characteristics account for some results: SSD (Test 3) and Faster R-CNN (Test 10) showed overfitting, with excellent training scores but poor test performance, whereas YOLOv2 in Test 5 manifested clear underfitting (19% training accuracy). F1-scores and PR curves validated these patterns, showing how the optimizer–backbone combination directly influenced the balance between recall and accuracy.
This work also presents some limitations. The dataset was built exclusively from drone flights in marigold fields, making it unique and highly valuable for agricultural detection. The main limitation lies in the number of images collected (392 originals), which restricts the diversity of conditions represented during training and testing. Expanding the dataset with additional field campaigns would improve robustness and allow broader generalization to other scenarios. In summary, the experiments confirm that both backbone adaptation and careful hyperparameter selection directly influence detection accuracy. YOLOv11 and YOLOv12 provided the best balance between accuracy, speed, and model size, while YOLOv2 with DarkNet-53 achieved state-of-the-art performance, surpassing contemporary detectors in accuracy. These findings demonstrate that both lightweight modern architectures and classical models with updated backbones are strong candidates for drone-based monitoring of flowering crops.

5. Conclusions

In this work, we evaluated the detection of marigold flowers in their last flowering stage using three groups of models: canonical detectors, modified versions with DarkNet-53, and recent architectures such as YOLOv11, YOLOv12, and RT-DETR. The dataset was built from 392 images captured with a DJI Mini 3 Pro drone and expanded with data augmentation up to 940 images. The experiments showed that YOLOv2 with DarkNet-53 reached the best performance, with 98.8% mAP and 97.9% F1-score, while SSD and Faster R-CNN also improved compared to their canonical versions. At the same time, the modern models, especially YOLOv11 and YOLOv12, achieved high accuracy (around 96%) with very small sizes (about 5 MB) and fast inference times (under 20 ms), which makes them attractive for drone applications. These results confirm that convolutional detectors can be used to characterize marigold flowers at the final stage of blooming. The work was limited by the number of field images available, which restricted the variability of training conditions. Expanding the dataset in future work would help to improve model robustness and generalization.

Author Contributions

Conceptualization, I.N.V., V.M. and J.P.V.; Data curation, P.V. and J.P.V.; Formal analysis, P.V., I.N.V., V.M. and J.P.V.; Investigation, P.V., I.N.V., V.M. and J.P.V.; Methodology, P.V., I.N.V., V.M. and J.P.V.; Resources, I.N.V., A.J.P. and J.P.V.; Software, P.V. and J.P.V.; Supervision, J.P.V.; Validation, P.V. and J.P.V.; Visualization, P.V. and J.P.V.; Writing—original draft, P.V., I.N.V. and J.P.V.; Writing—review & editing, P.V., I.N.V., A.J.P., V.M. and J.P.V. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by ANID (National Research and Development Agency of Chile) under Fondecyt Iniciación 2024 Grant 11240105, Grant 11230962, and Fondecyt Postdoctorado N°3250059. This work was funded by ANID—Millennium Science Initiative Program—NCN2024_047.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bayraktar, E.; Basarkan, M.E.; Celebi, N. A low-cost UAV framework towards ornamental plant detection and counting in the wild. ISPRS J. Photogramm. Remote Sens. 2020, 167, 1–11. [Google Scholar] [CrossRef]
  2. Schnalke, M.; Funk, J.; Wagner, A. Bridging technology and ecology: Enhancing applicability of deep learning and UAV-based flower recognition. Front. Plant Sci. 2025, 16, 1498913. [Google Scholar] [CrossRef] [PubMed]
  3. Gallmann, J.; Schüpbach, B.; Jacot, K.; Albrecht, M.; Winizki, J.; Kirchgessner, N.; Aasen, H. Flower mapping in grasslands with drones and deep learning. Front. Plant Sci. 2022, 12, 774965. [Google Scholar] [CrossRef] [PubMed]
  4. Sângeorzan, D.D.; Păcurar, F.; Reif, A.; Weinacker, H.; Rușdea, E.; Vaida, I.; Rotar, I. Detection and quantification of Arnica montana L. inflorescences in grassland ecosystems using convolutional neural networks and drone-based remote sensing. Remote Sens. 2024, 16, 2012. [Google Scholar] [CrossRef]
  5. Vasconez, J.P.; Kantor, G.A.; Cheein, F.A.A. Human–robot interaction in agriculture: A survey and current challenges. Biosyst. Eng. 2019, 179, 35–48. [Google Scholar] [CrossRef]
  6. Usha, V.; Sathya, V.; Kujani, T.; Anitha, T.; Priya, S.S.; Abhinash, N.C. Diagnosing Floral Diseases Automatically using Deep Convolutional Neural Nets. In Proceedings of the 2024 2nd International Conference on Advances in Computation, Communication and Information Technology (ICAICCIT), Faridabad, India, 28–29 November 2024; Volume 1, pp. 491–495. [Google Scholar]
  7. Zhang, C.; Sun, X.; Xuan, S.; Zhang, J.; Zhang, D.; Yuan, X.; Fan, X.; Suo, X. Monitoring of Broccoli Flower Head Development in Fields Using Drone Imagery and Deep Learning Methods. Agronomy 2024, 14, 2496. [Google Scholar] [CrossRef]
  8. Moya, V.; Quito, A.; Pilco, A.; Vásconez, J.P.; Vargas, C. Crop Detection and Maturity Classification Using a YOLOv5-Based Image Analysis. Emerg. Sci. J. 2024, 8, 496–512. [Google Scholar] [CrossRef]
  9. Vasconez, J.P.; Salvo, J.; Auat, F. Toward Semantic Action Recognition for Avocado Harvesting Process based on Single Shot MultiBox Detector. In Proceedings of the 2018 IEEE International Conference on Automation/XXIII Congress of the Chilean Association of Automatic Control (ICA-ACCA), Concepcion, Chile, 17–19 October 2018; pp. 1–6. [Google Scholar] [CrossRef]
  10. Wakchaure, M.; Patle, B.; Mahindrakar, A. Application of AI techniques and robotics in agriculture: A review. Artif. Intell. Life Sci. 2023, 3, 100057. [Google Scholar] [CrossRef]
  11. Pilco, A.; Moya, V.; Quito, A.; Vásconez, J.P.; Limaico, M. Image Processing-Based System for Apple Sorting. J. Image Graph. 2024, 12, 362–371. [Google Scholar] [CrossRef]
  12. Wu, B.; Zhang, M.; Zeng, H.; Tian, F.; Potgieter, A.B.; Qin, X.; Yan, N.; Chang, S.; Zhao, Y.; Dong, Q.; et al. Challenges and opportunities in remote sensing-based crop monitoring: A review. Natl. Sci. Rev. 2023, 10, nwac290. [Google Scholar] [CrossRef]
  13. Moya, V.; Espinosa, V.; Chávez, D.; Leica, P.; Camacho, O. Trajectory tracking for quadcopter’s formation with two control strategies. In Proceedings of the 2016 IEEE Ecuador Technical Chapters Meeting (ETCM), Guayaquil, Ecuador, 12–14 October 2016; pp. 1–6. [Google Scholar] [CrossRef]
  14. Tian, H.; Wang, T.; Liu, Y.; Qiao, X.; Li, Y. Computer vision technology in agricultural automation—A review. Inf. Process. Agric. 2020, 7, 1–19. [Google Scholar] [CrossRef]
  15. Peña, S.; Pilco, A.; Moya, V.; Chamorro, W.; Vásconez, J.P.; Zuniga, J.A. Color Sorting System Using YOLOv5 for Robotic Mobile Applications. In Proceedings of the 2024 6th International Conference on Robotics and Computer Vision (ICRCV), Wuxi, China, 20–22 September 2024; pp. 1–5. [Google Scholar] [CrossRef]
  16. Vasconez, J.; Delpiano, J.; Vougioukas, S.; Auat Cheein, F. Comparison of convolutional neural networks in fruit detection and counting: A comprehensive evaluation. Comput. Electron. Agric. 2020, 173, 105348. [Google Scholar] [CrossRef]
  17. Duan, Z.; Liu, W.; Zeng, S.; Zhu, C.; Chen, L.; Cui, W. Research on a real-time, high-precision end-to-end sorting system for fresh-cut flowers. Agriculture 2024, 14, 1532. [Google Scholar] [CrossRef]
  18. Estrada, J.S.; Vasconez, J.P.; Fu, L.; Cheein, F.A. Deep Learning based flower detection and counting in highly populated images: A peach grove case study. J. Agric. Food Res. 2024, 15, 100930. [Google Scholar] [CrossRef]
  19. Ma, B.; Wu, Z.; Ge, Y.; Chen, B.; Zhang, H.; Xia, H.; Wang, D. A Recognition Method for Marigold Picking Points Based on the Lightweight SCS-YOLO-Seg Model. Sensors 2025, 25, 4820. [Google Scholar] [CrossRef]
  20. Patel, S. Marigold flower blooming stage detection in complex scene environment using faster RCNN with data augmentation. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 676–684. [Google Scholar] [CrossRef]
  21. Abbas, T.; Razzaq, A.; Zia, M.A.; Mumtaz, I.; Saleem, M.A.; Akbar, W.; Khan, M.A.; Akhtar, G.; Shivachi, C.S. Deep neural networks for automatic flower species localization and recognition. Comput. Intell. Neurosci. 2022, 2022, 9359353. [Google Scholar] [CrossRef] [PubMed]
  22. Sert, E. A deep learning based approach for the detection of diseases in pepper and potato leaves. Anadolu Tarım Bilim. Derg. 2021, 36, 167–178. [Google Scholar] [CrossRef]
  23. Horng, G.J.; Liu, M.X.; Chen, C.C. The smart image recognition mechanism for crop harvesting system in intelligent agriculture. IEEE Sens. J. 2019, 20, 2766–2781. [Google Scholar] [CrossRef]
  24. Dias, P.A.; Tabb, A.; Medeiros, H. Apple flower detection using deep convolutional networks. Comput. Ind. 2018, 99, 17–28. [Google Scholar] [CrossRef]
  25. Wu, D.; Lv, S.; Jiang, M.; Song, H. Using channel pruning-based YOLO v4 deep learning algorithm for the real-time and accurate detection of apple flowers in natural environments. Comput. Electron. Agric. 2020, 178, 105742. [Google Scholar] [CrossRef]
  26. Chen, J.; Chen, W.; Zeb, A.; Yang, S.; Zhang, D. Lightweight inception networks for the recognition and detection of rice plant diseases. IEEE Sens. J. 2022, 22, 14628–14638. [Google Scholar] [CrossRef]
  27. Cheng, Z.; Zhang, F. Flower End-to-End Detection Based on YOLOv4 Using a Mobile Device. Wirel. Commun. Mob. Comput. 2020, 2020, 8870649. [Google Scholar] [CrossRef]
  28. Banerjee, D.; Kukreja, V.; Sharma, V.; Jain, V.; Hariharan, S. Automated Diagnosis of Marigold Leaf Diseases using a Hybrid CNN-SVM Model. In Proceedings of the 2023 8th International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 1–3 June 2023; pp. 901–906. [Google Scholar]
  29. Fan, Y.; Tohti, G.; Geni, M.; Zhang, G.; Yang, J. A marigold corolla detection model based on the improved YOLOv7 lightweight. Signal Image Video Process. 2024, 18, 4703–4712. [Google Scholar] [CrossRef]
  30. Li, S.; Song, W.; Fang, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Deep learning for hyperspectral image classification: An overview. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6690–6709. [Google Scholar] [CrossRef]
  31. Thenkabail, P.S.; Lyon, J.G.; Huete, A. Advances in hyperspectral remote sensing of vegetation and agricultural crops. In Fundamentals, Sensor Systems, Spectral Libraries, and Data Mining for Vegetation; CRC Press: Boca Raton, FL, USA, 2018; pp. 3–37. [Google Scholar]
32. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  33. Du, J. Understanding of object detection based on CNN family and YOLO. In Proceedings of the 2nd International Conference on Machine Vision and Information Technology (CMVIT 2018), Hong Kong, China, 23–25 February 2018; Journal of Physics: Conference Series. IOP Publishing: Bristol, UK, 2018; Volume 1004, p. 012029. [Google Scholar]
  34. Vilcapoma, P.; Parra Meléndez, D.; Fernández, A.; Vásconez, I.N.; Hillmann, N.C.; Gatica, G.; Vásconez, J.P. Comparison of faster R-CNN, YOLO, and SSD for third molar angle detection in dental panoramic X-rays. Sensors 2024, 24, 6053. [Google Scholar] [CrossRef]
  35. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  36. Hamidisepehr, A.; Mirnezami, S.V.; Ward, J.K. Comparison of object detection methods for corn damage assessment using deep learning. Trans. ASABE 2020, 63, 1969–1980. [Google Scholar] [CrossRef]
  37. Rai, N.; Zhang, Y.; Ram, B.G.; Schumacher, L.; Yellavajjala, R.K.; Bajwa, S.; Sun, X. Applications of deep learning in precision weed management: A review. Comput. Electron. Agric. 2023, 206, 107698. [Google Scholar] [CrossRef]
38. Maity, M.; Banerjee, S.; Chaudhuri, S.S. Faster R-CNN and YOLO based vehicle detection: A survey. In Proceedings of the 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 8–10 April 2021; pp. 1442–1447. [Google Scholar]
  39. Yuan, T.; Lv, L.; Zhang, F.; Fu, J.; Gao, J.; Zhang, J.; Li, W.; Zhang, C.; Zhang, W. Robust cherry tomatoes detection algorithm in greenhouse scene based on SSD. Agriculture 2020, 10, 160. [Google Scholar] [CrossRef]
  40. Li, M.; Zhang, Z.; Lei, L.; Wang, X.; Guo, X. Agricultural greenhouses detection in high-resolution satellite images based on convolutional neural networks: Comparison of faster R-CNN, YOLO v3 and SSD. Sensors 2020, 20, 4938. [Google Scholar] [CrossRef]
  41. Jocher, G.; Qiu, J.; Ultralytics. Ultralytics YOLO11, version 11.0.0; License: AGPL-3.0; Ultralytics: Frederick, MD, USA, 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 1 August 2025).
  42. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
43. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  44. Shaheed, K.; Qureshi, I.; Abbas, F.; Jabbar, S.; Abbas, Q.; Ahmad, H.; Sajid, M.Z. EfficientRMT-Net—An efficient ResNet-50 and vision transformers approach for classifying potato plant leaf diseases. Sensors 2023, 23, 9516. [Google Scholar] [CrossRef] [PubMed]
  45. Li, J.; Li, J.; Zhao, X.; Su, X.; Wu, W. Lightweight detection networks for tea bud on complex agricultural environment via improved YOLO v4. Comput. Electron. Agric. 2023, 211, 107955. [Google Scholar] [CrossRef]
46. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  47. Miranda, J.C.; Gené-Mola, J.; Zude-Sasse, M.; Tsoulias, N.; Escolà, A.; Arnó, J.; Rosell-Polo, J.R.; Sanz-Cortiella, R.; Martínez-Casasnovas, J.A.; Gregorio, E. Fruit sizing using AI: A review of methods and challenges. Postharvest Biol. Technol. 2023, 206, 112587. [Google Scholar] [CrossRef]
  48. Mimma, N.E.A.; Ahmed, S.; Rahman, T.; Khan, R. Fruits classification and detection application using deep learning. Sci. Program. 2022, 2022, 4194874. [Google Scholar] [CrossRef]
  49. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
50. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2016; Volume 9905. [Google Scholar] [CrossRef]
  51. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  52. Tieleman, T.; Hinton, G. Rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw. Mach. Learn. 2012, 17, 6. [Google Scholar]
  53. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 1–48. [Google Scholar] [CrossRef]
  54. Sethy, P.; Barpanda, N.; Rath, A.; Behera, S. Counting of marigold flowers using image processing techniques. Int. J. Recent Technol. Eng. 2019, 8, 385–389. [Google Scholar]
Figure 1. Proposed architecture for marigold detection. The diagram illustrates (a) dataset acquisition and augmentation; (b) training of object detection models; and (c) detection results.
Figure 2. Flowering stages of marigold captured with a DJI Mini 3 Pro drone: (a) initial stage; (b) intermediate stage; (c) final stage.
Figure 3. Data augmentation techniques applied to the marigold flower detection dataset.
Figure 4. PR curves for the modified detectors, with training (left), validation (center), and testing (right): (a) Faster R-CNN (Test 11), (b) YOLOv2 (Test 13), and (c) SSD (Test 2). Precision refers to the PR metric, with mAP@0.5 shown in the plot titles.
Figure 5. Qualitative detection results of marigold flowers at the final flowering stage using the modified models with DarkNet-53: (a) Faster R-CNN (Test 11), (b) YOLOv2 (Test 13), and (c) SSD (Test 2). Results are consistent with the numerical performance reported in Section 3.2.
Figure 6. Qualitative detection results of marigold flowers at the final flowering stage using modern detectors: (a) YOLOv11, (b) YOLOv12, and (c) RT-DETR. These visual results complement the quantitative comparisons in Table 12 and Table 13.
Table 1. Dataset distribution for marigold planting images.
Data Category | Training | Validation | Testing
Normal (original) | 313 | 39 | 40
Data Augmentation | 627 | 77 | 78
Total | 940 | 116 | 118
Table 2. Data augmentation strategy: number of generated images per method.
Augmentation Method | Training | Validation | Testing
Rotation (±9°) | 150 | 20 | 20
Horizontal Flip | 150 | 20 | 20
Brightness Adjustment | 160 | 20 | 20
Zoom/Cropping | 167 | 17 | 18
Total Augmented | 627 | 77 | 78
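The augmentation counts in Table 2 can be reproduced with standard image operations. The snippet below is a minimal, hypothetical OpenCV-based sketch of the four transformations (rotation of ±9°, horizontal flip, brightness adjustment, and zoom/cropping); the file names, random ranges, and helper functions are illustrative assumptions rather than the authors' code, and bounding-box labels would need the same geometric transforms applied.

```python
# Minimal augmentation sketch (assumed OpenCV pipeline; not the authors' exact code).
import random
import cv2

def rotate(img, max_deg=9):
    """Rotate around the image center by a random angle in [-max_deg, +max_deg]."""
    h, w = img.shape[:2]
    angle = random.uniform(-max_deg, max_deg)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, M, (w, h))

def hflip(img):
    """Mirror the image horizontally."""
    return cv2.flip(img, 1)

def brightness(img, low=0.8, high=1.2):
    """Scale pixel intensities by a random factor to simulate lighting changes."""
    return cv2.convertScaleAbs(img, alpha=random.uniform(low, high), beta=0)

def zoom_crop(img, scale=0.9):
    """Crop a centered region and resize back to the original resolution."""
    h, w = img.shape[:2]
    ch, cw = int(h * scale), int(w * scale)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    return cv2.resize(img[y0:y0 + ch, x0:x0 + cw], (w, h))

if __name__ == "__main__":
    image = cv2.imread("marigold_0001.jpg")  # hypothetical input file
    for name, fn in [("rot", rotate), ("flip", hflip),
                     ("bright", brightness), ("zoom", zoom_crop)]:
        cv2.imwrite(f"marigold_0001_{name}.jpg", fn(image))
```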
Table 3. Key characteristics of YOLOv2, Faster R-CNN, and SSD architectures.
Characteristics | YOLOv2 | Faster R-CNN | SSD
Detection type | One stage | Two stage | One stage
Methodology | Fixed grids with anchors | Region generation (RPN) + classification | Multiple scales with default boxes
Default backbone | Darknet-19 | VGG/ResNet | ResNet
Table 4. Canonical and modern object detection models with backbones, optimizers, and references.
Model | Backbone (Used) | Optimizer (Default) | Reference
Faster R-CNN (MATLAB) | ResNet-50 (orig.: ZFNet/VGG-16) | SGD with momentum (0.9) | Ren et al. [49]
SSD (MATLAB) | ResNet-50 (orig.: VGG-16, SSD300) | SGD with momentum (0.9) | Liu et al. [50]
YOLOv2 (MATLAB) | DarkNet-19 | SGD with momentum (0.9) | Redmon et al. [35]
YOLOv11/12 (Ultralytics) | Default backbone | Auto (SGD/Adam) | Ultralytics [41,42]
RT-DETR (Ultralytics) | ResNet-50 | SGD (default Ultralytics) | Zhao et al. [43]
Table 5. Anchor box sizes (width × height) used for each model. Faster R-CNN and SSD anchors are given in pixels; YOLOv2 anchors follow the grid-relative values of the original paper [35].
Model | Anchor Boxes | Reference
Faster R-CNN (ResNet-50) | 32, 64, 128, 256, 512 × (1:1, 1:2, 2:1) | Ren et al. [49]
SSD (ResNet-50) | 30 × 30, 60 × 60, 111 × 111, 162 × 162, 213 × 213, 264 × 264 | Liu et al. [50]
YOLOv2 (DarkNet-19) | 1.19 × 1.98, 2.79 × 4.53, 4.14 × 8.92, 8.77 × 6.72, 11.3 × 10.5 | Redmon et al. [35]
YOLOv11/12 (Ultralytics) | Auto-computed by k-means clustering (dataset-specific) | Ultralytics [41]
RT-DETR (Ultralytics) | Dynamic anchors via transformer attention | Zhao et al. [43]
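Dataset-specific anchors of the kind listed in Table 5 are typically obtained by clustering the labeled box dimensions of the training set. The sketch below illustrates this with plain k-means over (width, height) pairs and a conversion from pixels to 13 × 13 grid-cell units; the sample box list, the choice of five clusters, and the use of Euclidean distance (instead of the IoU-based distance of the original YOLOv2 procedure) are simplifying assumptions for illustration.

```python
# Illustrative anchor estimation via k-means over labeled box widths/heights.
import numpy as np
from sklearn.cluster import KMeans

def estimate_anchors(wh_pixels, k=5, grid=13, img_size=416):
    """Cluster (width, height) pairs in pixels and return anchors in grid-cell units."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(wh_pixels)
    anchors_px = km.cluster_centers_              # pixel-space cluster centers
    anchors_grid = anchors_px * grid / img_size   # one grid cell = img_size / grid pixels
    return anchors_grid[np.argsort(anchors_grid.prod(axis=1))]  # sort by area

# Hypothetical (width, height) labels in pixels sampled from a training set.
boxes_wh = np.array([[38, 63], [42, 70], [90, 145], [95, 150], [130, 285],
                     [140, 290], [280, 215], [290, 220], [360, 335], [370, 340]], float)
print(estimate_anchors(boxes_wh, k=5))
```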
Table 6. Global training hyperparameters applied to canonical (MATLAB) and modern (Ultralytics) detectors.
Parameter | MATLAB Detectors (Faster R-CNN, SSD, YOLOv2) | Ultralytics Detectors (YOLOv11-12, RT-DETR)
Max epochs | 15 and 30 | 15 and 30
Batch size | 16 | 16
Input resolution | 416 × 416 px | 416 × 416 px
Initial learning rate | 0.001 | 0.001
Optimizer | SGDM | SGD
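For the Ultralytics detectors, the global settings in Table 6 map directly onto the library's training arguments. The call below is a minimal sketch, assuming a hypothetical marigold.yaml dataset description and the nano YOLOv11 weights; it is not the authors' exact configuration, and the MATLAB detectors are trained through their own trainingOptions instead.

```python
# Minimal Ultralytics training sketch; hyperparameters follow Table 6.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")      # pretrained YOLOv11 nano weights (assumed variant)
model.train(
    data="marigold.yaml",       # hypothetical dataset config (image paths + class names)
    epochs=30,                  # 15 or 30 in the reported experiments
    batch=16,
    imgsz=416,                  # 416 x 416 input resolution
    lr0=0.001,                  # initial learning rate
    optimizer="SGD",
)
metrics = model.val()           # reports mAP@0.5 on the validation split
```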
Table 7. Hyper-parameter configuration for CNN training experiments.
Test | Optimizer | Max Epochs | Learning Rate
Test 1 | Adam | 15 | 0.001
Test 2 | SGDM | 15 | 0.001
Test 3 | RMSProp | 15 | 0.001
Test 4 | Adam | 15 | 0.0001
Test 5 | SGDM | 15 | 0.0001
Test 6 | RMSProp | 15 | 0.0001
Test 7 | Adam | 15 | 0.0005
Test 8 | SGDM | 15 | 0.0005
Test 9 | RMSProp | 15 | 0.0005
Test 10 | Adam | 30 | 0.001
Test 11 | SGDM | 30 | 0.001
Test 12 | RMSProp | 30 | 0.001
Test 13 | Adam | 30 | 0.0001
Test 14 | SGDM | 30 | 0.0001
Test 15 | RMSProp | 30 | 0.0001
Test 16 | Adam | 30 | 0.0005
Test 17 | SGDM | 30 | 0.0005
Test 18 | RMSProp | 30 | 0.0005
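The 18 configurations in Table 7 form a full grid over three optimizers, two epoch budgets, and three learning rates. The short sketch below enumerates that grid in the same order as the table, e.g., to drive repeated training runs; the train_and_evaluate callable named in the comment is a hypothetical placeholder.

```python
# Enumerate the 18 hyper-parameter configurations of Table 7.
from itertools import product

optimizers = ["Adam", "SGDM", "RMSProp"]
epoch_budgets = [15, 30]
learning_rates = [0.001, 0.0001, 0.0005]

tests = [
    {"test_id": i + 1, "optimizer": opt, "epochs": ep, "lr": lr}
    for i, (ep, lr, opt) in enumerate(product(epoch_budgets, learning_rates, optimizers))
]

for cfg in tests:
    print(cfg)  # replace with a call such as train_and_evaluate(**cfg)
```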
Table 8. Canonical and modern detector accuracy under standardized training setups (mAP@0.5, %).
Model | Backbone | Optimizer | Train (15 ep.) | Val (15 ep.) | Test (15 ep.) | Train (30 ep.) | Val (30 ep.) | Test (30 ep.)
Faster R-CNN | ResNet-50 | SGDM | 44.1% | 45.2% | 43.1% | 53.1% | 51.1% | 52.1%
SSD | ResNet-50 | SGDM | 41.2% | 41.0% | 40.3% | 38.3% | 39.4% | 38.2%
YOLOv2 | DarkNet-19 | SGDM | 39.6% | 38.7% | 39.1% | 66.8% | 63.0% | 62.9%
YOLOv11 | Default | SGD | 95.9% | 96.2% | 96.7% | 95.9% | 95.7% | 96.5%
YOLOv12 | Default | SGD | 96.4% | 95.9% | 96.5% | 96.5% | 95.9% | 96.3%
RT-DETR | ResNet-50 | SGD | 81.0% | 79.3% | 83.1% | 91.3% | 92.6% | 93.5%
All detectors were trained with fixed budgets of 15 and 30 epochs and base LR = 0.001 to ensure comparability (IoU = 0.5 for evaluation).
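All accuracies in Table 8 (and in the following tables) are mAP@0.5, i.e., a predicted box counts as a true positive only when its intersection-over-union (IoU) with a ground-truth box reaches 0.5. For reference, a minimal IoU sketch for axis-aligned boxes in (x1, y1, x2, y2) format is given below.

```python
# Intersection-over-Union for axis-aligned boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero area if the boxes are disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

assert iou((0, 0, 10, 10), (0, 0, 10, 10)) == 1.0
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # ~0.33, below the 0.5 matching threshold
```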
Table 9. Ablation with DarkNet-53: results for the 18 training configurations (all values are mAP@0.5, %).
Test | Faster R-CNN (Train) | YOLOv2 (Train) | SSD (Train) | Faster R-CNN (Val) | YOLOv2 (Val) | SSD (Val) | Faster R-CNN (Test) | YOLOv2 (Test) | SSD (Test)
Test 1 | 46.2% | 98.1% | 57.9% | 46.1% | 97.2% | 55.6% | 43.8% | 96.3% | 56.8%
Test 2 | 49.3% | 74.0% | 63.2% | 47.7% | 73.0% | 61.3% | 47.1% | 73.0% | 63.1%
Test 3 | 25.2% | 93.4% | 60.1% | 23.1% | 91.2% | 57.5% | 23.2% | 91.5% | 57.8%
Test 4 | 42.0% | 98.7% | 58.0% | 40.0% | 98.6% | 58.0% | 40.0% | 96.5% | 58.0%
Test 5 | 35.6% | 19.0% | 51.0% | 36.0% | 19.0% | 51.0% | 33.1% | 19.0% | 51.0%
Test 6 | 8.0% | 98.0% | 62.0% | 6.0% | 97.0% | 61.0% | 6.0% | 97.0% | 62.0%
Test 7 | 29.0% | 99.0% | 58.0% | 28.1% | 98.0% | 58.0% | 28.2% | 97.0% | 57.0%
Test 8 | 44.9% | 58.0% | 57.0% | 44.7% | 57.0% | 56.0% | 42.0% | 60.0% | 56.0%
Test 9 | 13.0% | 94.0% | 62.0% | 13.0% | 91.0% | 61.0% | 12.0% | 91.0% | 60.0%
Test 10 | 55.0% | 99.0% | 59.2% | 53.0% | 98.0% | 58.8% | 51.0% | 97.0% | 57.1%
Test 11 | 56.9% | 85.0% | 53.0% | 55.8% | 84.0% | 55.0% | 52.8% | 83.0% | 54.0%
Test 12 | 39.0% | 98.0% | 56.0% | 38.0% | 96.0% | 54.0% | 38.0% | 97.0% | 54.0%
Test 13 | 51.0% | 99.2% | 59.3% | 50.0% | 98.9% | 58.2% | 51.0% | 98.8% | 58.1%
Test 14 | 53.0% | 30.3% | 56.9% | 53.0% | 31.2% | 56.9% | 52.0% | 33.3% | 56.7%
Test 15 | 17.0% | 99.0% | 57.2% | 17.0% | 98.0% | 57.7% | 16.0% | 97.0% | 57.2%
Test 16 | 37.0% | 99.0% | 54.0% | 35.0% | 98.0% | 53.0% | 35.0% | 98.0% | 52.0%
Test 17 | 49.0% | 75.0% | 55.1% | 49.0% | 75.0% | 54.5% | 49.0% | 74.0% | 53.8%
Test 18 | 21.2% | 98.0% | 56.0% | 20.3% | 98.0% | 55.0% | 21.1% | 97.0% | 54.0%
Table 10. Best results for models trained with DarkNet-53 backbone across Train, Validation, and Test sets (mAP@0.5, %).
Model | Test ID | Backbone | Epochs | Optimizer | Train | Validation | Test
Faster R-CNN | Test 11 | DarkNet-53 | 30 | Adam | 56.9% | 55.8% | 52.8%
SSD | Test 2 | DarkNet-53 | 15 | SGDM | 63.2% | 61.3% | 63.1%
YOLOv2 | Test 13 | DarkNet-53 | 30 | Adam | 99.2% | 98.9% | 98.8%
Table 11. Model complexity and inference speed (measured on NVIDIA RTX 4070 Ti, batch size 1).
Model | Parameters (M) | Model Size (MB) | Inference Speed (ms/img)
Faster R-CNN (ResNet-50, canonical) | 33.0 | 117.3 | 102.6
SSD (ResNet-50, canonical) | 11.4 | 40.7 | 21.2
YOLOv2 (DarkNet-19, canonical) | 9.4 | 33.3 | 8.6
Faster R-CNN (DarkNet-53, modified) | 42.98 | 153.1 | 99.85
SSD (DarkNet-53, modified) | 6.87 | 24.5 | 35.07
YOLOv2 (DarkNet-53, modified) | 9.13 | 32.6 | 12.13
YOLOv11 (Ultralytics) | 2.6 | 5.2 | 14.7
YOLOv12 (Ultralytics) | 2.6 | 5.2 | 19.8
RT-DETR (ResNet-50) | 32.8 | 63.1 | 28.0
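The latencies in Table 11 are single-image (batch size 1) averages on the RTX 4070 Ti. The sketch below shows a generic timing routine of the kind used for such measurements, assuming a PyTorch-style model, with warm-up iterations and GPU synchronization; the MATLAB and Ultralytics models in this work report timing through their own tooling, so this is an illustrative procedure rather than the exact measurement code.

```python
# Generic single-image latency measurement sketch (assumed PyTorch-style model).
import time
import torch

@torch.no_grad()
def measure_latency_ms(model, img, n_warmup=10, n_runs=100):
    """Average forward-pass time in milliseconds for a batch-size-1 input."""
    for _ in range(n_warmup):            # warm-up: stabilize clocks and caches
        model(img)
    if torch.cuda.is_available():
        torch.cuda.synchronize()         # ensure queued GPU work has finished
    start = time.perf_counter()
    for _ in range(n_runs):
        model(img)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0 / n_runs

# Example with a hypothetical model and a 416x416 RGB tensor:
# latency = measure_latency_ms(model.eval().cuda(), torch.rand(1, 3, 416, 416).cuda())
```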
Table 12. Best detection results per model (mAP@0.5, %) for canonical, modified, and modern detectors.
Model | Backbone | Epochs | Optimizer | Best mAP (Test)
Faster R-CNN | ResNet-50 (canonical) | 30 | SGDM | 52.1%
Faster R-CNN | DarkNet-53 (modified) | 30 | Adam | 52.8%
SSD | ResNet-50 (canonical) | 15 | SGDM | 40.3%
SSD | DarkNet-53 (modified) | 15 | SGDM | 63.1%
YOLOv2 | DarkNet-19 (canonical) | 30 | SGDM | 62.9%
YOLOv2 | DarkNet-53 (modified) | 30 | Adam | 98.8%
YOLOv11 | Default (Ultralytics) | 15 | SGD | 96.7%
YOLOv12 | Default (Ultralytics) | 15 | SGD | 96.5%
RT-DETR | ResNet-50 (Ultralytics) | 30 | SGD | 93.5%
Table 13. Test F1-scores of the best configurations for modified and modern detectors (IoU = 0.5).
Category | Model | Backbone | Test F1 (%)
Modified | Faster R-CNN | DarkNet-53 | 64.7
Modified | SSD | DarkNet-53 | 66.1
Modified | YOLOv2 | DarkNet-53 | 97.9
Modern | YOLOv11 | Default | 93.0
Modern | YOLOv12 | Default | 93.1
Modern | RT-DETR | ResNet-50 | 89.5
All values correspond to the best test configuration of each modified detector (see Table 9 and Table 10). Canonical models, which were reported earlier in terms of mAP, are omitted here; the table therefore focuses on the most informative comparison, namely the modified detectors sharing the DarkNet-53 backbone versus the recent state-of-the-art models (YOLOv11/12 and RT-DETR).
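For reference, the F1-scores in Table 13 are the harmonic mean of the precision P and recall R of each detector at IoU = 0.5:

```latex
F_1 = \frac{2\,P\,R}{P + R}
```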
Table 14. Comparison of best detection results (mAP@0.5, %) and inference speed (ms/img) between this work and other studies.
Reference | Model | Backbone | Best mAP (Test) | Speed (ms/img)
This work | YOLOv2 (canonical) | DarkNet-19 | 62.9% | 8.6
This work | YOLOv2 (modified) | DarkNet-53 | 98.8% | 12.13
This work | YOLOv11 (modern) | Default | 96.7% | 14.7
Sethy et al. [54] | Classical (HSV + CHT) | N/A | 95.6% | N/A
Patel et al. [20] | Faster R-CNN | ResNet-50 (TL COCO) | 88.71% | 4.31
Patel et al. [20] | SSD | MobileNet | 74.30% | 0.64
Ma et al. [19] | SCS-YOLO-Seg | StarNet + C2f-Star + Seg-head | 93.3% | 100
Fan et al. [29] | YOLOv7 (lite, pruned) | DSConv + SPPF + pruning | 93.9% | 166.7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
