Detecting Objects in Aerial Imagery Using Drones and a YOLO-C3 Hybrid Approach

Calcagno, Salvatore; Midolo, Alessandro; Scaletta, Erika; Tramontana, Emiliano; Verga, Gabriella

doi:10.3390/fi18040204


by Salvatore Calcagno, Alessandro Midolo, Erika Scaletta, Emiliano Tramontana * and Gabriella Verga

Dipartimento di Matematica e Informatica, University of Catania, 95125 Catania, Italy

* Author to whom correspondence should be addressed.

Future Internet 2026, 18(4), 204; https://doi.org/10.3390/fi18040204

Submission received: 25 January 2026 / Revised: 9 April 2026 / Accepted: 10 April 2026 / Published: 13 April 2026

(This article belongs to the Special Issue Developments of Computer Vision and Image Processing: Methodologies and Applications—2nd Edition)


Abstract

Drones have proven effective for acquiring aerial imagery, and when equipped with onboard analysis tools, they can automatically identify objects of interest. Neural-network methods for image analysis typically require large training datasets and substantial computational resources. By contrast, algorithmic techniques can detect objects using simple features, such as pixel colors, thereby reducing the need for extensive training and computational resources. Once trained, both types of system can analyze images in a short time. In our experiments, each approach has distinct strengths. The YOLO-based detector is more accurate for complex-shaped objects, such as trees, whereas the pixel-color approach performs better on sparser objects. This paper proposes YOLO-C3, a hybrid system designed for onboard drone image processing. By leveraging the strengths of both YOLO-based and pixel-based approaches, YOLO-C3 balances detection accuracy with estimation confidence. Trained on a Mediterranean imagery dataset, the system is optimized for identifying natural objects, including citrus groves and trees. To assess the robustness of the image classifier, K-fold cross-validation is performed. Compared to existing models, YOLO-C3 detects a wider range of natural objects with high accuracy and minimal latency, achieving a processing speed of 0.01 s per image. By performing object detection locally, drones can adapt their trajectories to support emergency response, helping to map safe corridors and locate buildings where people may be awaiting rescue after a natural disaster.

Keywords: drones; rescue operations; aerial images; land use/land cover; remote sensing; YOLO; object detection; pixel-based algorithms


1. Introduction

Drones equipped with high-resolution cameras have been used to survey large areas [1,2]. Aerial imagery provides valuable information for geographic data extraction [3,4], road traffic analysis [5,6], wildfire detection [7], solar panel detection [8,9], and storm prevention [10]. A key goal is detecting objects and extracting associated data [11,12]. Pixel-based classification effectively extracts information; however, it struggles with variability in resolution, scale, and color characteristics across heterogeneous datasets [13]. AI-based methods, particularly neural networks, have become the norm [14]. However, they require extensive training data and incur significant computational costs. The performance of neural network models depends on the quality and quantity of training data, which are often scarce for detecting certain objects of interest, such as citrus groves or wells. To the best of our knowledge, although several datasets exist for classifying man-made objects in aerial imagery, datasets targeting natural agricultural objects, such as citrus groves, trees, and fields, remain limited. Our objective was to analyze land cover in the Mediterranean region to detect both natural and man-made objects. To address the lack of a dataset for this purpose, we created a custom dataset. Moreover, a direct comparison of neural-network and pixel-based performance for these categories is missing.

This paper proposes a drone-based system in which aerial imagery is acquired and analyzed onboard to detect objects of interest, allowing the drone to reroute itself and gather high-resolution data where it matters most. Image analysis is performed by our component, named YOLO-C3, which combines a lightweight, pixel-color classifier for land use/land cover with a YOLOv11-seg model for robust object detection and segmentation. Evaluating both methods on the same custom dataset and comparing their accuracy, resource demands, and generalization is instrumental in clarifying when each method is most useful. While YOLO-based detection is generally robust, estimates marked by a low confidence score need further analysis, which is achieved by the pixel-based approach. The YOLO-C3 software component is compact enough for a Raspberry Pi yet powerful enough to support real-time decision-making. In practice, the approach enables autonomous drones to rapidly map roads, vegetation, buildings, wells, and other features, generating actionable alerts for rescue teams or infrastructure managers without constant human supervision. The hybrid approach thus delivers a flexible, explainable, and low-cost solution suitable for emergency response, environmental monitoring, and other field operations.

The remainder of the paper is structured as follows. Section 2 discusses related work and analyzes the different image classification methods and existing datasets. Section 3 presents the proposed approach using drones, the dataset, and its analysis based on pixel colors and YOLO. Section 4 reports the experiments and results obtained for each classification approach, both for the validation phase and for drone images. Section 5 discusses the differences between the approaches, highlighting their advantages and disadvantages. Finally, Section 6 draws our conclusions.

2. Related Work

2.1. Pixel-Based and Traditional Land-Cover Classification Approaches

In the past decade, land-cover classification has been addressed using a variety of approaches, beginning with early pixel-based algorithms and then progressively extending them to knowledge-based, object-oriented, and hybrid methods [15]. Pixel-based methods assign a class to each pixel from its spectral cues (e.g., reflectance/indices or color rules), offering lightweight processing and minimal training but remaining sensitive to clouds, seasonality, and mixed-pixel effects [16]. To mitigate data gaps, pixel-based compositing has been used to generate cloud-free, seasonally and radiometrically consistent datasets, e.g., in Landsat imagery [16,17]. From a historical perspective, Phiri et al. [15] traced early Landsat classification to pixel-based (supervised/unsupervised) algorithms. In recent years, the focus extended to hyperspectral imaging, which provides richer spectral–spatial information [18]. However, studies highlight persistent challenges, including data sparsity and the complexity of pixel-level classification [19].

2.2. Deep Learning Object Detection and YOLO-Based Approaches

To overcome the limitations of pixel-based approaches, recent research has increasingly relied on deep learning-based detectors, with YOLO becoming a central reference in the field. Jiang et al. [20] provided a brief review of the YOLO algorithm and its advanced versions up to YOLOv5, outlining the main differences and similarities between versions, discussing their motivations, feature development, and limitations. The study highlighted YOLO’s strengths in terms of speed and generalization ability, while also noting weaknesses such as inaccurate localization and poor recall compared to region-based approaches. More recent surveys [21,22,23] confirm the versatility of the YOLO family, which has evolved from YOLOv1 to YOLOv11, progressively integrating modules such as attention mechanisms and multi-scale strategies to improve accuracy and robustness. YOLOv11-seg shows exceptional adaptability, supporting tasks beyond object detection, including instance segmentation and image classification. Within the field of remote sensing, several studies have recognized the potential of YOLO. Bai et al. [24] reviewed object detection algorithms, comparing the YOLO series, the Single-Shot MultiBox Detector (SSD) series, region-proposal-based methods, and Transformer-based detectors. After conducting a comparative performance analysis of various algorithms, the authors stated that YOLO offered an excellent trade-off between detection speed and accuracy, making it suitable for real-time scenes, while addressing the persistent challenges in detecting small objects. Similarly, Chen et al. [25] provided a comprehensive survey by reviewing more than 110 studies on object detection in aerial imagery. The work highlighted YOLO among the leading deep learning approaches and emphasized open challenges such as small-object detection, large-scale imagery, and variations in scale and orientation.

Among the most recent developments in the YOLO family, YOLOv11 represents a further evolution in real-time object detection and segmentation. He et al. [26] applied YOLOv11 to high-resolution remote sensing imagery, training the model on a dataset containing 20 categories of man-made objects. Their results confirmed the effectiveness of YOLOv11 in multi-class and multi-object detection scenarios. In a related work, He et al. [27] investigated the use of YOLOv11-seg for landslide detection, demonstrating the potential of segmentation-based YOLO variants for object localization in remote sensing imagery. Unlike these works, which focus on specific targets or rely on bounding-box annotations, our study employs YOLOv11-seg with polygon-based instance annotations and evaluates its performance in a multi-class land-cover detection scenario. However, these studies focus exclusively on deep learning-based detection and do not investigate comparisons with deterministic pixel-based approaches, which remain relevant in certain remote sensing contexts.

2.3. Comparative Studies Between Traditional and Deep Learning Approaches

Several studies compared traditional remote sensing approaches to modern machine learning (ML) or deep learning (DL) models. Li et al. [28] compared a convolutional neural network (CNN)-based approach with two traditional approaches based on color filters and RGB matching for detection purposes, showing that the DL model significantly outperformed traditional methods. However, their comparison was limited to a single category (buildings), whereas our work proposes a broader evaluation on a multi-category dataset. Southworth et al. [29] compared ML and DL methods for spatial systems science, demonstrating that although DL often improved accuracy, ML approaches could remain competitive when training data or computational resources were limited. Their study provided a methodological comparison and a decision tree model to select a suitable model. However, their analysis remained conceptual and methodological, without a controlled head-to-head experimental comparison on the same dataset.

Other works explored the evolution from traditional pipelines to DL detectors. Gui et al. [30] analyzed the transition from rule-based and handcrafted feature approaches to DL detectors such as YOLO and RCNN, highlighting that simple approaches may still remain useful in specific contexts. Similarly, Chen et al. [25] reviewed both traditional remote sensing techniques and modern DL detectors, identifying key challenges such as image resolution variability and the need for sufficiently large annotated datasets.

Numerous studies have contrasted pixel-based image analysis (PBIA) with object-based image analysis (OBIA). Nasiri et al. [31] evaluated PBIA and OBIA on PlanetScope imagery for tree-cutting detection, showing that OBIA classifiers benefited from spatial context and outperformed PBIA models relying solely on spectral cues. Similarly, Adiningsih et al. [32] demonstrated that OBIA better captured complex spatial structures. However, these studies remain confined to the PBIA–OBIA paradigm and do not compare modern DL detectors.

Taken together, this diverse landscape highlights comparisons that differ from our approach: to the best of our knowledge, no previous study provides a direct comparison between a deterministic RGB pixel-based method and a YOLO-based DL model on the same multi-category dataset using polygon-based annotations. Our work addresses this gap by proposing a head-to-head evaluation between a color-rule-based pipeline and YOLOv11-seg applied to the same imagery and categories. Table 1 compares the works most closely related to our experimental setting and approach. Such studies either propose methodological comparisons between different paradigms (e.g., traditional methods vs. ML or DL) or evaluate recent YOLO-based detectors applied to remote sensing imagery.

2.4. Public Remote Sensing Datasets

Several publicly available aerial and remote sensing datasets are widely used in object detection and segmentation research, including DOTA [33], xView [34], iSAID [35], SpaceNet [36], LoveDA [37], and FLAIR-HUB [38]. However, none of these datasets provide all the characteristics required in our study: (i) the specific agricultural categories of interest, i.e., citrus groves, olive groves, and wells; (ii) instance-level polygon annotations aligned with the specific object categories considered; (iii) high-resolution RGB images representing Mediterranean agricultural landscapes comparable to the scenario of interest.

Some widely adopted datasets rely on bounding-box annotations, which would introduce systematic color contamination within object regions and compromise the reliability of the characterizing-color extraction phase. For this reason, the creation of a dedicated dataset with precise polygonal masks was necessary to ensure a fair and methodologically sound comparison between the two approaches. Table 2 compares our dataset and some representative remote sensing datasets.

3. Methodology

This section presents the design of a system composed of a fleet of drones capable of acquiring images and dynamically adapting their trajectories according to the objects identified by the proposed YOLO-C3 hybrid approach. The design of YOLO-C3 integrates two complementary techniques, a YOLO-based and a pixel-based one, to leverage their strengths and mitigate their respective limitations, both for object/area detection and for adaptive navigation. Thanks to its hybrid nature, YOLO-C3 can perform better than previous approaches, as the results of the YOLO-based method are filtered according to a confidence score and the area of detected objects and then compared with the results of the pixel-based method. The proposed methodology consists of the following stages.

Training on satellite imagery: The hybrid model is trained using satellite images obtained from Google Maps, selected for their high quality, availability, and regular updates. The proposed approach combines a YOLO-based detector with a pixel-based color analysis method. The results of the two methods are assessed on the same dataset to ensure a fair comparison. The pixel-based method relies on the analysis of color distributions and tonal variations, whereas the YOLO-based method uses a CNN for automatic object detection. (Section 3.2, Section 3.3 and Section 3.4 detail dataset construction, pixel-based approach, and YOLO-based approach, respectively.) Figure 1 shows an overview of the proposed hybrid detection framework: the pixel-based method performs steps 2.1 to 2.4, whereas the YOLO-based method performs steps 3.1 to 3.4.
Testing on drone-acquired images: The model trained on satellite images is subsequently applied to drone-captured images to support trajectory adaptation. The feasibility of this approach is evaluated experimentally to check whether the model generalizes to images acquired at a lower altitude but with the same vertical (nadir) perspective (see Section 4.4).
Deployment and territorial analysis: Drones are deployed to predefined areas of interest identified through GPS coordinates extracted from satellite imagery. The system enables autonomous image acquisition and real-time trajectory adaptation based on detected objects (see Section 3.1). The availability of higher-resolution drone imagery allows for a more detailed analysis and improved object understanding.

Figure 1. The pixel-based method extracts colors, while the YOLO-based method filters polygons (yellow box); then, after training, the models are embedded into drones. Drones analyze images (red box), filter results by confidence score and size (blue box), combine results (green box) by applying rules and select targets.
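The filter-then-combine logic summarized in Figure 1 can be illustrated with a minimal sketch. The text does not spell out the combination rules, so everything below is an assumption made for illustration: the `Detection` record, the `fuse` function, the two confidence thresholds, the minimum area, and the agreement measure `pixel_share` (the fraction of a detected mask that the pixel-based map assigns to the same category) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    category: str       # category predicted by the YOLO-based method
    confidence: float   # confidence score of the prediction, in [0, 1]
    area_px: int        # area of the predicted mask, in pixels
    pixel_share: float  # hypothetical: share of the mask that the
                        # pixel-based map assigns to the same category


def fuse(detections, conf_hi=0.6, conf_lo=0.25,
         min_area_px=15_000, min_agreement=0.5):
    """Keep detections that are confident, or that the pixel map confirms.

    All four thresholds are illustrative placeholders, not values
    taken from the paper.
    """
    accepted = []
    for det in detections:
        # Filter by size and by a minimum confidence score.
        if det.area_px < min_area_px or det.confidence < conf_lo:
            continue
        # High confidence is accepted directly; low confidence needs
        # agreement from the pixel-based classification.
        if det.confidence >= conf_hi or det.pixel_share >= min_agreement:
            accepted.append(det)
    return accepted
```

Under these assumptions, a low-confidence detection survives only when the pixel-based map agrees with it, which mirrors the idea that uncertain YOLO estimates receive further analysis from the pixel-based method.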

3.1. Drone Image Acquisition and Analysis

The proposed solution involves drone-based image acquisition combined with ad hoc image analysis techniques to efficiently detect and monitor objects of interest over large geographic areas. The use of drones enables flexible, on-demand data collection with high spatial resolution, allowing the system to capture up-to-date visual information.

Drones analyze aerial imagery to automatically identify relevant objects and assess their conditions. Once objects of interest are detected, drones can dynamically change their trajectory to perform further targeted image acquisition, significantly improving the level of detail and accuracy of the collected data. This targeted approach reduces unnecessary data capture while ensuring that critical areas receive focused attention.

Integration of intelligent image analysis with a coordinated drone network supports precise territorial documentation and adaptive monitoring. Each drone is equipped with a high-resolution camera and an onboard computing unit (e.g., a Raspberry Pi), enabling autonomous operation and local data handling. Dedicated scripts running on the drones manage image capture, temporary storage, and deferred data transmission, as well as object detection. This ensures robust operation even in environments with intermittent connectivity. This design allows drones to independently survey large areas, collect multiple images, and efficiently transfer data in a consolidated form.

Figure 2 represents the workflow for drones. Drones are given a set of geographic coordinates that delimit the area where images have to be captured; the drones move towards their destination and, once a set of images is captured, this set is analyzed onboard to detect objects of interest. The drones recognize objects of interest by running the object-detection software component on each image against a given list of categories. Once an object of interest has been found, its approximate geographic coordinates are extracted from the image and used to redirect the drones and perform further image acquisition. This enables drones to capture more detailed images of objects of interest. Moreover, drones can explore portions within the delimited area to look for objects of interest.
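The approximate extraction of an object's geographic coordinates from an image can be sketched as follows. This is only an illustration under stated assumptions: the image is taken looking straight down, is oriented north-up, its center corresponds to a known GPS position, and the ground distance covered by one pixel is known. The function name `pixel_to_gps` and all of its parameters are hypothetical and do not come from the paper.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def pixel_to_gps(center_lat, center_lon, px, py,
                 img_w, img_h, metres_per_pixel):
    """Approximate (lat, lon) of pixel (px, py) in a north-up nadir image."""
    # Offsets from the image center, converted to metres on the ground.
    east_m = (px - img_w / 2) * metres_per_pixel
    north_m = (img_h / 2 - py) * metres_per_pixel  # image y grows downward
    # Small-offset approximation: metres to degrees of latitude/longitude.
    dlat = math.degrees(north_m / EARTH_RADIUS_M)
    dlon = math.degrees(
        east_m / (EARTH_RADIUS_M * math.cos(math.radians(center_lat))))
    return center_lat + dlat, center_lon + dlon
```

For example, `pixel_to_gps(37.5, 15.08, 512, 512, 1024, 1024, 0.1)` returns the center position unchanged, while pixels above the center map to slightly higher latitudes.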

The images are stored locally, compressed, and transferred to the server when connectivity is restored. The server receives the images and the list of identified objects and then filters them according to their current relevance: for rescue operations, for example, degraded road conditions or debris on the rail network; for renovation operations, uncultivated areas or degraded roads. The filtered and ranked data are used to generate actionable alerts, which are distributed to authorized users via notifications or email to support timely and informed decision-making.

3.2. Initial Data Collection

To develop a component that can be used onboard drones to identify objects from aerial imagery, an initial dataset is needed to perform the appropriate training.

Data Source. We extracted images from Google Maps (API version 3.62), a choice motivated by the high spatial detail and image quality. The image retrieval was performed using a Python script that downloaded an image based on a pair of GPS coordinates that represented the center of the image. Each image was approximately 1024 × 1024 pixels and was taken at zoom level 21, the highest available. Figure 3 shows thumbnails of some extracted images. The resulting dataset consisted of 177 images. Despite this small size, it included approximately 3000 annotated object instances, and using 80% of them was sufficient for satisfactory model training. This high instance count came from multi-label annotations that covered a wide range of elements. Table 3 reports the instance counts per category; there is a clear gap between citrus groves (749 objects) and wells (101 objects). However, the class imbalance (at most 1:7) was considered acceptable, as supported by the results in Section 4.

Labeling and Coloring. Each image was manually annotated using the Roboflow tool [39] (https://roboflow.com/, accessed on 30 March 2026). Roboflow enables fast and intuitive labeling by allowing users to outline the polygonal boundaries that delimit each object. The images pertained to land use/land cover, and the following were the categories of interest: citrus groves, trees, houses, wells, meadows, roads, fields, and olive groves. Each category was associated with a specific color (e.g., red for citrus groves, yellow for wells, etc.). This annotation process made it possible to identify all objects of interest within an image. Once the labeling was complete, another version of each image was generated so that each object in it was filled with the default color for the category. This coloring procedure was necessary for the validation phase of the classification approaches that were compared.

Train/Test. To train the classification algorithms for each approach, the dataset was divided as follows: 144 (approximately 80%) training images and 33 (approximately 20%) testing images. The training phase was substantially different in each of the following approaches. In the case of the pixel-based algorithm, only the colors of the images were analyzed and filtered to train the algorithm, while in the other case (YOLO-based), the entire image training set was used to train the algorithm.

1. Pixel-based Color Classification: The training phase consisted of the creation of the characterizing color (CC) sets for each category of objects of interest. CC sets were built by extracting pixel colors from the annotated aerial imagery, filtering out irrelevant dark tones, and assigning each remaining color to the land-cover category where it was most representative. Counts of unique RGB values were aggregated into a Land Color Table (LCT), which supported the detection of dominant colors for each category.
2. YOLO-based Classification: The selected model YOLOv11x-seg was trained by a canonical procedure using Python version 3.13. The output of the Roboflow-based labeling process was the dataset given as input to the YOLO model, consisting of the training images together with the polygons drawn for each category.

The details of each approach are given in Section 3.3 and Section 3.4. The testing set was used in the same fashion for both approaches to have a ground truth, allowing us to verify the quality of the two approaches. A similar consideration applied to the testing phase, described in Section 4. Both approaches included a post-processing stage aimed at improving the performance of the individual methods. For instance, in the YOLO-based method, the proper calibration of the confidence score proved to be crucial, significantly improving the overall performance.

Both models were trained on a host with the following characteristics: 32 GB RAM; Intel i7 12700H CPU; NVIDIA RTX 3050 Ti. The YOLO-based method required a training time of 15 h and 30 min, while the training of the pixel-based approach required approximately 10 min.

3.3. Pixel-Based Color Classification

The proposed approach classifies land use and land cover (LULC) by exploiting the characteristic colors of each land type rather than training ML-based models. It relies on a compact set of annotated aerial imagery to automatically extract characterizing colors (CCs)—colors that uniquely describe each land category—and then uses those colors to classify new unlabeled images. The method unfolds in three main phases: (i) generation of CC sets from the annotated dataset; (ii) pixel-based classification of unlabeled images; and (iii) post-processing for noise removal and gap filling.

The first phase generates the characterizing color sets, identifying the most representative colors for each category. For every pixel in each category of the annotated dataset, the RGB triplet and its label are recorded. Counts of identical colors are aggregated across all images to build the Land Color Table (LCT). The LCT has one row per unique RGB color; the first column stores the color, and the remaining columns record the frequency of that color in each category and in the background. Dark tones, often caused by shadows or ground features, are not reliable indicators of land cover. To remove them, RGB values are projected into the HSL color space, and the lightness component is inspected. Any color with a lightness below a fixed threshold (0.16 on a [0, 1] scale in our experiments) is removed from the LCT. For every row of the filtered LCT, the normalized pixel counts across categories are compared. A color is assigned to the category with the highest count only if two conditions hold: (i) no other category’s count exceeds 75% of the maximum; and (ii) the combined count of all other categories does not exceed 1.2–1.5 times the maximum. These conditions guarantee that the color occurs predominantly in a single category. Colors that fail these checks are collected in a separate Unassigned Color (UC) set. The outcome of this phase is one CC set per category (and one UC set) that captures the most distinctive colors of each land type.
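The first phase can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the normalization is assumed to divide each count by the total number of pixels of that category (the text does not say how counts are normalized), the background column is omitted, and the lower end (1.2) of the reported 1.2 to 1.5 range is used for condition (ii).

```python
import colorsys
from collections import Counter, defaultdict

LIGHTNESS_MIN = 0.16    # lightness threshold reported in the text
RUNNER_UP_RATIO = 0.75  # condition (i): no other category above 75% of the max
OTHERS_RATIO = 1.2      # condition (ii): assumed value within the 1.2-1.5 range


def build_lct(labeled_pixels):
    """Aggregate (rgb, category) pairs into a table: rgb -> Counter."""
    lct = defaultdict(Counter)
    for rgb, category in labeled_pixels:
        lct[rgb][category] += 1
    return lct


def lightness(rgb):
    """HSL lightness of an 8-bit RGB triplet, on a [0, 1] scale."""
    r, g, b = (c / 255.0 for c in rgb)
    return colorsys.rgb_to_hls(r, g, b)[1]  # colorsys returns (h, l, s)


def assign_colors(lct, totals):
    """Split the table into per-category CC sets and a single UC set."""
    cc_sets, uc_set = defaultdict(set), set()
    for rgb, counts in lct.items():
        if lightness(rgb) < LIGHTNESS_MIN:
            continue  # dark tone: removed from the table
        # Assumed normalization: divide by the category's total pixel count.
        norm = {cat: n / totals[cat] for cat, n in counts.items()}
        best = max(norm, key=norm.get)
        top = norm[best]
        others = [v for cat, v in norm.items() if cat != best]
        dominant = (all(v <= RUNNER_UP_RATIO * top for v in others)
                    and sum(others) <= OTHERS_RATIO * top)
        if dominant:
            cc_sets[best].add(rgb)
        else:
            uc_set.add(rgb)
    return cc_sets, uc_set
```

Here `totals` maps each category to its total number of annotated pixels; a color shared evenly by two categories ends up in the UC set, while a color seen almost exclusively in one category joins that category's CC set.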

The second phase classifies aerial imagery following the set of characterizing colors defined in the first phase. Each pixel RGB value is compared to the CC and UC sets. If it belongs to a category’s CC set, it is immediately marked with that category’s unique output color. Pixels whose colors appear in the UC set are temporarily marked as unassigned. This produces a preliminary, pixel-wise map of category marks. Because UCs often appear near the borders of homogeneous regions, a local density analysis is applied to re-assign them. The image is scanned with square windows of progressively larger size (for example 3 × 3, 9 × 9 and 25 × 25 pixels). In each window, if exactly one category occupies more than 50% of the window’s pixels, the entire window is colored with that category. If no category crosses the threshold, the UC pixels in the window are temporarily counted towards every category; if this update causes one category to exceed the threshold, the window is colored accordingly. Through three iterations, the dominant areas expand while preserving sharp boundaries.
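The second phase can be sketched as follows, using plain Python lists for clarity rather than speed. Several details are assumptions made for illustration: windows are treated as non-overlapping tiles, "one category exceeds the threshold" is read as exactly one category, and the names `classify_pixels`, `reassign_by_density`, and `classify` are hypothetical.

```python
from collections import Counter

UNASSIGNED = "UC"  # marker for pixels whose color is in the UC set


def classify_pixels(image, cc_sets, uc_set):
    """Map each RGB pixel to a category, UNASSIGNED, or None (background)."""
    lookup = {rgb: cat for cat, colors in cc_sets.items() for rgb in colors}
    return [[lookup.get(px, UNASSIGNED if px in uc_set else None)
             for px in row] for row in image]


def reassign_by_density(labels, window, threshold=0.5):
    """One pass: color each window with its dominant category, if any."""
    height, width = len(labels), len(labels[0])
    out = [row[:] for row in labels]
    for top in range(0, height, window):
        for left in range(0, width, window):
            cells = [(y, x)
                     for y in range(top, min(top + window, height))
                     for x in range(left, min(left + window, width))]
            values = [labels[y][x] for y, x in cells]
            counts = Counter(v for v in values
                             if v not in (None, UNASSIGNED))
            n_uc = values.count(UNASSIGNED)
            need = threshold * len(cells)
            winners = [c for c, n in counts.items() if n > need]
            if not winners:
                # Count the UC pixels towards every category and retry.
                winners = [c for c, n in counts.items() if n + n_uc > need]
            if len(winners) == 1:
                for y, x in cells:
                    out[y][x] = winners[0]
    return out


def classify(image, cc_sets, uc_set, windows=(3, 9, 25)):
    """Pixel-wise classification followed by three density passes."""
    labels = classify_pixels(image, cc_sets, uc_set)
    for window in windows:
        labels = reassign_by_density(labels, window)
    return labels
```

With a threshold above 50%, at most one category can win the first check, so ambiguity arises only in the second check, where a tie leaves the window unchanged.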

A final post-processing phase further improves the map through two complementary operations, denoising and closing. The denoising addresses all the isolated misclassifications, removing them by scanning the image with a large window (e.g., 75 × 75 pixels). If the number of pixels of a given color inside the window is less than a third of the window area, the pixels are considered noise and reverted to background. The closing phase focuses on unmarked pixels that lie inside a region dominated by a single category with the objective of filling such small areas. A window expands around each unmarked tile; if a category covers more than two-thirds of the window, the tile is assigned to that category. These steps produce a clean, contiguous LULC map with minimal false positives and negatives.
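The two post-processing operations can be sketched in the same style. This is a simplified illustration under stated assumptions: denoising uses non-overlapping tiles, and closing inspects a fixed neighborhood around each unmarked pixel instead of an expanding window around a tile. The function names and the `radius` parameter are hypothetical.

```python
from collections import Counter


def denoise(labels, window=75, min_fraction=1 / 3):
    """Revert to background any category that is sparse within a window."""
    height, width = len(labels), len(labels[0])
    out = [row[:] for row in labels]
    for top in range(0, height, window):
        for left in range(0, width, window):
            cells = [(y, x)
                     for y in range(top, min(top + window, height))
                     for x in range(left, min(left + window, width))]
            counts = Counter(labels[y][x] for y, x in cells
                             if labels[y][x] is not None)
            # Categories covering less than a third of the window are noise.
            sparse = {c for c, n in counts.items()
                      if n < min_fraction * len(cells)}
            for y, x in cells:
                if labels[y][x] in sparse:
                    out[y][x] = None
    return out


def close_gaps(labels, radius=1, min_fraction=2 / 3):
    """Fill unmarked pixels whose neighborhood is dominated by one category."""
    height, width = len(labels), len(labels[0])
    out = [row[:] for row in labels]
    for y in range(height):
        for x in range(width):
            if labels[y][x] is not None:
                continue
            neigh = [labels[j][i]
                     for j in range(max(0, y - radius),
                                    min(height, y + radius + 1))
                     for i in range(max(0, x - radius),
                                    min(width, x + radius + 1))]
            counts = Counter(v for v in neigh if v is not None)
            if counts:
                category, n = counts.most_common(1)[0]
                # More than two-thirds of the neighborhood must agree.
                if n > min_fraction * len(neigh):
                    out[y][x] = category
    return out
```

Applied in sequence, `denoise` removes an isolated misclassified pixel and `close_gaps` then fills the resulting hole with the surrounding category.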

Details are provided in our previous work dedicated to the implementation of this algorithm [40].

3.4. YOLO-Based Classification

A valid alternative to image classification based on pixel analysis is object-based classification. Generally, object identification can provide more detailed information and more robust results. One methodology for classifying objects in aerial imagery is supported by the YOLO model. YOLO is a popular model for object detection and image segmentation that was originally presented in 2016 [41]. YOLO quickly gained popularity due to its high speed and accuracy and now supports various vision AI tasks such as detection, segmentation, pose estimation, tracking, and classification [21,22]. The YOLO family of models has continued to evolve since then, and the YOLO11 version was released in September 2024. Among its various extensions, YOLOv11-seg (https://docs.ultralytics.com/tasks/segment/#models, accessed on 30 March 2026) stands out, as it was specifically optimized for image segmentation tasks to significantly improve pixel-level classification accuracy [27].

The key strength of YOLOv11-seg lies in its ability to detect diverse geometric shapes and provide highly detailed results. The algorithm presented in this study was implemented using the Ultralytics YOLO Python library (version 8.3.9) (https://docs.ultralytics.com/usage/python/, accessed on 30 March 2026).

This library provides functionalities such as object detection, image segmentation, and classification. The steps performed to execute the algorithm were as follows: (i) the original gathered dataset was labeled using the Roboflow tool; (ii) a preprocessing stage cleaned the dataset to ensure that it complied with the requirements of YOLOv11-seg; (iii) the YOLOv11-seg model was trained. The approach is inspired by our previous work discussing image classification using YOLOv11-seg [42].

The first step consisted of preparing the dataset with labels for every category and for each object in the images. This was performed via the Roboflow tool by manually labeling images according to the defined categories. This initial step produced a YAML file that was then required for the training of the model. However, YOLOv11-seg requires each polygon (i.e., each labeled object) to contain at least five points; for this reason, the second step performed a dataset cleaning, using a custom script, to ensure the removal of non-compliant labeled images. In this same phase, an analysis of the assigned labels was also carried out: labels corresponding to relatively small objects (on the order of 15,000 pixels) were excluded, since they could be just noise or correspond to marginally visible elements that could be mistakenly identified, thereby negatively affecting the final performance of the algorithm. The third and final step involved running the training via the Ultralytics YOLO Python library.
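The cleaning step can be sketched as below; this is not the authors' custom script. It assumes labels in the Ultralytics segmentation text format (one object per line: a class index followed by normalized x y coordinate pairs), filters individual polygons rather than whole images, and treats 15,000 pixels as a hard cutoff, whereas the text only gives that figure as an order of magnitude. The function names are hypothetical.

```python
def polygon_area(points):
    """Area of a simple polygon (shoelace formula); points are (x, y) pairs."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0


def clean_label_lines(lines, img_w=1024, img_h=1024,
                      min_points=5, min_area_px=15_000):
    """Keep only label lines whose polygon is large and detailed enough."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # no class index plus at least one coordinate pair
        coords = [float(v) for v in fields[1:]]
        # Convert normalized coordinates to pixel coordinates.
        points = [(coords[i] * img_w, coords[i + 1] * img_h)
                  for i in range(0, len(coords) - 1, 2)]
        if len(points) < min_points:
            continue  # fewer than five points: non-compliant polygon
        if polygon_area(points) < min_area_px:
            continue  # assumed cutoff for small, noisy objects
        kept.append(line)
    return kept
```

The default image size matches the approximately 1024 × 1024 pixel images described in Section 3.2; other sizes can be passed through `img_w` and `img_h`.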

4. Experimental Evaluation and Results

The test set was used to evaluate both approaches. The 33 test images comprised both simple cases, in which the images had a single category and a single object, and complex cases, in which some images had nearly all categories with multiple objects per category. This section presents and discusses the results obtained for each approach. In the images, the most frequently detected object category was citrus groves, with more than 150 instances identified. In contrast, meadows were the least represented category and were counted only once. Consequently, data for this category were omitted because their limited occurrence could lead to misleading interpretations.

4.1. Evaluation of the Pixel-Based Approach

Before classification, the pixel-based algorithm analyzed and extracted the colors in each land-cover category in the dataset. The process was applied to all 177 images (having labels for each object in them) and computed the number of unique RGB colors belonging to each target category. To reduce noise and improve reliability, colors that frequently appeared in more than one category were filtered out, retaining only the most representative colors. Table 4 reports, for each category, the total number of unique RGB colors initially detected in the training dataset, the number of colors removed because they also occurred in other categories, and the number and percentage of unique RGB colors retained after filtering.

The extraction results confirmed that the algorithm effectively isolated distinctive color distributions for all classes, even though the percentage of retained colors varied significantly among categories. Higher retention rates, such as those observed for citrus groves (48.44%) and olive groves (46.36%), indicated stable and homogeneous color patterns across images. Conversely, lower retention for roads (16.85%) and trees (18.50%) reflected a broad variability in color, caused by factors such as shadows, illumination change, or the presence of mixed pixels containing soil, vegetation, and asphalt. Intermediate retention values for houses, wells and fields (around 21–27%) suggested moderate intra-class consistency, while the meadows category, with only 13.50% retention, appeared to be the most visually ambiguous and underrepresented. Overall, this phase provided a compact and representative color database, reducing noise and supporting the subsequent classification stage that was performed on a pixel-by-pixel basis.
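A pixel-by-pixel classification over such a color database can be sketched as a dictionary lookup. The sketch assumes exact RGB matching and assigns unmatched pixels to a background label; the text does not state how unmatched colors are handled, so both choices, like the function names, are assumptions.

```python
def build_lookup(filtered_colors):
    """Invert {category: {rgb, ...}} into {rgb: category}."""
    return {rgb: category
            for category, palette in filtered_colors.items()
            for rgb in palette}


def classify_image(pixels, lookup, background="background"):
    """Label every pixel of a 2-D grid of RGB tuples by exact color match."""
    return [[lookup.get(tuple(rgb), background) for rgb in row]
            for row in pixels]
```

Because the filtered palettes are disjoint, each color maps to at most one category, which keeps the lookup deterministic.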

Figure 4 compares the output of the pixel-based classification (left) with the corresponding ground-truth mask (center) and the original image (right) for an agricultural area dominated by citrus groves and fields. The left and center images show strong visual correspondence, with red regions indicating citrus groves and cyan areas representing fields. The pixel-based classifier successfully identifies the dominant categories, maintaining coherent and homogeneous patches that closely match the ground-truth segmentation.

The test images were classified, and a label was assigned to each pixel according to its RGB coordinates (as shown in the left part of Figure 4). The numbers of true positives, false positives, true negatives, and false negatives were then calculated. We performed cross-validation by splitting the set into five parts (i.e., K-fold cross-validation with K equal to five): we ran five experiments, each time using four parts (or folds) for training and the remaining part for testing. We therefore obtained five models and, for each, counted the pixels that were identified correctly or incorrectly. Table 5 shows the mean and standard deviation of accuracy, precision, recall, and F1 score for each category. The low standard deviations indicate that the classification is consistent across folds: the standard deviation for accuracy is between 0.01 and 0.06 across categories, except for wells, which present high color variability.
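The evaluation protocol can be sketched as below. The sketch assumes the standard one-vs-rest definitions of accuracy, precision, recall, and F1 score computed from the four counts, and the sample standard deviation across folds; the definitions given earlier in the paper take precedence if they differ. The shuffling seed and all function names are hypothetical.

```python
import random
from statistics import mean, stdev


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists; each item is in the test part exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


def category_metrics(predicted, truth, category):
    """One-vs-rest metrics for a single category from per-pixel labels."""
    tp = fp = tn = fn = 0
    for p, t in zip(predicted, truth):
        if p == category and t == category:
            tp += 1
        elif p == category:
            fp += 1
        elif t == category:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / total if total else 0.0,
            "precision": precision, "recall": recall, "f1": f1}


def summarize(per_fold):
    """Mean and sample standard deviation of each metric across folds."""
    summary = {}
    for name in per_fold[0]:
        values = [m[name] for m in per_fold]
        summary[name] = (mean(values), stdev(values))
    return summary
```

With 177 images and five folds, this split yields test parts of 35 or 36 images, each paired with the remaining images for training.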

The results show that the pixel-based approach achieves satisfactory accuracy across most categories, with houses, trees, and roads performing particularly well. High precision values for houses and wells show that the model reliably identifies these categories with very few false positives, confirming that the color-based discrimination is effective when color patterns are distinctive. Notably, both olive groves and citrus groves achieve balanced and robust performance, with high values for accuracy, precision and recall, indicating stable color patterns and effective recognition. Olive and citrus groves have the highest values for the F1 score. Lower recall values observed in trees, houses, and wells suggest that many true pixels were missed, mainly because colors alone cannot fully capture variations due to shadows, roof materials, or occlusions. These limitations align with the lower retention percentages observed during the extraction stage, where color variability was highest. The absence of reliable metrics for the meadows category further confirms its color ambiguity and limited representation.

The experiments show that the proposed color-based pixel approach distinguishes among multiple land types using a purely data-driven filtering process. This approach is transparent, interpretable, and computationally efficient, and these characteristics make it suitable for rapid land-cover assessments and as a preprocessing or validation step for object-based deep learning models.

The visual agreement observed in Figure 4 confirms the reliability of the color extraction process, particularly for categories with distinctive chromatic characteristics such as citrus groves, which achieved the highest color retention (48.44%, as shown in Table 4). The land category, characterized by lower color retention (21.08%), displays slightly fragmented regions and small edge inconsistencies, which can be explained by the variability in soil color and lighting conditions. Minor discrepancies along field boundaries are primarily due to mixed pixels and gradual transitions between soil and vegetation, which occasionally lead to local misclassifications. Nevertheless, the comparison highlights the effectiveness of the proposed approach in distinguishing large homogeneous regions using only RGB information, producing interpretable and high-resolution maps that reflect the real distribution of land-cover classes in the area.

Figure 5 shows the confusion matrix computed according to the results of the pixel-based approach when considering the whole dataset (177 images) with a split of 80% for training and 20% for testing. Rows give the normalized values for detected pixels, while columns represent the ground truth. The background class in the column indicates the predicted pixels that are outside any labeled region, whereas the background class in the row represents missed ground-truth pixels (false negatives). The most accurate category is olive groves, with most pixels correctly identified, followed by the fields category, which also obtains good performance. Note that the normalized values were computed from the count of pixels; hence, there is fine-grained detail, and this degrades the performance of categories that occupy small areas in the images, such as houses and wells.
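A confusion matrix with this layout can be sketched as follows, with rows indexed by the predicted label, columns by the ground-truth label, and each column normalized by the number of ground-truth pixels, as stated in the caption of Figure 5. The class list, including the background entry, and the function name are illustrative assumptions.

```python
def confusion_matrix(predicted, truth, classes):
    """Return matrix[p][t]: fraction of ground-truth class t predicted as p."""
    index = {c: i for i, c in enumerate(classes)}
    n = len(classes)
    counts = [[0] * n for _ in range(n)]
    for p, t in zip(predicted, truth):
        counts[index[p]][index[t]] += 1
    # Normalize each column by the number of ground-truth pixels of that class.
    col_totals = [sum(counts[r][c] for r in range(n)) for c in range(n)]
    return [[counts[r][c] / col_totals[c] if col_totals[c] else 0.0
             for c in range(n)]
            for r in range(n)]
```

With this normalization, the entry in the background row of a category column is the share of that category's ground-truth pixels that were missed, which is why categories covering few pixels are penalized heavily.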

The confusion matrix reveals several key performance insights that characterize the model’s behavior prior to final optimization. While the model identifies major land features well, some mismatches occur between spectrally similar vegetation types, e.g., citrus groves and olive groves, or citrus groves and trees. Additionally, a portion of smaller or peripheral objects, such as wells and houses, is categorized as background. These results represent the model’s raw output, which was further refined by the subsequent post-processing stage. This final refinement phase mitigates these edge-case misclassifications and resolves many of the background overlaps, ultimately leading to more accurate results.

4.2. Evaluation of the YOLO-Based Approach

During the classification phase, the YOLO-based approach analyzed an image and highlighted the detected objects with different colors. Given an input image, the model produced a copy of that image with additional labels indicating each detected object and its corresponding category. For every detected object, the following data were given: (i) the category name, (ii) a bounding box enclosing the object, and (iii) a confidence score. The confidence score represented the model’s certainty that a detected object belonged to a particular class. It ranged from zero to one, with higher scores indicating greater confidence. The model used the default threshold of 0.25; hence, objects were marked if they were given a confidence score of at least 0.25. In our post-processing step, the results were further refined by filtering out detections with a confidence score below 0.45 and by removing small objects, i.e., objects with fewer than 15,000 pixels. The values of these parameters were set through a manual analysis of image samples to increase confidence and accuracy. The visual inspection of the results showed that most objects were detected correctly, even in the most complex images.

Figure 6 shows a sample of the results, where the only missing object was the house in the top-right image, which was likely missed because of the shadow. Other cases of undetected objects occurred because mud or shadows obstructed object recognition. Nevertheless, the overall results were highly reliable, as objects were correctly identified and accurately labeled regardless of their shape and size. The top-left image shows a label named House 0.26, an example of a detection with a low confidence score, which often occurs when the object is at the edge of the image.

Figure 7 shows the YOLO-based classification for the image analyzed by the pixel-based approach (see Figure 4). In it, citrus groves and fields (or Land on the label) are correctly identified. Some small objects (see labels House and Tree) have a confidence score lower than 0.45 and would be filtered out in the post-processing phase.

Table 6 reports the mean values and the standard deviation values for accuracy, precision, recall and F1 score metrics obtained for each category, when running a 5-fold cross-validation for the dataset. Cross-validation was performed similarly to the previous pixel-based experiment. The low values for the standard deviation for accuracy across all categories show that the five models (one for each run of the five partitions) give comparable and consistent results. Wells have the highest value of standard deviation for the precision metric; this is due to the high color variability of wells. In terms of accuracy, olive groves and citrus groves perform very well and achieve the highest value, while houses and roads have the lowest. With respect to recall, the values are generally high, while roads have the lowest value. Low values for roads could be due to their highly variable shapes and the presence of unrecognized elements within them, such as isolated trees, mud, or grass. Still, roads achieve an accuracy above 73% and recall above 77%. Overall, all categories reach satisfactory performance levels, as the average accuracy is above 82%, precision is above 92% and recall is 89%.

Figure 8 shows the confusion matrix computed according to the results of the YOLO-based approach when considering the whole dataset (177 images) with a split of 80% for training and 20% for testing. Rows give the normalized values for detected objects in each category, while columns represent the ground truth. The background category in the column indicates the predicted objects outside any labeled region, whereas the background class in the row gives the missed ground-truth objects (false negatives). The categories are generally correctly identified, with citrus groves achieving the highest value, followed by houses and wells. Fields present the lowest performance, mainly due to their similarity with other vegetation categories. The YOLO libraries compute the confusion matrix during the validation phase; however, these results do not account for the subsequent post-processing steps, which enhance performance and provide better metrics. The main mismatched detections are for the small objects and the objects located at the edges of the image.

For the YOLO-based experiments, the number of training epochs was 200, a value within the range 100–300, commonly suggested for deep learning and to accommodate several categories. In all the experiments for the 5-fold cross-validation, the models consistently converged between epochs 101 and 103, indicating stable behavior and, consequently, a robust model (no overfitting). This finding was further supported by the metrics (see Table 6), particularly the low standard deviation values indicating limited variability across folds and suggesting stability and ability to generalize.
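For orientation, a training run of this kind could be launched with the Ultralytics Python library roughly as follows. Only the number of epochs comes from the text; the dataset path and the pretrained checkpoint name (here the smallest YOLO11 segmentation checkpoint) are assumptions, since the model size used by the authors is not stated in this section.

```python
TRAIN_ARGS = {
    "data": "dataset.yaml",  # hypothetical path to the Roboflow-exported YAML
    "epochs": 200,           # number of training epochs reported in the text
}


def train_segmentation_model(weights="yolo11n-seg.pt", **overrides):
    """Train a YOLO11 segmentation model (weights name is an assumption)."""
    # Imported lazily so the sketch can be read without the package installed;
    # running it requires `pip install ultralytics`.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(**{**TRAIN_ARGS, **overrides})
```

For the 5-fold cross-validation, the same call would be repeated once per fold with a different dataset YAML, for example `train_segmentation_model(data="fold_0.yaml")`, where the file name is again hypothetical.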

4.3. Comparison of Approaches

Figure 9 shows two images in the right column that represent citrus groves. In the two images, the YOLO-based classifier (left-column images) correctly identifies only a portion of the area that has citrus groves. Instead, the pixel-based classifier (center-column images) correctly determines the whole area that has citrus groves.

Figure 10 shows an image (on the right) representing citrus groves and a large well having a rectangular shape. The YOLO-based classifier (left image) partially identifies the citrus groves but misses the well. The pixel-based classifier detects the citrus groves (highlighted in red) and some parts of the well (yellow pixels).

The comparison of the two approaches is apparent from the results shown in the confusion matrices (see Figure 5 and Figure 8). The matrices report the results obtained when training was performed using 80% of the whole dataset. The results highlight the superior performance of the YOLO-based approach, which integrates contextual and spatial information to improve object recognition. However, the pixel-based approach demonstrates its effectiveness in capturing fine-grained color details and achieves high accuracy for some categories, such as citrus groves. In a minority of cases, the pixel-based approach yields more accurate results (see Figure 9 and Figure 10, where the red pixels in the center images indicate citrus groves and the yellow pixels indicate a well).

Based on the evaluation above, our YOLO-C3 image analysis component runs both methods on the input image. A label is accepted as accurate when both methods assign the same category. When the two methods disagree, which typically happens for objects at the image edges, wells, or small objects that the YOLO-based method does not detect accurately, and the YOLO confidence score is below the 0.45 threshold, the label suggested by the pixel-based approach is retained and considered accurate. Otherwise, the result of the YOLO-based method is confirmed. Additionally, moving the drone over the center of the object helps capture it better, enabling further labeling or confirmation of the previous label.
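The decision rule can be sketched as a small function. It assumes that the pixel-based result for the region of a detection has already been reduced to a single suggested category (for example by a majority vote over its pixels, which is an assumption not stated in the text), and that a missing suggestion is represented by None. The function name and the returned source tag are hypothetical.

```python
CONF_THRESHOLD = 0.45  # confidence threshold stated in the text


def fuse(yolo_label, yolo_conf, pixel_label):
    """Return (label, source) following the hybrid rule described above.

    pixel_label is None when the pixel-based classifier suggests no category.
    An object missed by YOLO can be passed as yolo_label=None, yolo_conf=0.0.
    """
    if pixel_label is not None and pixel_label == yolo_label:
        return yolo_label, "both"      # confirmed by both methods
    if yolo_conf < CONF_THRESHOLD and pixel_label is not None:
        return pixel_label, "pixel"    # low-confidence YOLO result is overridden
    return yolo_label, "yolo"          # otherwise the YOLO result is confirmed
```

Keeping YOLO as the default reflects its better overall performance, while the override is limited to the low-confidence cases in which the pixel-based approach was observed to help.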

4.4. Drone Image Analysis

The drones captured images in predefined geographical areas and moved towards some destinations according to the provided list of objects of interest. The images acquired were analyzed by our YOLO-C3 component, which leveraged both algorithms to achieve more specific and detailed detections and to provide appropriate feedback. The approach was tested on a dataset containing drone images. The dataset used was odm_data_aukerman (https://github.com/OpenDroneMap/odm_data_aukerman, accessed on 30 March 2026). It contains 32 images with a resolution of 4896 × 3672 pixels and 37 images with a resolution of 6000 × 4000 pixels [43]. Image patches were extracted, and both approaches were evaluated (see Figure 11). The results, shown below, confirmed the robustness and validity of the training performed on satellite imagery.

In the case of YOLO, an additional post-processing stage was applied to obtain more efficient results. During post-processing, the following steps were performed:

Step-1: Predictions with a confidence score lower than 0.45 were discarded. This filtering removed less reliable predictions.
Step-2: Predictions corresponding to small objects were removed. An object was considered too small if it contained fewer than 15,000 pixels. This size-based filtering was consistent with the medium-to-large dimensional characteristics of the analyzed categories.
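The two steps can be sketched as a pair of filters applied in sequence. The thresholds are those given in the text; the Detection record and the function names are hypothetical, and the pixel count of each detection is assumed to be available from its predicted mask.

```python
from collections import namedtuple

MIN_CONFIDENCE = 0.45   # step-1 threshold stated in the text
MIN_AREA_PX = 15_000    # step-2 threshold stated in the text

# Hypothetical record: category name, confidence score, mask size in pixels.
Detection = namedtuple("Detection", "label confidence area_px")


def step1_confidence(detections, threshold=MIN_CONFIDENCE):
    """Discard predictions with a confidence score lower than the threshold."""
    return [d for d in detections if d.confidence >= threshold]


def step2_size(detections, min_area=MIN_AREA_PX):
    """Discard predictions covering fewer pixels than the minimum area."""
    return [d for d in detections if d.area_px >= min_area]


def post_process(detections):
    """Apply step-1 and then step-2."""
    return step2_size(step1_confidence(detections))
```

Keeping the steps as separate functions makes it straightforward to compute metrics after each one, as done for the validation of the procedure.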

To validate the post-processing procedure, metrics were computed for each step performed (see Table 7). In these cases, the improvement in the metrics was strongly influenced by the house category: (i) without post-processing, 14 houses were detected; (ii) after step-1, the detections decreased to eight; (iii) after step-2, five houses were detected, of which three were correct. By eliminating small objects, we avoided, for example, confusing houses with cars, which are smaller objects.

Figure 12 shows the results obtained using YOLO. The left-most image contains several objects, including wells and meadows (i.e., lawns), which are removed by post-processing step-1 because of their low confidence scores. The objects labeled Land and the two labeled House are removed in step-2. In the center image, the object labeled House is removed by step-2; this object is actually a crane. In the right-most image, the Citrus Grove label is removed during step-1. In the last image, one of the houses has not been labeled because the camera perspective is excessively oblique, which alters the standard shape of the house.

Tests executed using YOLO took 0.01 s per image, while the analysis relying on the pixel-based approach took approximately 2.8 s per image. The two models ran in parallel. Most of the time, the YOLO-based model gave accurate results, and only small objects, objects at the image edges, or wells had to be confirmed by the pixel-based model. Hence, a batch of results from the YOLO-based model provided accurate labels and candidate coordinates for the next drone destinations, and the drone could move towards them immediately. Some edge cases were later confirmed (or excluded) by the pixel-based approach, and only then were their coordinates given to the drone as the next destinations.
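The split between immediately usable results and results awaiting confirmation can be sketched as follows. The criteria (wells, small objects, objects at the image edges) come from the text; the edge margin, the dictionary layout of a detection, the category string, and the function names are hypothetical.

```python
SMALL_AREA_PX = 15_000   # small-object threshold used in the text
EDGE_MARGIN_PX = 10      # hypothetical tolerance for "at the image edge"


def touches_edge(box, image_size, margin=EDGE_MARGIN_PX):
    """True if the box (x1, y1, x2, y2) lies within `margin` of a border."""
    x1, y1, x2, y2 = box
    width, height = image_size
    return (x1 <= margin or y1 <= margin
            or x2 >= width - margin or y2 >= height - margin)


def needs_confirmation(det, image_size):
    """Wells, small objects and edge objects wait for the pixel-based check."""
    return (det["label"] == "well"
            or det["area_px"] < SMALL_AREA_PX
            or touches_edge(det["box"], image_size))


def schedule(detections, image_size):
    """Split detections into (immediate, deferred) destination candidates."""
    immediate, deferred = [], []
    for det in detections:
        if needs_confirmation(det, image_size):
            deferred.append(det)
        else:
            immediate.append(det)
    return immediate, deferred
```

Detections in the first list can be turned into drone destinations as soon as the fast YOLO pass completes, while those in the second list are released only after the slower pixel-based analysis has confirmed or excluded them.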

5. Discussion of the Approaches: Pros and Cons

The experimental evaluation of the YOLO-C3 hybrid system reveals a significant trade-off between computational efficiency and categorical accuracy. The pixel-based approach is notably efficient, requiring only 10 min for training compared to the 15 h and 30 min required by the YOLOv11-seg model. While YOLOv11-seg demonstrates superior overall performance, integrating spatial and contextual information to achieve a weighted precision and recall exceeding 90%, the pixel-based method excels at capturing fine-grained color details in categories with stable chromatic patterns, such as citrus and olive groves.

One of the primary advantages of the hybrid approach is its ability to handle low-confidence estimates. YOLO-based detection has been shown to be less accurate for small objects, objects located near image boundaries, or those obscured by shadows (e.g., houses near forest borders). In these instances, our system uses the pixel-color classifier to validate detections where the YOLO confidence score falls below 0.45. This is particularly evident for sparse or complex objects like wells, which the YOLO model occasionally missed but the pixel-based classifier successfully identified via yellow-pixel detection.

Furthermore, the study confirms that satellite-trained models can generalize to drone-acquired imagery. Despite differences in altitude, the vertical (nadiral) perspective remains consistent enough for the models to maintain high performance on real-world drone datasets. However, challenges remain regarding perspective distortion (oblique camera angles) and class imbalance, particularly for underrepresented categories like meadows.

6. Conclusions

This paper introduced YOLO-C3, a hybrid image-analysis component designed for deployment onboard drones. The approach combines the strong object-detection capabilities of YOLOv11-seg with a lightweight and deterministic pixel-color classifier. This design helps address two key challenges: the lack of specific Mediterranean agricultural datasets and the high computational cost typical of deep neural networks. The results suggest that the proposed hybrid system balances detection accuracy and processing speed. The YOLO-based model acts as the primary detection mechanism, while the pixel-based classifier serves as a transparent and efficient validation step for edge cases and small objects. Experimental results show that with appropriate post-processing, such as confidence filtering and size-based denoising, the system reaches a final precision of 0.917 on drone imagery.

The compact nature of the YOLO-C3 component makes it suitable for low-cost hardware such as the Raspberry Pi. This allows drones to autonomously adapt their trajectories in real time. This capability can support applications such as emergency response, post-event inspection, and infrastructure monitoring, helping quickly map safe corridors or locate buildings in disaster areas. In future work, we plan to expand the dataset with images captured across different seasons and light conditions.

Author Contributions

Conceptualization, E.T.; methodology, S.C., A.M., E.T. and G.V.; software, S.C., A.M. and G.V.; validation, S.C., A.M., E.S., E.T. and G.V.; data curation, S.C., A.M. and E.S.; writing—original draft preparation, E.S. and G.V.; writing—review and editing, E.T.; visualization, S.C., A.M., E.S. and G.V.; supervision, E.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data derived from public domain resources.

Acknowledgments

We acknowledge the support of the University of Catania PIACERI project TEAMS, PNRR project CN-HPC, Big Data and Quantum Computing, Spoke 2 Fundamental Research and Space Economy, and Innovation Grant Agri@Intesa.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wang, Z.; Nayak, A. LSCNet: A Lightweight Shallow Feature Cascade Network for Small Object Detection in UAV Imagery. Future Internet 2025, 17, 568.
2. Hamdi, A.; Noura, H.N. AI-Driven Damage Detection in Wind Turbines: Drone Imagery and Lightweight Deep Learning Approaches. Future Internet 2025, 17, 528.
3. Shahbaz, M.; Guergachi, A.; Noreen, A.; Shaheen, M. Classification by object recognition in satellite images by using data mining. In Proceedings of the World Congress on Engineering; The International Association of Engineers (IAENG): Hong Kong, China, 2012; Volume 1, pp. 4–6.
4. Abburu, S.; Golla, S.B. Satellite image classification methods and techniques: A review. Int. J. Comput. Appl. 2015, 119, 20–25.
5. Eslami, M.; Faez, K. Automatic traffic monitoring from satellite images using artificial immune system. In Proceedings of the Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR); Springer: Berlin/Heidelberg, Germany, 2010; pp. 170–179.
6. Khalil, M.; Li, J.; Sharif, A.; Khan, J. Traffic congestion detection by use of satellites view. In Proceedings of the 14th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP); IEEE: Piscataway, NJ, USA, 2017; pp. 278–280.
7. Shanmuga Priya, R.; Vani, K. Deep learning based forest fire classification and detection in satellite images. In Proceedings of the 11th International Conference on Advanced Computing (ICoAC); IEEE: Piscataway, NJ, USA, 2019; pp. 61–65.
8. Marletta, D.; Midolo, A.; Tramontana, E. Detecting Photovoltaic Panels in Aerial Images by Means of Characterising Colours. Technologies 2023, 11, 174.
9. Mao, H.; Chen, X.; Luo, Y.; Deng, J.; Tian, Z.; Yu, J.; Xiao, Y.; Fan, J. Advances and prospects on estimating solar photovoltaic installation capacity and potential based on satellite and aerial images. Renew. Sustain. Energy Rev. 2023, 179, 113276.
10. Aleem-ul Hassan, M.; Haider, S.; Ullah, K. Diagnostic study of heavy downpour in the central part of Pakistan. Pak. J. Meteorol. 2010, 7, 53–61.
11. Bhil, K.; Shindihatti, R.; Mirza, S.; Latkar, S.; Ingle, Y.; Shaikh, N.; Prabu, I.; Pardeshi, S.N. Recent progress in object detection in satellite imagery: A review. In Sustainable Advanced Computing: Select Proceedings of ICSAC 2021; Springer: Singapore, 2022; pp. 209–218.
12. Van Etten, A. You only look twice: Rapid multi-scale object detection in satellite imagery. arXiv 2018, arXiv:1805.09512.
13. Wang, A.; Tian, P.; Wang, S. High resolution satellite imagery segmentation based on adaptively integrated multiple features. In Proceedings of the Automatic Target Recognition and Image Analysis; and Multispectral Image Acquisition (MIPPR); SPIE: Bellingham, WA, USA, 2007; Volume 6786, pp. 812–818.
14. Atik, M.E.; Duran, Z.; Özgünlük, R. Comparison of YOLO versions for object detection from aerial images. Int. J. Environ. Geoinform. 2022, 9, 87–93.
15. Phiri, D.; Morgenroth, J. Developments in Landsat land cover classification methods: A review. Remote Sens. 2017, 9, 967.
16. Griffiths, P.; van der Linden, S.; Kuemmerle, T.; Hostert, P. A pixel-based Landsat compositing algorithm for large area land cover mapping. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013, 6, 2088–2101.
17. Li, J.; Ma, J.; Ye, X. A Batch Pixel-Based Algorithm to Composite Landsat Time Series Images. Remote Sens. 2022, 14, 4252.
18. Lekka, C.; Petropoulos, G.P.; Detsikas, S.E. Appraisal of EnMAP hyperspectral imagery use in LULC mapping when combined with machine learning pixel-based classifiers. Environ. Model. Softw. 2024, 173, 105956.
19. Moharram, M.A.; Sundaram, D.M. Land use and land cover classification with hyperspectral data: A comprehensive review of methods, challenges and future directions. Neurocomputing 2023, 536, 90–113.
20. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073.
21. Mao, M.; Hong, M. YOLO object detection for real-time fabric defect inspection in the textile industry: A review of YOLOv1 to YOLOv11. Sensors 2025, 25, 2270.
22. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
23. Hussain, M. Yolov5, yolov8 and yolov10: The go-to detectors for real-time vision. arXiv 2024, arXiv:2407.02988.
24. Bai, C.; Bai, X.; Wu, K. A review: Remote sensing image object detection algorithm based on deep learning. Electronics 2023, 12, 4902.
25. Chen, Z.; Wang, H.; Wu, X.; Wang, J.; Lin, X.; Wang, C.; Gao, K.; Chapman, M.; Li, D. Object detection in aerial images using DOTA dataset: A survey. Int. J. Appl. Earth Obs. Geoinf. 2024, 134, 104208.
26. He, L.h.; Zhou, Y.z.; Liu, L.; Cao, W.; Ma, J.H. Research on object detection and recognition in remote sensing images based on YOLOv11. Sci. Rep. 2025, 15, 14032.
27. He, L.; Zhou, Y.; Liu, L.; Zhang, Y.; Ma, J. Application of the YOLOv11-seg algorithm for AI-based landslide detection and recognition. Sci. Rep. 2025, 15, 12421.
28. Li, Q.; Shi, Y.; Auer, S.; Roschlaub, R.; Möst, K.; Schmitt, M.; Glock, C.; Zhu, X. Detection of undocumented building constructions from official geodata using a convolutional neural network. Remote Sens. 2020, 12, 3537.
29. Southworth, J.; Smith, A.C.; Safaei, M.; Rahaman, M.; Alruzuq, A.; Tefera, B.B.; Muir, C.S.; Herrero, H.V. Machine learning versus deep learning in land system science: A decision-making framework for effective land classification. Front. Remote Sens. 2024, 5, 1374862.
30. Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote sensing object detection in the deep learning era—A review. Remote Sens. 2024, 16, 327.
31. Nasiri, V.; Hawryło, P.; Janiec, P.; Socha, J. Comparing object-based and pixel-based machine learning models for tree-cutting detection with planetscope satellite images: Exploring model generalization. Int. J. Appl. Earth Obs. Geoinf. 2023, 125, 103555.
32. Adiningsih, N.S.; Setiawan, N. Pixel-Based vs. Object-Based Remote Sensing for Linking Land Use Change and Land Value Zone. Results Earth Sci. 2025, 3, 100120.
33. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 3974–3983.
34. Lam, D.; Kuzma, R.; McGee, K.; Dooley, S.; Laielli, M.; Klaric, M.; Bulatov, Y.; McCord, B. xview: Objects in context in overhead imagery. arXiv 2018, arXiv:1802.07856.
35. Waqas Zamir, S.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Shahbaz Khan, F.; Zhu, F.; Shao, L.; Xia, G.S.; Bai, X. isaid: A large-scale dataset for instance segmentation in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops; IEEE: Piscataway, NJ, USA, 2019; pp. 28–37.
36. Van Etten, A.; Lindenbaum, D.; Bacastow, T.M. Spacenet: A remote sensing dataset and challenge series. arXiv 2018, arXiv:1807.01232.
37. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. arXiv 2021, arXiv:2110.08733.
38. Garioud, A.; Giordano, S.; David, N.; Gonthier, N. FLAIR-HUB: Large-scale multimodal dataset for land cover and crop mapping. arXiv 2025, arXiv:2506.07080.
39. Alexandrova, S.; Tatlock, Z.; Cakmak, M. RoboFlow: A flow-based visual programming language for mobile manipulation tasks. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA); IEEE: Piscataway, NJ, USA, 2015; pp. 5537–5544.
40. Marletta, D.; Midolo, A.; Tramontana, E. Automatic Land Use and Land Cover Classification by Means of Characterising Colours. In Proceedings of the International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE); IEEE: Piscataway, NJ, USA, 2024; pp. 146–151.
41. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788.
42. Calcagno, S.; Scaletta, E.; Tramontana, E.; Verga, G. YOLO-based Recognition of some Crop Categories from Real-World Aerial Images. In Proceedings of the International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE); IEEE: Piscataway, NJ, USA, 2025; pp. 1–5.
43. Portakal, S.M.; Kindiroglu, A.A.; Ozturk, M.U. Real time incremental image mosaicking without use of any camera parameter. arXiv 2022, arXiv:2212.02302.

Figure 2. Drone workflow consisting of the definition of initial coordinates and the list of objects of interest, the systematic analysis of captured images to detect objects, the search of further objects in the neighborhood, and the acquisition of more detailed images.

Figure 3. Sample images downloaded from Google Maps.

Figure 4. Comparison between the pixel-based classification output (left), the corresponding ground-truth mask (center), and the original image (right) for an agricultural area. In the images, cyan pixels correspond to fields, red pixels represent citrus groves, yellow pixels show wells, green pixels indicate trees, and blue pixels represent houses.

Figure 5. Confusion matrix for the pixel-based approach: the values are normalized according to the number of pixels in the ground truth.

Figure 6. Images labeled by YOLO: each colored region is an object of a known category given in the label, together with the confidence score. Images were cropped, hence, some objects are without a visible label. Roads are shown in pink, trees in cyan, citrus groves in violet, and wells in green.

Figure 7. Comparison between the YOLO-based classification output (left) and the original image (right) for the same agricultural area analyzed by the pixel-based approach (Figure 4).

Figure 8. Confusion matrix for YOLO-based approach: the values are normalized according to the number of objects in the ground truth.

Figure 9. Comparison between the YOLO-based classification output (left), the pixel-based one (center) and the original images (right) for citrus groves.

Figure 10. Comparison between the YOLO-based classification output (left), the pixel-based one (center) and the original image (right) for citrus groves and a large well.

Figure 11. Examples of image patches extracted from drone imagery.

Figure 12. Examples of image patches extracted from drone imagery, labeled according to the YOLO-based approach.

Table 1. Comparison between our approach and related works.

Study	Goal	Task Type	Annotation	Classes	Limitations
Li et al. (2020) [28]	Compare CNNs to RGB filter-based methods	Binary semantic segmentation for buildings	Semantic pixel-level building masks	Binary (building vs. non-building)	No multi-class evaluation; no polygon-based RGB vs. DL head-to-head comparison
Nasiri et al. (2023) [31]	Compare PBIA to OBIA (ML classifiers)	Forest cover mapping and tree-cutting detection	Pixel-level and object-based segmentation	Binary land-cover classification (forest vs. non-forest)	No multi-class evaluation; no comparison with DL detectors; confined to PBIA–OBIA
Southworth et al. (2024) [29]	Compare ML to DL	Cross-domain methodological evaluation	Dataset-dependent	Domain-dependent	No controlled head-to-head evaluation on the dataset
He et al. (2025) [27]	Evaluate YOLOv11-seg (DL detector)	Object detection and mask segmentation for landslide recognition	Polygon-based instance-level annotations	Single-target (landslide)	No multi-class evaluation; no comparison to RGB-based approaches on the same dataset
He et al. (2025) [26]	Evaluate YOLOv11 (DL detector)	Multi-class object detection for generic man-made remote sensing objects	Instance-level bounding-box annotations	Multi-class (vehicles, aircraft, ships, infrastructure)	No polygon-based segmentation; no comparison with RGB-based approaches; does not address agricultural or rural land-cover categories
Ours	Compare pixel-based to YOLOv11-seg-based detection	Multi-class object detection and polygon-based segmentation in high-resolution aerial imagery	Polygon-based instance-level annotations	Multi-class evaluation (agricultural and man-made)

Table 2. Comparison between our dataset and existing remote sensing datasets in terms of task, annotation type, and limitations.

Dataset	Task	Annotation	Classes	Main Characteristics and Limitations
FLAIR-HUB [38]	Semantic segmentation (land-cover and crop mapping)	Pixel-level masks	19 semantic land-cover classes (vegetation, artificial surfaces, water bodies, natural bare areas) + 46 hierarchical crop-type categories	Primarily designed for large-scale semantic land-cover segmentation; not organized as an instance-level object detection dataset with separate annotated objects
LoveDA [37]	Semantic segmentation + unsupervised domain adaptation (urban/rural land cover)	Pixel-level masks	Building, road, water, barren, forest, agriculture, background	Focused on coarse land-cover classes at pixel level; not designed for fine-grained object-level polygon detection of specific agricultural and man-made categories
DOTA [33]	Oriented object detection (man-made objects)	Oriented bounding boxes	Generic man-made object categories (vehicles, aircraft, maritime objects, infrastructure, large facilities)	Focused on generic object detection using bounding boxes; not tailored to land-cover categories or polygon-based multi-class segmentation
Ours	Instance segmentation (land-cover and crop mapping)	Polygon-based instance-level annotations (multi-class)	Olive grove, citrus grove, tree, house, road, land, well, meadow

Table 3. Number of instances for the 177 training images.

Category	Total Instances
Citrus groves	749
Trees	564
Roads	483
Houses	427
Wells	101
Meadows	168
Fields	332
Olive groves	182
All categories	3006

Table 4. Colors gathered for the pixel-based approach, for each category: total unique RGB colors extracted from the training dataset (Total Colors), colors removed by filtering (Removed), unique RGB colors remaining after filtering (Filtered), and the resulting retention percentage.

Category	Total Colors	Removed	Filtered	Retention (%)
Citrus groves	242,990	276,809	117,700	48.44
Trees	210,866	295,226	39,003	18.50
Houses	132,003	237,170	35,620	26.98
Wells	63,495	125,779	15,018	23.65
Fields	199,107	278,312	41,968	21.08
Roads	166,559	299,867	28,070	16.85
Olive groves	117,913	189,158	54,667	46.36
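
The Retention column appears consistent with the ratio of filtered to total colors, expressed as a percentage. This reading is inferred from the reported values rather than stated in the caption, so the sketch below should be taken as a check under that assumption. The function name `retention_percent` is hypothetical; the numbers are copied from Table 4.

```python
def retention_percent(total_colors, filtered_colors):
    """Share of unique RGB colors kept after filtering, as a percentage."""
    return 100.0 * filtered_colors / total_colors


# (total colors, filtered colors) per category, as listed in Table 4.
table4 = {
    "Citrus groves": (242_990, 117_700),
    "Trees": (210_866, 39_003),
    "Houses": (132_003, 35_620),
    "Wells": (63_495, 15_018),
    "Fields": (199_107, 41_968),
    "Roads": (166_559, 28_070),
    "Olive groves": (117_913, 54_667),
}

retention = {
    category: retention_percent(total, filtered)
    for category, (total, filtered) in table4.items()
}
for category, value in retention.items():
    print(f"{category}: {value:.2f}%")
```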

Table 5. The set of mean ($μ$) and standard deviation ($σ$) measures of a 5-fold cross-validation for the pixel-based classification.

Category	$μ$ Accuracy ( $σ$ )	$μ$ Precision ( $σ$ )	$μ$ Recall ( $σ$ )	$μ$ F1 Score ( $σ$ )
Citrus groves	0.670 (0.02)	0.721 (0.02)	0.733 (0.04)	0.699 (0.03)
Trees	0.894 (0.01)	0.836 (0.05)	0.129 (0.07)	0.155 (0.06)
Houses	0.948 (0.01)	0.888 (0.04)	0.083 (0.04)	0.143 (0.07)
Wells	0.806 (0.20)	1.00 (0.00)	0.077 (0.03)	0.142 (0.05)
Fields	0.801 (0.03)	0.656 (0.07)	0.413 (0.04)	0.474 (0.06)
Roads	0.927 (0.01)	0.630 (0.06)	0.171 (0.02)	0.232 (0.04)
Olive groves	0.730 (0.06)	0.675 (0.08)	0.955 (0.03)	0.787 (0.05)
All categories	0.828 (0.013)	0.764 (0.031)	0.375 (0.015)	0.383 (0.022)
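
As an illustration of how one cell of such a table can be obtained, the sketch below aggregates per-fold scores into a mean and a standard deviation. The per-fold values are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the table does not state whether the sample or the population form was used (`statistics.pstdev` would give the latter).

```python
from statistics import mean, stdev


def summarize_folds(fold_scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return mean(fold_scores), stdev(fold_scores)


# Hypothetical F1 scores from a 5-fold cross-validation (not from the paper).
hypothetical_f1 = [0.70, 0.72, 0.68, 0.66, 0.74]

mu, sigma = summarize_folds(hypothetical_f1)
# Formatted in the same "mean (standard deviation)" style as the table cells.
print(f"{mu:.3f} ({sigma:.2f})")
```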

Table 6. The set of mean ($μ$) and standard deviation ($σ$) measures of a 5-fold cross-validation for the YOLO-based classification.

Category	$μ$ Accuracy ( $σ$ )	$μ$ Precision ( $σ$ )	$μ$ Recall ( $σ$ )	$μ$ F1 Score ( $σ$ )
Citrus groves	0.892 (0.05)	0.957 (0.03)	0.928 (0.07)	0.942 (0.29)
Trees	0.834 (0.08)	0.876 (0.03)	0.949 (0.08)	0.908 (0.04)
Houses	0.690 (0.03)	0.824 (0.11)	0.839 (0.14)	0.816 (0.02)
Wells	0.823 (0.04)	0.978 (0.49)	0.841 (0.05)	0.902 (0.03)
Fields	0.842 (0.02)	0.927 (0.06)	0.908 (0.05)	0.914 (0.01)
Roads	0.735 (0.03)	0.943 (0.04)	0.772 (0.05)	0.847 (0.02)
Olive groves	0.961 (0.03)	0.961 (0.03)	1.00 (0.00)	0.980 (0.01)
All categories	0.825 (0.043)	0.924 (0.057)	0.890 (0.060)	0.901 (0.026)

Table 7. Metrics computed for each post-processing stage. The first row reports results without post-processing; the second after score filtering; the third after small-object removal.

Stage	Precision	Recall	Accuracy
No post-processing	0.746	0.867	0.738
Step 1 (score filtering)	0.778	0.867	0.758
Step 2 (small-object removal)	0.917	0.917	0.867
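
The two post-processing stages named in the caption can be sketched as successive filters over a list of detections. This is a minimal illustration and not the authors' implementation: the `Detection` class, the function names, the detections, and both thresholds (`min_score=0.5`, `min_area_px=100`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    score: float   # confidence score reported by the detector
    area_px: int   # area of the predicted region, in pixels


def filter_by_score(detections, min_score):
    """Stage 1: keep detections whose confidence reaches the threshold."""
    return [d for d in detections if d.score >= min_score]


def remove_small_objects(detections, min_area_px):
    """Stage 2: drop detections whose region is smaller than the threshold."""
    return [d for d in detections if d.area_px >= min_area_px]


# Hypothetical detections (not taken from the paper).
detections = [
    Detection("tree", 0.91, 1200),
    Detection("well", 0.35, 900),
    Detection("house", 0.80, 40),
    Detection("road", 0.77, 5000),
]

step1 = filter_by_score(detections, min_score=0.5)      # assumed threshold
step2 = remove_small_objects(step1, min_area_px=100)    # assumed threshold
print([d.label for d in step2])
```

Applying the stages in this order means that the area check runs only on detections that already passed the score check, which mirrors the row order of the table.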

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Calcagno, S.; Midolo, A.; Scaletta, E.; Tramontana, E.; Verga, G. Detecting Objects in Aerial Imagery Using Drones and a YOLO-C3 Hybrid Approach. Future Internet 2026, 18, 204. https://doi.org/10.3390/fi18040204

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
