The test set was used to evaluate both approaches. The 33 test images comprised both simple cases, in which the images had a single category and a single object, and complex cases, in which some images had nearly all categories with multiple objects per category. This section presents and discusses the results obtained for each approach. In the images, the most frequently detected object category was citrus groves, with more than 150 instances identified. In contrast, meadows were the least represented category and were counted only once. Consequently, data for this category were omitted because their limited occurrence could lead to misleading interpretations.
4.1. Evaluation of the Pixel-Based Approach
Before classification, the pixel-based algorithm analyzed and extracted the colors in each land-cover category in the dataset. The process was applied to all 177 images, each with a label for every object it contains, and computed the number of unique RGB colors belonging to each target category. To reduce noise and improve reliability, colors that frequently appeared in more than one category were filtered out, retaining only the most representative colors.
Table 4 reports, for each category, the total number of unique RGB colors initially detected in the training dataset, the number of colors removed because they also occurred in other categories, and the number and percentage of unique RGB colors retained after filtering.
The extraction results confirmed that the algorithm effectively isolated distinctive color distributions for all classes, even though the percentage of retained colors varied significantly among categories. Higher retention rates, such as those observed for citrus groves (48.44%) and olive groves (46.36%), indicated stable and homogeneous color patterns across images. Conversely, lower retention for roads (16.85%) and trees (18.50%) reflected a broad variability in color, caused by factors such as shadows, illumination change, or the presence of mixed pixels containing soil, vegetation, and asphalt. Intermediate retention values for houses, wells and fields (around 21–27%) suggested moderate intra-class consistency, while the meadows category, with only 13.50% retention, appeared to be the most visually ambiguous and underrepresented. Overall, this phase provided a compact and representative color database, reducing noise and supporting the subsequent classification stage that was performed on a pixel-by-pixel basis.
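The extraction and filtering stage described above can be illustrated with a minimal sketch. It assumes a simplified criterion in which any color observed in more than one category is dropped, whereas the text only states that frequently shared colors were filtered out. The function names and the input format (pairs of category and RGB tuple) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def extract_category_colors(labeled_pixels):
    """Collect the set of unique RGB colors observed for each category.

    labeled_pixels: iterable of (category, (r, g, b)) pairs (hypothetical format).
    """
    colors = defaultdict(set)
    for category, rgb in labeled_pixels:
        colors[category].add(tuple(rgb))
    return dict(colors)


def filter_shared_colors(colors):
    """Keep only colors that belong to exactly one category (assumed criterion)."""
    owners = defaultdict(int)
    for palette in colors.values():
        for rgb in palette:
            owners[rgb] += 1
    return {
        category: {rgb for rgb in palette if owners[rgb] == 1}
        for category, palette in colors.items()
    }


def retention_percent(colors, filtered):
    """Percentage of unique colors retained per category after filtering."""
    return {
        category: 100.0 * len(filtered[category]) / len(palette)
        for category, palette in colors.items()
        if palette
    }


pixels = [
    ("citrus", (10, 120, 30)),
    ("citrus", (12, 118, 28)),
    ("road", (90, 90, 90)),
    ("road", (10, 120, 30)),  # also seen in citrus, so it is dropped from both
]
colors = extract_category_colors(pixels)
filtered = filter_shared_colors(colors)
retention = retention_percent(colors, filtered)
```

In this toy input each category keeps one of its two colors, so both retention values are 50%. A frequency-based criterion, closer to the wording in the text, would count how often a color occurs in each category before deciding whether to remove it.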
Figure 4 compares the output of the pixel-based classification (left) with the corresponding ground-truth mask (center) for an agricultural area dominated by citrus groves and fields (right). The left and center images correspond closely, with red regions indicating citrus groves and cyan areas representing fields. The pixel-based classifier successfully identifies the dominant categories, maintaining coherent and homogeneous patches that closely match the ground-truth segmentation.
The test images were classified, and a label was given to each pixel according to its RGB coordinates (as shown in the left part of Figure 4). The numbers of true positives, false positives, true negatives, and false negatives were then calculated. We performed K-fold cross-validation with K equal to five: the set was split into five parts (folds), and in each of five experiments four folds were used for training and the remaining fold for testing. This produced five models, and for each we calculated the number of pixels identified correctly or incorrectly.
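As an illustration of the procedure above, the following sketch labels pixels by an exact lookup of their RGB value, counts the four outcomes for one category, and builds five folds over 177 images. The exact-match lookup, the contiguous fold layout, and all function names are assumptions made for this example; the text does not specify how pixels with unseen colors are handled or how the folds were drawn.

```python
def classify_pixel(rgb, color_table, default="background"):
    """Label a pixel by exact RGB lookup; unseen colors fall back to `default`."""
    return color_table.get(tuple(rgb), default)


def confusion_counts(predicted, truth, category):
    """Return (tp, fp, tn, fn) for one category over paired pixel labels."""
    tp = fp = tn = fn = 0
    for p, t in zip(predicted, truth):
        if p == category and t == category:
            tp += 1
        elif p == category:
            fp += 1
        elif t == category:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def k_fold_indices(n_items, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_items) if i < start or i >= stop]
        yield train, test
        start = stop


color_table = {(10, 120, 30): "citrus", (90, 90, 90): "road"}
image_pixels = [(10, 120, 30), (90, 90, 90), (1, 2, 3), (10, 120, 30)]
predicted = [classify_pixel(rgb, color_table) for rgb in image_pixels]
truth = ["citrus", "citrus", "citrus", "road"]
counts = confusion_counts(predicted, truth, "citrus")
folds = list(k_fold_indices(177, k=5))
```

With 177 images and five folds, the test folds contain 36, 36, 35, 35, and 35 images, and each image is used for testing exactly once.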
Table 5 shows the mean and standard deviation of accuracy, precision, recall, and F1 score for each category. The low standard deviations indicate that the classification is consistent across folds. The standard deviation for accuracy is between 0.01 and 0.06 across categories, except for the wells category, which presents high color variability.
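For reference, the four metrics follow the standard definitions based on the counts of true positives, false positives, true negatives, and false negatives. The sketch below computes them for each fold and then aggregates the mean and standard deviation. It assumes the sample standard deviation; the text does not state whether the sample or population form was used, and the function names are illustrative.

```python
from statistics import mean, stdev


def metrics(tp, fp, tn, fn):
    """Standard accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def summarize_folds(fold_counts):
    """Return {metric: (mean, sample standard deviation)} over the folds.

    fold_counts: list of (tp, fp, tn, fn) tuples, one per fold (at least two).
    """
    per_fold = [metrics(*counts) for counts in fold_counts]
    summary = {}
    for name in per_fold[0]:
        values = [m[name] for m in per_fold]
        summary[name] = (mean(values), stdev(values))
    return summary


single = metrics(tp=8, fp=2, tn=88, fn=2)
summary = summarize_folds([(8, 2, 88, 2), (8, 2, 88, 2)])
```

Because pixel counts are dominated by true negatives for small categories, accuracy can remain high even when recall is low, which is one reason the per-category precision and recall are reported alongside it.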
The results show that the pixel-based approach achieves satisfactory accuracy across most categories, with houses, trees, and roads performing particularly well. High precision values for houses and wells show that the model reliably identifies these categories with very few false positives, confirming that the color-based discrimination is effective when color patterns are distinctive. Notably, both olive groves and citrus groves achieve balanced and robust performance, with high values for accuracy, precision and recall, indicating stable color patterns and effective recognition. Olive and citrus groves have the highest values for the F1 score. Lower recall values observed in trees, houses, and wells suggest that many true pixels were missed, mainly because colors alone cannot fully capture variations due to shadows, roof materials, or occlusions. These limitations align with the lower retention percentages observed during the extraction stage, where color variability was highest. The absence of reliable metrics for the meadows category further confirms its color ambiguity and limited representation.
The experiments show that the proposed color-based pixel approach distinguishes among multiple land types using a purely data-driven filtering process. This approach is transparent, interpretable, and computationally efficient, and these characteristics make it suitable for rapid land-cover assessments and as a preprocessing or validation step for object-based deep learning models.
The visual agreement in Figure 4 confirms the reliability of the color extraction process, particularly for categories with distinctive chromatic characteristics such as citrus groves, which achieved the highest color retention (48.44%) (as shown in Table 4). The land category, characterized by lower color retention (21.08%), displays slightly fragmented regions and small edge inconsistencies, which can be explained by the variability in soil color and lighting conditions. Minor discrepancies along field boundaries are primarily due to mixed pixels and gradual transitions between soil and vegetation, which occasionally lead to local misclassifications. Nevertheless, the comparison highlights the effectiveness of the proposed approach in distinguishing large homogeneous regions using only RGB information, producing interpretable and high-resolution maps that reflect the real distribution of land-cover classes in the area.
Figure 5 shows the confusion matrix computed according to the results of the pixel-based approach when considering the whole dataset (177 images) with a split of 80% for training and 20% for testing. Rows give the normalized values for detected pixels, while columns represent the ground truth. The background class in the column indicates the predicted pixels that are outside any labeled region, whereas the background class in the row represents missed ground-truth pixels (false negatives). The most accurate category is olive groves, with most pixels correctly identified, followed by the fields category, which also obtains good performance. Note that the normalized values were computed from the count of pixels; hence, there is fine-grained detail, and this degrades the performance of categories that occupy small areas in the images, such as houses and wells.
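The layout described above, with predictions on the rows, ground truth on the columns, and an extra background class, can be sketched as follows. Column-wise normalization (each ground-truth column sums to one) is an assumption for this example, since the text only says that the values are normalized; the function names are illustrative.

```python
def confusion_matrix(predicted, truth, classes):
    """Count matrix with rows = predicted class and columns = ground-truth class."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for p, t in zip(predicted, truth):
        matrix[index[p]][index[t]] += 1
    return matrix


def normalize_columns(matrix):
    """Divide each entry by its column total (assumed normalization)."""
    n = len(matrix)
    totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [
        [matrix[r][c] / totals[c] if totals[c] else 0.0 for c in range(n)]
        for r in range(n)
    ]


classes = ["citrus", "field", "background"]
predicted = ["citrus", "citrus", "background", "field"]
truth = ["citrus", "field", "citrus", "field"]
counts = confusion_matrix(predicted, truth, classes)
normalized = normalize_columns(counts)
```

In this toy example the entry in the background row and citrus column is 0.5: half of the ground-truth citrus pixels were predicted as background, that is, they are false negatives for citrus.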
The confusion matrix reveals several key performance insights that characterize the model’s behavior prior to final optimization. While the model identifies the major land features well, some mismatches occur between spectrally similar vegetation types, e.g., citrus groves and olive groves, or citrus groves and trees. Additionally, a portion of the smaller or peripheral objects, such as wells and houses, is categorized as background. These results represent the model’s output before the subsequent post-processing stage. This final refinement phase mitigates these edge-case misclassifications and resolves many of the background overlaps, ultimately leading to more accurate results.
4.2. Evaluation of the YOLO-Based Approach
During the classification phase, the YOLO-based approach analyzed an image and highlighted the detected objects with different colors. Given an input image, the model produced a copy of that image with additional labels indicating each detected object and its corresponding category. For every detected object, the following data were given: (i) the category name, (ii) a bounding box enclosing the object and (iii) a confidence score. The confidence score represented the model’s certainty that a detected object belonged to a particular class. It ranged from zero to one, with higher scores indicating greater confidence. The model used the default threshold of 0.25; hence, objects were marked if they were given a confidence score of at least 0.25. In our post-processing step, the results were further refined by filtering out detections with a confidence score below 0.45 and by removing small objects, i.e., objects having fewer than 15,000 pixels. The values of these parameters were set through a manual analysis of image samples to increase confidence and accuracy. Visual inspection of the results showed that most objects were detected correctly, even in the most complex images.
Figure 6 shows a sample of the results, where the only missing object was the house in the top-right image, which was likely ignored due to the shadow. Other cases of undetected objects occurred because mud or shadows obstructed object recognition. Nevertheless, the overall results were highly reliable, as objects were correctly identified and accurately labeled regardless of their shape and size. The top-left image shows a label named House 0.26, an example of a detection with a low confidence score, which often occurs when the object is at the edge of the image.
Figure 7 shows the YOLO-based classification for the image analyzed by the pixel-based approach (see Figure 4). In it, citrus groves and fields (labeled Land) are correctly identified. Some small objects (see labels House and Tree) have a confidence score lower than 0.45 and would be filtered out in the post-processing phase.
Table 6 reports the mean values and the standard deviation values for accuracy, precision, recall and F1 score metrics obtained for each category, when running a 5-fold cross-validation for the dataset. Cross-validation was performed similarly to the previous pixel-based experiment. The low values for the standard deviation for accuracy across all categories show that the five models (one for each run of the five partitions) give comparable and consistent results. Wells have the highest value of standard deviation for the precision metric; this is due to the high color variability of wells. In terms of accuracy, olive groves and citrus groves perform very well and achieve the highest value, while houses and roads have the lowest. With respect to recall, the values are generally high, while roads have the lowest value. Low values for roads could be due to their highly variable shapes and the presence of unrecognized elements within them, such as isolated trees, mud, or grass. Still, roads achieve an accuracy above 73% and recall above 77%. Overall, all categories reach satisfactory performance levels, as the average accuracy is above 82%, precision is above 92% and recall is 89%.
Figure 8 shows the confusion matrix computed according to the results of the YOLO-based approach when considering the whole dataset (177 images) with a split of 80% for training and 20% for testing. Rows give the normalized values for detected objects in each category, while columns represent the ground truth. The background category in the column indicates the predicted objects outside any labeled region, whereas the background class in the row gives the missed ground-truth objects (false negatives). The categories are generally correctly identified, with citrus groves achieving the highest value, followed by houses and wells. Fields present the lowest performance, mainly due to their similarity with other vegetation categories. The YOLO libraries compute the confusion matrix during the validation phase; however, these results do not account for the subsequent post-processing steps, which enhance performance and provide better metrics. The main mismatched detections are for the small objects and the objects located at the edges of the image.
For the YOLO-based experiments, the number of training epochs was 200, a value within the range 100–300 that is commonly suggested for deep learning and that accommodates several categories. In all the experiments for the 5-fold cross-validation, the models consistently converged between epochs 101 and 103, indicating stable behavior and, consequently, a robust model (no overfitting). This finding was further supported by the metrics (see Table 6), particularly the low standard deviation values, which indicate limited variability across folds and suggest stability and the ability to generalize.
4.3. Comparison of Approaches
Figure 9 shows two images in the right column that represent citrus groves. In the two images, the YOLO-based classifier (left-column images) correctly identifies only a portion of the area that has citrus groves. Instead, the pixel-based classifier (center-column images) correctly determines the whole area that has citrus groves.
Figure 10 shows an image (on the right) representing citrus groves and a large well having a rectangular shape. The YOLO-based classifier (left image) partially identifies the citrus groves but misses the well. The pixel-based classifier detects the citrus groves (highlighted in red) and some parts of the well (yellow pixels).
The comparison of the two approaches is apparent from the results shown in the confusion matrices (see Figure 5 and Figure 8). The matrices report the results obtained when training was performed using 80% of the whole dataset. The results highlight the superior performance of the YOLO-based approach, which integrates contextual and spatial information to improve object recognition. However, the pixel-based approach demonstrates its effectiveness in capturing fine-grained color details and achieves high accuracy for some categories, such as citrus groves. In a minority of cases, the pixel-based approach yields more accurate results (see Figure 9 and Figure 10, where the red pixels in the center images indicate citrus groves and the yellow pixels indicate a well).
Based on the evaluation above, our YOLO-C3 image analysis component runs both methods on the input image. A label is accepted as accurate when both methods assign the same category. When the two methods assign different labels, which typically happens for objects at the image edges, wells, or small objects that the YOLO-based method does not detect accurately (confidence score below the threshold of 0.45), the label suggested by the pixel-based approach is retained and considered accurate. Otherwise, the result of the YOLO-based method is confirmed. Additionally, moving the drone to the center of the object helps capture it better, enabling further labeling or confirmation of the previous label.
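The decision rule described above can be read as the following minimal sketch. It is one possible interpretation rather than the YOLO-C3 implementation: the `Detection` type, the `fuse_labels` function, and the convention that the pixel-based method returns `None` when it suggests no category for the region are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

CONF_THRESHOLD = 0.45  # threshold stated in the text


@dataclass
class Detection:
    """A YOLO-based detection (hypothetical container for this sketch)."""
    category: str
    confidence: float


def fuse_labels(yolo: Detection, pixel_category: Optional[str]) -> str:
    """Combine the YOLO-based label with the pixel-based suggestion."""
    if pixel_category == yolo.category:
        # Both methods agree: the label is accepted as accurate.
        return yolo.category
    if yolo.confidence < CONF_THRESHOLD and pixel_category is not None:
        # Disagreement on a low-confidence detection: keep the pixel-based label.
        return pixel_category
    # Otherwise the YOLO-based result is confirmed.
    return yolo.category


agreed = fuse_labels(Detection("House", 0.90), "House")
overridden = fuse_labels(Detection("House", 0.26), "Well")
kept = fuse_labels(Detection("Citrus Grove", 0.80), "Tree")
no_suggestion = fuse_labels(Detection("House", 0.30), None)
```

Here only the low-confidence House detection is relabeled as Well; the confident Citrus Grove detection is kept even though the pixel-based method disagrees.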
4.4. Drone Image Analysis
The drones captured images in predefined geographical areas and moved towards some destinations according to the provided list of objects of interest. The images acquired were analyzed by our YOLO-C3 component, which leveraged both algorithms to achieve more specific and detailed detections and to provide appropriate feedback. The approach was tested on a dataset containing drone images. The dataset used was odm_data_aukerman (https://github.com/OpenDroneMap/odm_data_aukerman, accessed on 30 March 2026). It contains 32 images with a resolution of 4896 × 3672 pixels and 37 images with a resolution of 6000 × 4000 pixels [43]. Image patches were extracted, and both approaches were evaluated (see Figure 11). The results, shown below, confirmed the robustness and validity of the training performed on satellite imagery.
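Patch extraction from the high-resolution drone images can be sketched as a simple tiling of the image plane. The patch size of 640 pixels and the handling of the image borders (the last tile is shifted inward so that every patch has the same size) are assumptions for this example; the text does not report the patch size or the tiling strategy that was used.

```python
def patch_boxes(width, height, patch=640):
    """Return (left, top, right, bottom) boxes that cover the whole image.

    `patch` is a hypothetical tile size. When the image size is not a multiple
    of it, the last tile on each axis is shifted inward and overlaps its neighbor.
    """
    def starts(size):
        if size <= patch:
            return [0]
        positions = list(range(0, size - patch + 1, patch))
        if positions[-1] + patch < size:
            positions.append(size - patch)
        return positions

    return [
        (x, y, min(x + patch, width), min(y + patch, height))
        for y in starts(height)
        for x in starts(width)
    ]


boxes = patch_boxes(4896, 3672)
```

For a 4896 × 3672 image this yields 8 columns and 6 rows, that is, 48 patches of 640 × 640 pixels.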
In the case of YOLO, an additional post-processing stage was applied to obtain more reliable results. During post-processing, the following steps were performed:
Step-1: Predictions with a confidence score lower than 0.45 were discarded. This filtering removed less reliable predictions.
Step-2: Predictions corresponding to small objects were removed. An object was considered too small if it contained fewer than 15,000 pixels. This size-based filtering was consistent with the medium-to-large dimensional characteristics of the analyzed categories.
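The two steps above amount to a pair of filters applied in sequence. In the sketch below, the thresholds come from the text, while the `Box` type and the use of the bounding-box area as the object size are assumptions; the text does not say whether the 15,000-pixel limit refers to the bounding box or to a segmented region.

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.45     # step-1 threshold from the text
MIN_AREA_PIXELS = 15_000  # step-2 threshold from the text


@dataclass
class Box:
    """A detection with its bounding box in pixel coordinates (hypothetical type)."""
    category: str
    confidence: float
    x1: int
    y1: int
    x2: int
    y2: int

    @property
    def area(self) -> int:
        # Assumption: object size is approximated by the bounding-box area.
        return max(0, self.x2 - self.x1) * max(0, self.y2 - self.y1)


def step1_confidence(detections):
    """Step-1: discard predictions with a confidence score lower than 0.45."""
    return [d for d in detections if d.confidence >= MIN_CONFIDENCE]


def step2_size(detections):
    """Step-2: discard objects containing fewer than 15,000 pixels."""
    return [d for d in detections if d.area >= MIN_AREA_PIXELS]


def postprocess(detections):
    return step2_size(step1_confidence(detections))


detections = [
    Box("House", 0.26, 0, 0, 200, 200),         # removed in step-1 (low confidence)
    Box("House", 0.80, 0, 0, 100, 100),         # removed in step-2 (10,000 pixels)
    Box("Citrus Grove", 0.90, 0, 0, 300, 200),  # kept (60,000 pixels)
]
after_step1 = step1_confidence(detections)
kept = postprocess(detections)
```

Keeping the two filters as separate functions makes it straightforward to compute metrics after each step.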
To validate the post-processing procedure, metrics were computed for each step performed (see Table 7). In these cases, the improvement in the metrics was strongly influenced by the house category: (i) without post-processing, 14 houses were detected; (ii) after step-1, the detections decreased to eight; (iii) after step-2, five houses were detected, of which three were correct. By eliminating small objects, we avoided, for example, confusing houses with cars, which are smaller objects.
Figure 12 shows the results obtained using YOLO. The left-most image shows several objects, including wells and meadows (the same as lawns), which are removed by post-processing step-1 because of their low confidence scores. The objects labeled Land and the two labeled House are removed in step-2. In the center image, the object labeled House is removed by step-2; it is actually a crane. In the right-most image, the Citrus Grove label is removed during step-1. In the last image, one of the houses has not been labeled, because the excessively oblique camera perspective alters the standard shape of the house.
Tests executed using YOLO took 0.01 s per image, while the analysis relying on the pixel-based approach took approximately 2.8 s per image. The two models ran in parallel. Most of the time, the YOLO-based model gave accurate results, and only small objects, objects at the edges, or wells had to be confirmed by the pixel-based model. Hence, the YOLO-based model quickly provided a batch of accurate results and possible coordinates for the next drone destinations, so the drone could move towards them immediately. Some edge cases were later confirmed (or excluded) by the pixel-based approach, and only then were their coordinates given to the drone as the next destinations.
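The scheduling idea described above, in which the fast model releases confident results immediately while uncertain ones wait for the slower model, can be sketched with two concurrent tasks. The functions `run_yolo` and `run_pixel_based` are hypothetical stand-ins that return fixed values, and using the confidence score alone to identify the cases needing confirmation is a simplification of the criteria named in the text (small objects, objects at the edges, and wells).

```python
from concurrent.futures import ThreadPoolExecutor


def run_yolo(image):
    """Hypothetical stand-in for the fast YOLO-based model."""
    return [
        {"id": 1, "category": "Citrus Grove", "confidence": 0.91},
        {"id": 2, "category": "Well", "confidence": 0.30},
        {"id": 3, "category": "House", "confidence": 0.28},
    ]


def run_pixel_based(image):
    """Hypothetical stand-in for the slower pixel-based model (object id -> category)."""
    return {1: "Citrus Grove", 2: "Well", 3: "Road"}


def analyze(image, threshold=0.45):
    """Return (destinations available immediately, destinations confirmed later)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        yolo_future = pool.submit(run_yolo, image)
        pixel_future = pool.submit(run_pixel_based, image)

        # The YOLO-based result arrives first; confident detections go out at once.
        detections = yolo_future.result()
        immediate = [d for d in detections if d["confidence"] >= threshold]
        pending = [d for d in detections if d["confidence"] < threshold]

        # Uncertain detections wait for the pixel-based result.
        pixel_labels = pixel_future.result()

    confirmed = [d for d in pending if pixel_labels.get(d["id"]) == d["category"]]
    return immediate, confirmed


immediate, confirmed = analyze(image=None)
```

In this toy run the citrus grove is available immediately, the well is confirmed once the pixel-based result arrives, and the low-confidence house is excluded because the pixel-based method suggests a different category.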