Article

Aerial Image Analysis: When LLMs Assist (And When Not)

Dipartimento di Matematica e Informatica, University of Catania, 95125 Catania, Italy
*
Author to whom correspondence should be addressed.
Future Internet 2026, 18(2), 77; https://doi.org/10.3390/fi18020077
Submission received: 15 December 2025 / Revised: 19 January 2026 / Accepted: 20 January 2026 / Published: 1 February 2026

Abstract

Large language models (LLMs) have shown remarkable results when tasked with the analysis and production of texts or images and for captioning images. Aerial images differ from other images since they exhibit many natural objects that have a highly variable color range and no clear contours. This paper reports to what extent an LLM, i.e., Llama-4, can be tasked with the identification and captioning in aerial images of natural objects, such as tree categories, uncultivated land, and some man-made objects, such as roads. This valuable automation is needed to scan large areas and detect the parts for which a sudden maintenance or an emergency intervention is due. Tests on the chosen LLM were performed against a custom image dataset built to overcome the limited availability of such a domain-specific aerial image set. To evaluate the identification and captioning results, the accuracy, precision and recall metrics were computed. The results given by a cutting-edge variant of Llama-4, namely Maverick, reveal its strengths and weaknesses in this context. Although it is remarkable that an out-of-the-box tool can give assistance in such a complex observation and detection task, substantial progress is needed for such a model to improve accuracy and constitute a reliable support, as accuracy is at most 58.6% and recall is at most 56.1%.


1. Introduction

Large language models (LLMs) have been widely used to assist developers in several tasks, e.g., assessing the right level of security and privacy [1], analyzing cybersecurity posts [2], quickly implementing code [3], summarizing documents, translating short texts, captioning images [4], supporting education and healthcare [5], and annotating images [6,7]. Although LLMs exhibit remarkable proficiency in mimicking some human tasks related to the production of natural language, their application in remote sensing and satellite image interpretation has just begun. Current LLM-based systems are still unable to fully extract or comprehend information from aerial images and may generate inaccurate results. This is likely due to the lack of domain-specific knowledge, which limits their usability in this context [8]. In fact, some man-made objects, e.g., airplanes and trucks seen in aerial images, can be detected due to their characteristic contours and the great importance they have had in training LLMs. Conversely, other man-made objects and especially natural objects are much more difficult to detect [9,10,11]. Previous approaches have detected man-made objects, such as cars or airplanes, in aerial images using multi-layer neural networks [12,13]. Although these approaches have proposed extensive datasets and the training of neural-based models, their experiments focused on man-made objects, unlike ours, which aim primarily at detecting natural objects, such as trees and uncultivated land. In another experiment, deforestation was detected using a deep learning model and images that were annotated according to forest loss [14].
Unlike previous research, the objective of our experiments is to test the ability of an out-of-the-box LLM to detect several natural objects and some man-made objects from aerial images. This can be valuable support for achieving an advanced level of automation when large areas need to be monitored for security or maintenance goals and to discover unusual conditions that may require prompt intervention. In the proposed scenario, automation can be brought about by using a set of drones that scan a large area and capture images; an LLM then assists in identifying the conditions of objects of interest from the captured images, so that a team can be sent to a specific location for maintenance if needed.
This paper aims to assess the ability of LLMs to detect natural objects in aerial images and label these objects according to given categories. To this end, we propose an approach that provides an LLM with real-world aerial imagery and asks it to provide detailed information about detected objects for the provision of assistance services, such as territorial monitoring. The extracted information concerns large uncultivated lands, private properties (e.g., land plots) or infrastructure in the vicinity of some residence (e.g., roads and adjacent spaces). The main functionalities of remote sensing in this context include (i) detection of objects present in the area; (ii) assessment of visibility conditions; and (iii) position of objects in the area. In a real-world use case, the assistance service is supported by a network of drones responsible for acquiring images in real time to monitor the territory. For our study, we used Llama-4 since it offers state-of-the-art capabilities and is the latest version available for the Llama series (it was released in April 2025). We have asked Llama-4 to label aerial images that predominantly have green areas, such as orchards and secondary roads.
The main contribution of this paper is a thorough evaluation of an available off-the-shelf LLM, i.e., Llama-4 Maverick, against the problem of identifying the categories of vegetation and distinguishing it from uncultivated areas or roads from aerial RGB images. For this to be achieved, the following steps were needed, whose outcomes are by themselves useful for further studies: (i) the composition and evaluation of a new dataset, together with the results of the analysis provided by Llama-4; (ii) a benchmark consisting of a complex set of labels for images, given that the annotated objects involve intricate land-cover categories, which were not explored in previous studies; (iii) a set of captioning labels for the above image set; and (iv) a comprehensive analysis that aims at object recognition, counting objects, and spatial position analysis of objects in aerial images.
Llama-4 delivered encouraging results in a coarse analysis. However, it exhibited marked limitations in (i) distinguishing similar terrain types, (ii) pinpointing object locations, and (iii) accurately counting object occurrences. Although it was somewhat reliable for a preliminary assessment, many results were inconsistent and often incorrect when finer-grained details were expected. This finding underscores the significant shortcomings that Llama-4 still faces in detailed remote-sensing tasks.
The remainder of the paper is structured as follows: Section 2 discusses related work. Section 3 shows a practical usage scenario. Section 4 presents the dataset, the labeling of images via Roboflow, and the coloring of the areas. Section 5 analyzes the dataset using Llama-4 Maverick. Section 6 illustrates the results achieved. Section 7 describes the strengths and weaknesses of the model. Section 8 draws our conclusions.

2. Related Works

Recently, multimodal large language models (MLLMs) such as Llama [15], Large Language and Vision Assistant (LLaVA) [16], Otter [17], and PandaGPT [18] have shown strong capabilities in processing and interpreting images via structured text-based queries, achieving considerable success in general-purpose visual language tasks. These models analyze terrestrial images, i.e., images acquired by cameras or mobile phones, while the analysis of aerial (or satellite) images represents a decidedly more complex and distinct area.
Satellite image analysis has been extensively studied [19,20,21,22,23,24,25,26], yet it remains an active field of research due to ongoing challenges and newly emerging requirements. This area demonstrates significant potential in multiple application domains, including traffic monitoring and management [21], land use and land cover change detection [23], and real-time disaster identification and response [24,25]. However, readily available satellite image collections focus on urban areas and are typically used to analyze, e.g., vehicles or houses [27,28,29], hence the vegetation analysis in images is minimal. As a result, the usefulness of satellite images is limited to applications where it is crucial to detect traffic jams or the occurrence of fire, as none of the available datasets focuses on green areas or aims to distinguish between categories of trees or other types of vegetation.
Some datasets oriented toward aerial imagery have been previously curated and include green areas such as grasslands, trees, or agricultural fields; however, the corresponding research studies pursue different goals, and these areas are not associated with predefined categories. Two such extensive datasets, i.e., xView [12] and DIOR [13], have labeled man-made objects, and their authors aimed to build and train a deep neural network model to detect such objects. In other experiments, the ForestNet dataset was proposed to detect forest loss, and for this goal, aerial images were annotated accordingly [14]. Unlike such previous experiments, our goal is to detect natural objects from aerial images and evaluate an out-of-the-box LLM for such a task. Given that the existing datasets were not appropriate for our experiments, we have built an ad hoc dataset of images.
The first dataset that was created to focus on satellite images of green areas was described in [30]. The authors accumulated a set of images centered on green zones and trained an artificial intelligence algorithm to recognize objects in complex imagery. This demonstrated that a well-trained AI system can perform effectively even on multi-label, complex, real-world images; however, it requires an extensive and detailed image collection and annotation process. This dataset was partially used in our work, and an additional coloring step was applied to the images to validate our data.
One major limitation of current LLMs is their inability to reason directly over a visual input, which often leads to inaccurate results. As discussed, e.g., by Du et al. [8], LLMs struggle to learn effectively from data involving forest imagery. A core challenge when processing a visual input is object detection. Although modern detectors perform well on conventional remote-sensing benchmarks, their reliance on a fixed set of predefined classes limits their applicability in real-world settings. For example, a model trained to detect aircraft may be unsuitable for recognizing ground vehicles [11]. Ju et al. found that while vision-oriented language models can identify single objects with high accuracy, their performance drops sharply in multi-object scenarios, with accuracy falling from circa 100% to below 15% in complex scenes [31]. This is a critical limitation in real-world image analysis, where scenes often contain a wide variety of objects (i.e., multi-label imagery).
Another limitation is related to the generation of semantically rich vision-based explanations. He et al. [9] introduced an automatic captioning pipeline that guides LLMs to describe object annotations in remote-sensing images, hence showing the value of captions for rapid visual interpretation. However, their study was limited to a narrow range of objects (primarily houses and vehicles) and did not address more diverse scene elements. Our work extends this line of research by designing prompts that generate captions for a broader variety of objects and spatial patterns, thus providing a more comprehensive visual summary. Recently, Komurcu et al. [10] evaluated several LLMs for satellite image interpretation tasks, reporting an average accuracy of 43%. Their study underscores the difficulties arising when analyzing multi-label scenes, and it has two key limitations: (i) the results are not broken down by class or scene type, and (ii) their most complex dataset includes images having only 512 × 512 pixels. In contrast, our experiments use 1024 × 1024 pixel imagery, introducing denser scenes with a higher number of labels per image; hence, they more accurately reflect real-world complexity.
As an alternative to large language models (LLMs), pixel-based approaches are commonly employed for the analysis of satellite imagery [32,33]. However, these approaches present two main limitations: (i) they cannot incorporate additional sources of information, such as textual descriptions or contextual metadata, which limits their semantic understanding; and (ii) remote sensing images often exhibit high variability in resolution, scale, and spectral characteristics, which makes it difficult for traditional methods to generalize across different datasets [34]. Still, pixel-based techniques continue to be studied and adopted due to the current challenges and limitations associated with the use of LLMs. According to [35], current LLMs struggle with the spectral data typical of satellite images, while being better suited for RGB aerial images. That work has shown that, when spectral data are available, ad hoc training of a model is necessary to obtain a high level of accuracy. Our work instead focuses on evaluating an out-of-the-box LLM, without further training.
Multimodal LLMs can perceive multiple modalities, including text, images, and audio, and generate free-form textual responses in zero-shot and/or few-shot settings. BLIP-2 employed a trained bridging module while keeping the backbone networks frozen [6], whereas KOSMOS-2 introduced explicit grounding through location tokens and web-scale training with spatial supervision [7]. Both approaches demonstrated that the integration of LLMs enables zero-shot text-driven image understanding, allowing object identification and, in the case of KOSMOS-2, spatial localization. In contrast, our contribution focuses on the evaluation of the accuracy of results; it does not rely on pre-training or task-specific training. Moreover, we test the LLM on aerial imagery, a domain that is often underrepresented during model training, and perform evaluations based on pixel-wise ground truth and spatial tolerance criteria. Then, we provide a detailed analysis of the results, including semantic ambiguities and hallucinations. Overall, our work highlights when an LLM is useful and when it is not, providing an example of its application in real-world scenarios such as land monitoring.

3. A Drone Network for Image Analysis Assistance

The proposed solution can be used when a large collection of aerial images has to be analyzed to detect specific objects, a task for which LLM-based tools, particularly Llama, can be helpful. Once the areas of interest are known (for example, citrus orchards), these can be further documented using drones [36,37], thereby enriching the available information with greater spatial detail and real-time data acquisition. This integration enables accurate territorial documentation that can be shared among users, providing intelligent and adaptive feedback on the monitored regions. In this section, we describe the integration of our approach with a drone network and how such integration allows the system to acquire new data that can be used to generate and disseminate feedback to interested users. Each drone is equipped with a high-resolution camera and an onboard computing unit, i.e., a Raspberry Pi [38]. Dedicated scripts running on the drone store the captured images locally and subsequently upload them to a remote server whenever a WiFi connection becomes available. This design enables each drone to autonomously capture multiple images of the surveyed area and transmit them efficiently once connectivity is restored. The server-side processing pipeline involves querying Llama to analyze uploaded images, identify objects, classify them according to predefined categories, and infer their conditions. The LLM then generates a description and contextual feedback regarding the observed objects or environmental features.
The overall workflow, illustrated in Figure 1, is as follows: (i) satellite images are analyzed through LLM queries to detect areas and objects belonging to the studied categories; (ii) the coordinates of the identified objects and area of interest are transmitted to a drone network, which performs targeted image acquisition; (iii) captured images are stored locally on the drones (in SD cards) and compressed into a single archive; (iv) once the acquisition is complete, drones can return to a base where WiFi connectivity allows images to be transferred to the server using Secure Copy Protocol (SCP) [39]; (v) the server analyzes the images to detect unfavorable or anomalous conditions by re-engaging the LLM (e.g., evaluating citrus orchard health, identifying abandoned or uncultivated areas, assessing the state of roads or wells); and (vi) then, according to the previous result, the server generates information alerts (positive or negative) that are sent to authorized users via email or a notification, including the related image as evidence.
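Steps (iii) and (iv) of the workflow can be illustrated with a minimal sketch that uses only the Python standard library. The function names, the `.jpg` extension, and the remote destination string are assumptions made for illustration; the scripts actually running on the drones are not described in the text. The SCP command is only built here, not executed.

```python
import tarfile
from pathlib import Path


def archive_images(image_dir, archive_path):
    """Pack every .jpg file found in image_dir into one gzip-compressed tar archive."""
    names = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for image in sorted(Path(image_dir).glob("*.jpg")):
            # Store only the file name so the archive contains no absolute paths.
            tar.add(str(image), arcname=image.name)
            names.append(image.name)
    return names


def scp_command(archive_path, remote):
    """Build (without running) the scp invocation for the transfer step.

    `remote` is a hypothetical destination such as "user@server:/data/".
    """
    return ["scp", str(archive_path), remote]
```

Once WiFi connectivity is available, the returned command list could be handed to `subprocess.run`, assuming an `scp` client and key-based authentication are configured on the drone.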
To analyze the captured images in step (v) above, the following prompt was used:
“Analyze this satellite image and describe in detail the condition of the crops present in it. Indicate: which crops are present in each plot; what growth stage or phenological cycle they are in; whether there are visible signs of water or nutritional stress (e.g., yellowing leaves, dry patches); any areas affected by pests or diseases; abandoned or uncultivated areas; traces of recent fires or burns; conditions of access roads to fields; conditions of wells.”
Figure 2 shows an example of a picture given to the LLM, and the following are some results presented by the LLM as a reply to the above prompt:
  • Visible Signs of Stress: “Some plots show signs of stress, such as yellowing or dry leaves, particularly in the lower-left section of the image. This could indicate water or nutritional stress.”
  • Abandoned or Uncultivated Areas: “One plot in the lower-left section appears uncultivated or abandoned, as it lacks the uniform rows of trees observed in other plots.”
  • Access Roads to Fields: “The access roads to the fields are clearly visible, suggesting that they are reasonably maintained.”
  • Conditions of Wells: “A well is visible near the center. A small dark blue or black patch in the bottom-right area may indicate stagnant water.”
It should be noted that such feedback may be positive or negative, depending on the conditions identified in the areas analyzed. Therefore, the overall system shows how the combination of satellite imagery, drone-based data acquisition, and LLM reasoning can enhance the monitoring and management of territorial resources through adaptive, data-driven feedback mechanisms.

4. Dataset Creation and Annotation

A specific dataset was created with the purpose of collecting aerial images showing several categories of vegetation and man-made objects. The dataset was built by extracting images from Google Maps in selected areas where the desired requirements, in terms of categories and varieties, were satisfied. Since the images are well curated, there is no need for denoising blurry images [40]. The created dataset was then used to query the model. The dataset consists of 100 real-world images, a subset of which was taken from the dataset in [30], while the others were obtained directly from Google. Each image was analyzed by domain experts, who annotated man-made objects and natural areas to accurately locate and distinguish each occurrence. The total number of objects in the whole set of images is 2594, and the number of objects in each category is shown in Table 1. The selected images have two key characteristics: (i) they are high-resolution, and (ii) each image contains at least four different objects, so at least four distinct labels were attributed to it. Therefore, each image is multi-label.
The pipeline consists of the following steps: image extraction, image preprocessing, image annotation, and coloring. To begin with, images were extracted from Google Maps. This source was chosen because the available images have a high resolution, useful for visual analysis, and because it gives easy access to images of large areas. The images were taken at zoom level 21 (the maximum possible zoom) to capture as many details as possible. The size of the downloaded images was 10,240 × 7680 pixels, and each pixel corresponds to approximately 0.076 m on the ground. The geographical area of interest was Italy. Secondly, the images were divided into four portions, and each portion was 1024 × 1024 pixels. In this way, we obtained several detailed images, each having a size smaller than 1.5 MB (a limitation imposed by the Llama model used in a subsequent step). Then, each image was encoded in base64 (as needed to send it to Llama-4). Listing 1 shows a Python 3.13 code snippet that converts an image to base64 format.
Listing 1. Python snippet for encoding images in base64.
import base64

def encode_image(image_path):
    # Read the image as raw bytes and return its base64 encoding as a UTF-8 string.
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")
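Since each image has to stay below 1.5 MB, a size check before encoding may be useful. The sketch below is only illustrative: the text does not say whether the limit applies to the raw file or to the base64 payload, nor whether a megabyte is counted as 10^6 or 2^20 bytes, so both readings are assumptions, and the names `MAX_IMAGE_BYTES`, `base64_length`, and `fits_limit` are hypothetical.

```python
import base64

# Assumed reading of the 1.5 MB limit (binary megabytes, applied to the raw file).
MAX_IMAGE_BYTES = int(1.5 * 1024 * 1024)


def base64_length(raw_size):
    """Length of the base64 text for raw_size bytes.

    Base64 emits 4 characters per group of 3 input bytes, padding the last group.
    """
    return 4 * ((raw_size + 2) // 3)


def fits_limit(image_bytes, limit=MAX_IMAGE_BYTES):
    """True if the raw image stays below the assumed size limit."""
    return len(image_bytes) < limit
```

Base64 inflates the payload by about one third, so if the limit turned out to apply to the encoded string, `base64_length(len(image_bytes))` would be the value to compare against it.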
Finally, images were labeled by tracing polygons in some areas to give the contours of each object occurrence by means of Roboflow (https://roboflow.com/ accessed on 19 January 2026). Roboflow [41] is a cloud-based platform designed to simplify the development of computer vision models, especially by streamlining dataset management and model training for several tasks, such as object detection, image classification, and segmentation. It provides support for manual and semi-automatic image annotation, greatly easing the labeling process [30,42].
Roboflow makes it possible to manually select the contours of an object or a natural area and assign it a label. In our experiment, the labeled objects belong to one of eight different categories. Each category was given a color as listed in Table 1. Figure 3 shows an example of several natural areas (on the left), which were labeled by marking their contours. The labels given are citrus groves (red), wells (yellow), roads (purple), land (cyan) and trees (green).
A color-coding scheme was used to assign a specific color to each area based on its corresponding semantic category. Subsequently, for validation purposes, the category of each pixel can be unambiguously determined from its assigned color. This paves the way for the validation of the outputs generated by the Llama model. For example, RGB (128, 0, 128) represents a road, RGB (255, 255, 0) a well, and RGB (173, 216, 230) an olive grove. Figure 3 shows on the right the colored pixels for the identified areas (citrus groves, wells, roads, fields and trees).
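The color-to-category lookup can be sketched as a small dictionary. Only the three RGB values quoted above are included; the remaining categories would have to be filled in from Table 1. The `"background"` fallback for unlabeled pixels and the function name are assumptions made for illustration.

```python
# RGB values quoted in the text; the other categories come from Table 1.
COLOR_TO_CATEGORY = {
    (128, 0, 128): "road",
    (255, 255, 0): "well",
    (173, 216, 230): "olive grove",
}


def category_of_pixel(rgb):
    """Return the category encoded by an RGB triple, or "background" if unlabeled."""
    return COLOR_TO_CATEGORY.get(tuple(rgb), "background")
```

An exact-match lookup like this presumes that the colored maps are stored losslessly, since lossy compression would shift pixel values away from the reference colors.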

5. Dataset Processing by Means of Llama-4

For the purpose of identifying and locating objects in aerial images, we queried Llama-4 Maverick, a mixture-of-experts model with 17 billion active parameters and 128 experts (as reflected in the model identifier used in Listing 2), which offers industry-leading performance for multimodal tasks like image recognition. The results were analyzed to determine how Llama-4 Maverick performs. Figure 4 illustrates the workflow adopted in this study. For each image, the model was tasked with the identification of each distinct area by means of a label (or a short caption) and the coordinates of the bounded area. The output was then compared with the color-coded image that represents the ground truth, and this step let us determine the precision of the model.
To run Llama-4, we used the Groq service, which is a popular model provider that has pioneered the fastest way to run open-source models (https://groq.com/llama-4-now-live-on-groq-build-fast-at-the-lowest-cost-without-compromise/ accessed on 19 January 2026). When running a model in Groq, the replies from the chosen model are nearly instantaneous, hence very convenient for our analysis. Listing 2 shows a Python code snippet to query Llama-4 using Groq when giving it an image to analyze.
Listing 2. Python snippet to query an LLM given an image as input.
from groq import Groq

# API_KEY holds the Groq API key; prompt and base64_image are defined beforehand.
client = Groq(api_key=API_KEY)
chat_completion = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {
                 "url": f"data:image/jpeg;base64,{base64_image}",
             },
            },
        ],
    }],
    model="meta-llama/llama-4-maverick-17b-128e-instruct",
)
# The reply text (expected to be CSV) is in the first choice.
response = chat_completion.choices[0].message.content
The said model was run and queried using the Groq cloud provider, hence using Groq APIs. The default settings were applied, i.e., the sampling temperature was set to 1.0 and the nucleus sampling parameter (top_p) to 1.0, according to the provider’s documentation. This choice avoids forcing some outputs and allows the model to generate responses with natural variability, without restricting the probability mass of candidate tokens.
The prompt given to the model consisted of the following text:
“Detect all objects and areas in the image, including citrus groves, olive groves, houses, roads, wells, meadows, fields, and trees. For each object, specify its type and the coordinates of its center in normalized format, expressed as [x_center, y_center], where both values range from 0 to 1. Output a CSV file only, without any explanation, markdown, or code.”
The construction of the prompt was carried out iteratively, by means of exploratory tests conducted on a small subset of images, with the objective of identifying a formulation that was reproducible and suitable for subsequent quantitative analysis. In an initial phase, the prompt consisted of asking the LLM to describe the image and identify the presence of some categories of interest (citrus groves, olive groves, roads, wells, etc.). Although this formulation produced semantically rich descriptions, the output was unstructured and difficult to analyze automatically.
Then, the prompt was progressively refined by introducing an explicit request for spatial localization of objects and an output in CSV format, without explanations or markup, to reduce lexical variability, simplify the automatic extraction of results, and make category normalization more robust. In particular, the model was asked to provide, for each recognized object or area, an approximate position in the form of normalized coordinates [x_center, y_center] relative to the provided image, with values ranging between 0 and 1, a choice consistent with the need to compare predictions with reference annotations.
Other tests were performed to ask for the generation of bounding boxes for the detected objects; however, a preliminary check showed that the coordinates of the object center were closer to the target object (hence more accurate) than the bounding boxes. Since the objective of the analysis was to check the presence of at least one point within the correct area, the coordinates proved to be more consistent with the adopted evaluation protocol. The final prompt adopted, therefore, requested the model to identify all objects or areas belonging to a closed set of categories of interest and to return, for each instance, the category name and the center coordinates in normalized format in a CSV file (the final prompt is shown in the box above).
Llama-4 Maverick answered our requests and created a CSV file for each image. The file consists of data having the format [label, x, y], where label equals the detected category and x and y correspond to a point in the area or object. Using Maverick, the resulting CSV file was well structured and defined. Upon analyzing the results, we observed that the model frequently responded using synonyms for identified categories or slight variations in the expression (e.g., singular vs. plural forms). For this reason, we performed a further step consisting of normalizing the output by grouping several equivalent text expressions, enabling a consistent analysis of the results.
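Reading such a reply can be sketched as follows. This is a minimal example that assumes each row has exactly three fields (label, x, y) and that a header row may or may not be present; the function name and the decision to silently drop malformed or out-of-range rows are illustrative choices, not the procedure used in the study.

```python
import csv
import io


def parse_detections(csv_text):
    """Parse 'label,x,y' rows, skipping headers, malformed rows, and out-of-range points."""
    detections = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue  # blank line or unexpected number of fields
        label = row[0].strip()
        try:
            x, y = float(row[1]), float(row[2])
        except ValueError:
            continue  # header row or non-numeric coordinates
        # Normalized coordinates are expected to lie between 0 and 1.
        if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
            detections.append((label, x, y))
    return detections
```

For instance, `parse_detections("road,0.5,0.25")` yields a single tuple with the label and the two normalized coordinates as floats.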
Table 2 shows the category normalization schema adopted, where for each category we give its recognized text expression variants. The macro category “groves or tree” includes the three categories that were initially assigned to images with labels using Roboflow: citrus grove, olive grove, and tree. This grouping was necessary due to the model’s tendency to use these terms interchangeably, without distinguishing between them; hence, the output precision observed was lower than expected. Figure 5 shows an example in which the model correctly identified a road and two houses; however, it did not find a citrus grove, but only trees.
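The normalization step can be sketched as a lookup from text variants to canonical categories. The variant sets below are illustrative placeholders (the actual variants are those listed in Table 2), and the singularization rule is deliberately naive; only the grouping of citrus grove, olive grove, and tree into "groves or tree" is taken from the text.

```python
# Illustrative variants only; the real lists are those of Table 2.
VARIANTS = {
    "groves or tree": {"citrus grove", "olive grove", "tree", "grove", "orchard"},
    "road": {"road", "street"},
    "house": {"house", "building"},
}

# Reverse index: variant text -> canonical category.
LOOKUP = {variant: category for category, names in VARIANTS.items() for variant in names}


def normalize_label(label):
    """Map a raw model label to its canonical category (or return it cleaned)."""
    key = " ".join(label.lower().split())  # lowercase and collapse whitespace
    if key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]  # naive singularization: "roads" -> "road"
    return LOOKUP.get(key, key)
```

Labels that match no variant are returned in cleaned form, so they can be inspected and added to the table if needed.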
The model outputs were systematically analyzed to assess the validity of the results. For each response to the prompt, each detected object (or area) was analyzed. An object (or region) was deemed correctly detected if its spatial coordinates fell within an area of the segmentation map labeled with the color corresponding to the object’s assigned category. For example, suppose that the model has detected a house at coordinates (x, y). For the result to be considered correct, these coordinates should fall within a blue region of the reference segmentation map (generated via Roboflow), where blue corresponds to the category house in the predefined color-to-class mapping. This evaluation step is crucial to quantitatively and qualitatively assess the model’s performance.
After checking the category and position of each detected object by comparing it with the color map of the ground truth, a script was developed to visualize the results by drawing on each image a circle and a label that identify each detected object. Each circle was centered at the coordinates (x, y) suggested by the model, with a radius ranging from 0 to 200 pixels (see Section 6 for details). The circle was colored green if the category was detected correctly and red otherwise.
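The exact-match check described above can be sketched in a few lines. The segmentation map is represented here as a nested list of RGB tuples, and the specific blue value (0, 0, 255) for houses is an assumption, since the text only states that houses are marked blue; the function names are hypothetical.

```python
def to_pixel(x_norm, y_norm, width, height):
    """Convert normalized [0, 1] coordinates to (row, col), clamped to the image."""
    col = min(int(x_norm * width), width - 1)
    row = min(int(y_norm * height), height - 1)
    return row, col


def is_correct(label, x_norm, y_norm, seg_map, color_to_category):
    """True if the ground-truth color under the predicted point encodes `label`."""
    height, width = len(seg_map), len(seg_map[0])
    row, col = to_pixel(x_norm, y_norm, width, height)
    return color_to_category.get(seg_map[row][col]) == label


# Tiny 2 x 2 example map; the blue RGB value is assumed for illustration.
BLUE, PURPLE = (0, 0, 255), (128, 0, 128)
EXAMPLE_COLORS = {BLUE: "house", PURPLE: "road"}
EXAMPLE_MAP = [[BLUE, PURPLE], [PURPLE, PURPLE]]
```

With this example map, a house predicted near the top-left corner is accepted, while the same label predicted in any other quadrant is rejected because the underlying pixel is purple.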

6. Results

For our tests, 100 images, which were manually labeled for validation purposes, were given to Llama-4 Maverick with the task of finding each object occurrence. The images consist of aerial scenarios that have green areas, and typically they are cultivated landscapes. For each image, the set of objects found by prompt-based queries (see Section 5) was assessed according to the correspondence with the color-based annotations (see Section 4).
Initial tests revealed that the predicted points were not always precisely located within the target objects but were sometimes positioned near their boundaries. To address this, multiple validation strategies were devised to provide a more accurate assessment of the model behavior: three different validation tests were performed, each with a varying tolerance threshold for spatial accuracy, reflecting the fact that coordinates are often approximate rather than exact. The tolerance threshold ranges from 0 to 200 pixels; beyond this threshold, which is approximately 15 m, the detected points were considered incorrect.
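The relation between pixel tolerances and ground distances follows from the resolution of about 0.076 m per pixel reported in Section 4. The short sketch below makes the arithmetic explicit; the constant and function names are chosen here for illustration.

```python
METERS_PER_PIXEL = 0.076  # approximate ground resolution reported in Section 4


def pixels_to_meters(pixels):
    """Ground distance covered by a given number of pixels."""
    return pixels * METERS_PER_PIXEL


def meters_to_pixels(meters):
    """Number of pixels (rounded) spanning a given ground distance."""
    return round(meters / METERS_PER_PIXEL)
```

Under this resolution, 130 pixels correspond to about 9.9 m and 200 pixels to about 15.2 m, so the 10 m and 15 m figures are rounded values; a full 1024-pixel tile side spans roughly 78 m.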
The three evaluation scenarios are defined as follows:
1. The 0-pixel tolerance point verification: the coordinates of the object found by the model lie within the area indicated by the ground truth. That is, the color of the pixel at those coordinates corresponds to the suggested category. This is the strictest case, as it demands the most precise localization.
2. The 130-pixel tolerance point verification: the point detected by the model is close to the reference category area (not within it). The distance between the suggested coordinates and the labeled object is at most approximately 10 m, which corresponds to 130 pixels. This case is less precise, but still considered valid.
3. The 200-pixel tolerance point verification: the point detected by the model is up to approximately 15 m away from the object indicated in the ground truth.
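The three checks can be sketched as a single function parameterized by the tolerance radius. This is a minimal illustration and not the code used in the study: it assumes that the color-coded ground truth is available as a NumPy array of RGB values, and the function name, the argument names, the toy map and the blue value (0, 0, 255) chosen for the house category are all hypothetical.

```python
import numpy as np


def point_matches(color_map, x, y, color, tolerance):
    """Return True if a pixel of the given color lies within `tolerance`
    pixels (Euclidean distance) of the point (x, y).

    color_map: H x W x 3 array holding the color-coded ground truth.
    With tolerance 0, only the pixel at (x, y) itself is inspected.
    """
    height, width, _ = color_map.shape
    # Clip the search window to the image borders.
    y0, y1 = max(0, y - tolerance), min(height, y + tolerance + 1)
    x0, x1 = max(0, x - tolerance), min(width, x + tolerance + 1)
    window = color_map[y0:y1, x0:x1]
    # Circular mask centered on the predicted point.
    yy, xx = np.ogrid[y0:y1, x0:x1]
    inside = (xx - x) ** 2 + (yy - y) ** 2 <= tolerance ** 2
    same_color = np.all(window == np.asarray(color), axis=-1)
    return bool(np.any(inside & same_color))


# Toy ground truth: one blue (assumed "house") square on a black background.
HOUSE_COLOR = (0, 0, 255)
toy_map = np.zeros((400, 400, 3), dtype=np.uint8)
toy_map[100:150, 100:150] = HOUSE_COLOR

# A predicted point about 151 pixels to the right of the square.
results = {t: point_matches(toy_map, 300, 120, HOUSE_COLOR, t)
           for t in (0, 130, 200)}
```

In this toy case the prediction is rejected at 0 and 130 pixels and accepted only at 200 pixels, which mirrors how a single approximate coordinate can change status as the tolerance grows.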
Across the three tolerance settings (0, 130, and 200 pixels), the confusion matrices (see Figure 6 and Figure 7) provide a consistent view of the model’s behavior under increasing spatial slack. As the tolerance radius increases, the values on the diagonal grow, indicating that a larger fraction of predictions falls within (or close to) the correct ground-truth regions, while the background column decreases, meaning fewer predictions are completely unrelated to any labeled area. Conversely, the background row (false negatives) remains substantial even when the tolerance is set at 200 pixels, highlighting that many annotated regions are not covered by any predicted point within the allowed radius. Overall, these trends confirm that the main limitation is imprecise spatial localization (the LLM gives one coordinate for the identified object, not its area).
To interpret the numbers in the results, note that the LLM may return coordinates and labels several times for a large area that the ground truth labels as a single object (see, e.g., Figure 8, second row, right column, where the label groves or trees is given several times for a large area). Under this condition, the number of true positives (TPs), given by the diagonal of the confusion matrix, does not equal the number of ground-truth polygons. Multiple LLM predictions may fall within the same ground-truth polygon and are all counted as correct detections if they satisfy the spatial and semantic criteria. This choice reflects the objective of our study, which focuses on assessing whether the model can identify the presence and approximate location of relevant objects or areas, rather than performing precise instance-level detection.
To evaluate the accuracy of the model, the coordinates of the detected objects were validated against the previously labeled and color-annotated images. This analysis generates visual outputs in which detected objects or areas are highlighted by circles with colored borders. These circles, which include the name of the detected category, serve as visual captions for quick interpretation of the results. A red circle marks a detection that does not match the area previously labeled using Roboflow, so the point is considered incorrectly detected; a green circle marks a detection that does match the labeled area, so the point is considered correctly detected.
Figure 8 presents some results to qualitatively illustrate the behavior of the model in different cases. Each image shown is only a portion of the whole input image, chosen to highlight the meaningful detail. The tolerance applied is the maximum allowed, which is 200 pixels (the third scenario above). The first image (top left) shows regions labeled groves and trees that were detected correctly (three occurrences), while the categories roads and meadow were not accurately identified. The second image (top right) shows the labels groves and trees and wells, which were not correctly associated. Note that the wells category is not actually present in the image.
The third image (second row on the left) exhibits a high error rate: while a field object was identified correctly, there are four false positives, i.e., objects that are not present in the image: the three houses and the road detected by the model do not actually appear and would remain incorrect even under a relaxed spatial tolerance. This phenomenon is known as “hallucination” [43,44], which refers to the erroneous detection of objects that are not present in the input image. The fourth image (second row on the right) shows some objects that were detected correctly, such as groves or house; however, three road occurrences were incorrectly detected. This result leads to two notable observations: (i) there is only a single road in the image, yet the model identifies three of them, resulting in an overestimation of objects; (ii) all three predicted roads are located at an unreasonable distance from the actual road. In fact, despite the evaluation circle having a radius of 200 pixels, none of the predicted positions intersect the boundary of the real road.
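The counting convention described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function name, the input format and the example labels are hypothetical. Each prediction is tallied individually, so several predictions that fall inside the same ground-truth polygon each add one to the diagonal.

```python
from collections import Counter

BACKGROUND = "background"


def confusion_counts(predictions, missed_truth):
    """Tally (predicted class, ground-truth class) pairs.

    predictions: list of (predicted_label, matched_truth_label) pairs, where
        matched_truth_label is None when no labeled region lies within the
        tolerance radius of the predicted point.
    missed_truth: ground-truth labels of regions that no prediction covered.
    """
    counts = Counter()
    for predicted, truth in predictions:
        # Predictions outside any labeled region go to the background column.
        counts[(predicted, BACKGROUND if truth is None else truth)] += 1
    for truth in missed_truth:
        # Uncovered ground-truth regions go to the background row.
        counts[(BACKGROUND, truth)] += 1
    return counts


# Hypothetical example: two predictions land in the same grove polygon.
example = confusion_counts(
    predictions=[
        ("groves or trees", "groves or trees"),
        ("groves or trees", "groves or trees"),
        ("roads", None),
        ("houses", "fields"),
    ],
    missed_truth=["wells"],
)
```

With this convention the diagonal entry for groves or trees is 2 even though only one polygon is involved, which is why the diagonal need not equal the number of ground-truth polygons.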
The fifth image (third row on the left) shows that almost all the objects were detected correctly, yielding good results from the model. For the sixth image (third row on the right), the same terrain category was identified as two different objects, namely groves or trees and fields, although the visual inspection shows a uniform scene for all of them.
The seventh and eighth images (the last row) show similar examples of detection with many object occurrences that were identified correctly.
Based on the validation procedure described, precision, recall and accuracy metrics were computed for each test scenario. Figure 9 shows the metric values and provides a quantitative overview of the model performance. The first scenario, with a distance tolerance threshold of 0 pixels, exhibits notably the lowest precision, recall and accuracy. In contrast, the third scenario achieves relatively high precision (approaching 80%), while recall (56.1%) and accuracy (58.6%) remain limited, though (as expected) they obtain the highest values among the three tolerance levels. The observed trend in the metrics indicates that the model is generally capable of detecting a substantial number of objects correctly within each image; however, as previously discussed, the predicted positions tend to be spatially inaccurate.
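For reference, precision and recall can be computed from true positive, false positive and false negative counts with the standard definitions, as in the minimal sketch below. The function name and the example counts are hypothetical and are not taken from the study; accuracy is left out because its exact formula is not restated in this section.

```python
def precision_recall(tp, fp, fn):
    """Standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN).

    Returns 0.0 for a metric whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, chosen only to illustrate the calculation.
example_precision, example_recall = precision_recall(tp=8, fp=2, fn=8)
```

A model that is right most of the time when it does report an object, but misses many annotated objects, shows exactly this pattern of high precision and lower recall.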
Figure 10 presents three heatmaps that show the precision, recall and accuracy metrics for each object category, evaluated in the three test scenarios. The precision values show that, in some cases, the increase in precision across tolerance levels is comparatively small; e.g., the meadow category shows only a 55% improvement from tolerance 0 to tolerance 200. In contrast, the road category exhibits a substantial precision gain of 357.14%. This highlights the low ability of the LLM to provide accurate coordinates of an object relative to the origin of the given image. Overall, the highest precision is achieved in the “groves or trees” macro-category, which includes multiple object types (e.g., olive groves, citrus groves, and trees). These objects have less sharply defined contours than the others, and this characteristic may have contributed to the resulting high values.
Figure 10 (top-right) shows the recall values, which were computed for the same categories and test scenarios. The values improved as the tolerance increased; e.g., in the road category, recall improved from 18.0% to 60.5%, representing a 236.11% increase. However, in the third scenario, recall across all categories is between 49.4% and 60.5%, indicating a consistent but limited ability.
Figure 10 (bottom) shows the accuracy values. As with the other metrics, the values improve when the tolerance is higher. Moreover, the categories with the highest values (groves and houses) are the ones for which precision was highest. Accuracy is expected to better convey the ability of the LLM to recognize true values. While for the groves category the accuracy reaches 70.6%, which can be considered quite good, the second-best value is just 41.4% for the houses category, which is not acceptable.
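The percentage gains quoted for precision and recall appear to be relative changes with respect to the 0-pixel baseline. A minimal sketch of that calculation follows; the function name and the example values are hypothetical and are not read from the heatmaps.

```python
def relative_gain(baseline, value):
    """Relative change from `baseline` to `value`, in percent."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (value - baseline) / baseline * 100.0


# Hypothetical metric values (in percent) at tolerance 0 and tolerance 200.
example_gain = relative_gain(20.0, 50.0)
```

Because the baseline sits in the denominator, a category that starts from a very low value at tolerance 0 can show a gain of several hundred percent while its final value remains modest.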

7. Discussion

Although multimodal LLMs such as Llama-4 have shown promising capabilities in analyzing satellite images, several limitations persist that reduce their reliability in real-world greenfield settings. The main critical issues are discussed below.
Undetected objects: one of the most common limitations is the inability to identify objects present in images. This problem occurs with objects partially obscured by other elements, e.g., roads, or objects that have unusual colors or shapes. As Figure 11 shows, the model did not label some objects, such as a well, a house, and a plot of land. Similarly, Figure 8 (bottom-right image) shows that the model did not label the road. Such situations lead to a significant loss of information and suggest a reduced sensitivity of the model. This is confirmed by the recall and accuracy values, which are remarkably low (shown in the diagrams in Figure 9 and Figure 10).
Inaccuracy in spatial localization (coordinate error): Llama-4 gave an imprecise location for many identified objects, with discrepancies compared to their actual coordinates in the image. These errors arise from the difficulty of the model in correctly associating visual objects with their positions. This is confirmed by the precision values, which improved as the distance tolerance increased. Figure 12 shows an image marked using the two distance-tolerance values (130 and 200 pixels), and highlights that the wells are correctly detected only in the second case, because the coordinates given by the model for the wells are not close enough to their actual positions.
Ambiguities in area classification: frequent classification errors were observed between areas with similar visual characteristics. For example, the model confused trees with agricultural land, or olive groves with citrus groves. These semantic ambiguities indicate a limited ability of the model to understand the environmental context.
Indeterminacy in object counting: another critical issue is the inability of the model to accurately determine the number of objects present in a given area. Llama-4 tends to overestimate or underestimate the actual number, compromising quantitative analysis. For example, the fourth image in Figure 8 shows that four fields were detected; however, there are actually only two large fields. In another image, the last in Figure 8, only two fields were identified by the model, whereas at least five fields are visible. The two images at the top of Figure 13 illustrate examples of incorrect counts. In the first case, only one-third of the houses are identified (underestimation), while in the second case, a single grove is mistaken for four separate ones (overestimation).
Phenomenon of hallucination: hallucination in satellite imagery refers to the behavior of artificial intelligence systems that detect, describe, or invent objects, structures, or patterns that are not actually present in the analyzed satellite image. During the experiments, several images were labeled with nonexistent objects. For example, for the image in the second row and the left column in Figure 8, houses and roads were detected, even though they are not present in any part of the image. Moreover, the two images at the bottom of Figure 13 illustrate examples of hallucinatory phenomena. In the first case, a well is detected, and in the second, a field is identified; in both instances, these objects are absent from the indicated position and from everywhere else in the image. These errors arise from incorrect spatial relationships or from confusion caused by noise, which leads the LLM to produce erroneous semantic interpretations of the image. It was observed that, out of 100 images, approximately 20% exhibit hallucination phenomena. The most frequently hallucinated object category is wells. To mitigate this phenomenon, a filter could be put into place that, taking into account the known characteristics of the area, removes from the LLM’s answer the objects that cannot be in it.
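The mitigation filter suggested above could take a form like the following minimal sketch. It is not part of the study: the function name, the detection format and the set of categories assumed plausible for an area are all hypothetical, and in practice the allowed set would have to come from prior knowledge of the surveyed area.

```python
def filter_implausible(detections, allowed_categories):
    """Split detections into those whose category is plausible for the area
    and those that are discarded as likely hallucinations.

    detections: list of dicts with at least a "category" key.
    allowed_categories: set of category names known to occur in the area.
    """
    kept, removed = [], []
    for detection in detections:
        if detection["category"] in allowed_categories:
            kept.append(detection)
        else:
            removed.append(detection)
    return kept, removed


# Hypothetical LLM answer for an area assumed to contain no wells or houses.
answer = [
    {"category": "fields", "x": 410, "y": 220},
    {"category": "wells", "x": 95, "y": 640},
    {"category": "houses", "x": 700, "y": 130},
]
kept, removed = filter_implausible(answer, {"fields", "roads", "groves or trees"})
```

Such a filter can only remove categories that are impossible for the whole area; it cannot catch a hallucinated object whose category does occur elsewhere in the same area.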

8. Conclusions

The solution presented here aims to analyze aerial images and possibly deploy a set of drones to collect better images of an area of interest. Automation in image analysis can be achieved by resorting to an LLM. This study presented a detailed analysis of Llama-4 Maverick when asked to detect objects in aerial images, with a particular focus on green areas. The images, sourced from Google Maps, depict real-world scenarios and, due to their large size, often contain multiple object categories within a single frame. Although the dataset used in this work consisted of only 100 images, a relatively small sample, it contains 2594 objects and proved sufficient for an initial low-cost evaluation of the model. In general, it can be considered representative of a larger set of real-world aerial images.
It is remarkable that an LLM, trained on a large corpus and useful for so many different tasks, can be applied to a field as specific as the detection of objects in aerial images with somewhat useful results. More specifically, the results highlighted that several challenges need to be overcome to accurately identify objects and regions, and that some issues are related to the incorrect location of detected targets. Furthermore, the findings underscore key weaknesses of Llama-4 Maverick, which most likely did not receive sufficient training on such aerial images and on the objects we are interested in. Among the limitations revealed, the suggestion of objects not present in the image is particularly noteworthy. This limitation could be mitigated by a filter assessing whether the named object is typical of the examined area.

Author Contributions

Conceptualization, E.S. and E.T.; methodology, E.S., E.T. and G.V.; software, S.C., E.S. and G.V.; validation, S.C., E.S., E.T. and G.V.; data curation, S.C. and E.S.; writing—original draft preparation, E.S. and G.V.; writing—review and editing, E.T.; visualization, S.C., E.S. and G.V.; supervision, E.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data derived from public domain resources.

Acknowledgments

We acknowledge the support of the University of Catania PIACERI project TEAMS, PNRR project CN-HPC, Big Data and Quantum Computing, Spoke 2 Fundamental Research and Space Economy, and Innovation Grant Agri@Intesa.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yao, Y.; Duan, J.; Xu, K.; Cai, Y.; Sun, Z.; Zhang, Y. A survey on large language model (LLM) security and privacy: The Good, The Bad, and The Ugly. High-Confid. Comput. 2024, 4, 100211.
  2. Giannilias, T.; Papadakis, A.; Nikolaou, N.; Zahariadis, T. Classification of Hacker’s Posts Based on Zero-Shot, Few-Shot, and Fine-Tuned LLMs in Environments with Constrained Resources. Future Internet 2025, 17, 207.
  3. Nam, D.; Macvean, A.; Hellendoorn, V.; Vasilescu, B.; Myers, B. Using an llm to help with code understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024; pp. 1–13.
  4. Xu, S.; Wu, Z.; Zhao, H.; Shu, P.; Liu, Z.; Liao, W.; Li, S.; Sikora, A.; Liu, T.; Li, X. Reasoning before comparison: LLM-enhanced semantic similarity metrics for domain specialized text analysis. arXiv 2024, arXiv:2402.11398.
  5. Wangsa, K.; Karim, S.; Gide, E.; Elkhodr, M. A Systematic Review and Comprehensive Analysis of Pioneering AI Chatbot Models from Education to Healthcare: ChatGPT, Bard, Llama, Ernie and Grok. Future Internet 2024, 16, 219.
  6. Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742.
  7. Peng, Z.; Wang, W.; Dong, L.; Hao, Y.; Huang, S.; Ma, S.; Wei, F. Kosmos-2: Grounding multimodal large language models to the world. arXiv 2023, arXiv:2306.14824.
  8. Du, S.; Tang, S.; Wang, W.; Li, X.; Guo, R. Tree-gpt: Modular large language model expert system for forest remote sensing image understanding and interactive analysis. arXiv 2023, arXiv:2310.04698.
  9. He, Y.; Sun, Q. Towards automatic satellite images captions generation using large language models. arXiv 2023, arXiv:2310.11392.
  10. Kömürcü, K.; Petkevičius, L. MiniCPM-V LLaMA Model for Image Recognition: A Case Study on Satellite Datasets. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 7892–7903.
  11. Xie, J.; Wang, G.; Zhang, T.; Sun, Y.; Chen, H.; Zhuang, Y.; Li, J. LLaMA-Unidetector: A LLaMA-Based Universal Framework for Open-Vocabulary Object Detection in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4409318.
  12. Lam, D.; Kuzma, R.; McGee, K.; Dooley, S.; Laielli, M.; Klaric, M.; Bulatov, Y.; McCord, B. xView: Objects in context in overhead imagery. arXiv 2018, arXiv:1802.07856.
  13. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307.
  14. Irvin, J.; Sheng, H.; Ramachandran, N.; Johnson-Yu, S.; Zhou, S.; Story, K.; Rustowicz, R.; Elsworth, C.; Austin, K.; Ng, A.Y. Forestnet: Classifying drivers of deforestation in indonesia using deep learning on satellite imagery. arXiv 2020, arXiv:2011.05479.
  15. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971.
  16. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916.
  17. Li, B.; Zhang, Y.; Chen, L.; Wang, J.; Pu, F.; Cahyono, J.A.; Yang, J.; Li, C.; Liu, Z. Otter: A multi-modal model with in-context instruction tuning. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 7543–7557.
  18. Su, Y.; Lan, T.; Li, H.; Xu, J.; Wang, Y.; Cai, D. Pandagpt: One model to instruction-follow them all. arXiv 2023, arXiv:2305.16355.
  19. Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177.
  20. Miller, L.; Pelletier, C.; Webb, G.I. Deep learning for satellite image time-series analysis: A review. IEEE Geosci. Remote Sens. Mag. 2024, 12, 81–124.
  21. Sheehan, A.; Beddows, A.; Green, D.C.; Beevers, S. City scale traffic monitoring using worldview satellite imagery and deep learning: A case study of Barcelona. Remote Sens. 2023, 15, 5709.
  22. Adegun, A.A.; Viriri, S.; Tapamo, J.R. Review of deep learning methods for remote sensing satellite images classification: Experimental survey and comparative analysis. J. Big Data 2023, 10, 93.
  23. Li, Y. Multi-temporal analysis of land use change using GIS and satellite imagery: Implications for sustainable urban planning. Adv. Eng. Innov. 2025, 15, 21–25.
  24. Ou, R.; Yan, H.; Wu, M.; Zhang, C. A method of efficient synthesizing post-disaster remote sensing image with diffusion model and LLM. In Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Taipei, Taiwan, 31 October–3 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1549–1555.
  25. Osco, L.P.; Lemos, E.L.d.; Gonçalves, W.N.; Ramos, A.P.M.; Marcato Junior, J. The potential of visual chatgpt for remote sensing. Remote Sens. 2023, 15, 3232.
  26. Kang, Y.; Zheng, B.; Shen, W. Research on Oriented Object Detection in Aerial Images Based on Architecture Search with Decoupled Detection Heads. Appl. Sci. 2025, 15, 8370.
  27. Atik, M.E.; Duran, Z.; Özgünlük, R. Comparison of YOLO versions for object detection from aerial images. Int. J. Environ. Geoinform. 2022, 9, 87–93.
  28. Hu, M.; Li, Z.; Yu, J.; Wan, X.; Tan, H.; Lin, Z. Efficient-lightweight yolo: Improving small object detection in yolo for aerial images. Sensors 2023, 23, 6423.
  29. Marletta, D.; Midolo, A.; Tramontana, E. Detecting Photovoltaic Panels in Aerial Images by Means of Characterising Colours. Technologies 2023, 11, 174.
  30. Calcagno, S.; Scaletta, E.; Tramontana, E.; Verga, G. YOLO-based Recognition of some Crop Categories from Real-World Aerial Images. In Proceedings of the International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), Catania, Italy, 23–25 July 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–5.
  31. Ju, H.; Park, I.; Nalcakan, Y.; Jin, Y.; Yeo, S.; Kim, S. Exploring the Limits of Large Language Models’ Ability to Distinguish Between Objects. Appl. Sci. 2025, 15, 4620.
  32. Pritt, M.; Chern, G. Satellite image classification with deep learning. In Proceedings of the Applied Imagery Pattern Recognition Workshop (AIPR), Washington, DC, USA, 10–12 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–7.
  33. Marletta, D.; Midolo, A.; Tramontana, E. Automatic Land Use and Land Cover Classification by Means of Characterising Colours. In Proceedings of the International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), Reggio Emilia, Italy, 26–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 146–151.
  34. Wang, A.; Tian, P.; Wang, S. High resolution satellite imagery segmentation based on adaptively integrated multiple features. In Proceedings of the Automatic Target Recognition and Image Analysis, and Multispectral Image Acquisition (MIPPR), Wuhan, China, 15–17 November 2007; SPIE: Bellingham, WA, USA, 2007; Volume 6786, pp. 812–818.
  35. Hong, D.; Zhang, B.; Li, X.; Li, Y.; Li, C.; Yao, J.; Yokoya, N.; Li, H.; Ghamisi, P.; Jia, X.; et al. SpectralGPT: Spectral Remote Sensing Foundation Model. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5227–5244.
  36. Futerman, S.I.; Cohen, Y.; Laor, Y.; Argaman, E.; Aharon, S.; Eshel, G. Assessing field-scale rill erosion mitigation by cover crops in arable land using drone image analysis. Soil Tillage Res. 2025, 246, 106341.
  37. Shen, N.; Feng, F.; Xu, C.; Li, X.; Chiriacò, M.V.; Lafortezza, R. Drone-based assessment of urban green space structure and cooling capacity. Urban For. Urban Green. 2025, 112, 128953.
  38. Sulaiman, M.N.; Razif, M.R.M.; Hassan, C.A.Z.C.; Mustapha, N.H.M.; Azhar, S.W.S.; Azman, M.N. Wireless Surveillance with Human Detection Using Artificial Intelligence and Drone. J. Telecommun. Electron. Comput. Eng. (JTEC) 2025, 17, 51–58.
  39. Effendi, M.R.; Al-Falah, R.S.; Sarbini; Ismail, N. IoT-Based Battery Monitoring System in Solar Power Plants with Secure Copy Protocol (SCP). In Proceedings of the 2021 7th International Conference on Wireless and Telematics (ICWT), Bandung, Indonesia, 19–20 August 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–4.
  40. Zhang, Q.; Zhu, Y.; Cordeiro, F.R.; Chen, Q. PSSCL: A progressive sample selection framework with contrastive loss designed for noisy labels. Pattern Recognit. 2025, 161, 111284.
  41. Alexandrova, S.; Tatlock, Z.; Cakmak, M. RoboFlow: A flow-based visual programming language for mobile manipulation tasks. In Proceedings of the International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 5537–5544.
  42. Mungkan, P.; Evans, W.K. Development of an Image Processing System for Defect Detection in Nam Dok Mai Golden Mangoes. Eng. Technol. Horiz. 2025, 42, 420208.
  43. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2025, 43, 42.
  44. Bai, Z.; Wang, P.; Xiao, T.; He, T.; Han, Z.; Zhang, Z.; Shou, M.Z. Hallucination of multimodal large language models: A survey. arXiv 2024, arXiv:2404.18930.
Figure 1. The workflow of a drone network used to gather images. Firstly, the initial LLM analysis extracts the coordinates of an area of interest; secondly, coordinates are sent to drones to capture images; thirdly, drones send detailed images to the server; fourthly, LLM analysis finds potential anomalies; lastly, alerts are sent to users.
Figure 2. An example of a detailed picture that could have been captured by a drone.
Figure 3. An image labeled using Roboflow, on the left, and the same image colored according to the category of the identified object, on the right.
Figure 4. Workflow of the experiments consisting of downloading aerial images, annotating images and color coding them, querying an LLM, and comparing the results of the LLM with the annotations.
Figure 5. An example of labels represented on an image according to the object detection performed by Llama-4 Maverick.
Figure 6. Confusion matrix for the 200-pixel spatial tolerance threshold. Rows represent predicted classes, while columns represent ground-truth classes derived from the color-coded segmentation maps. The background label in the column indicates predictions falling outside any labeled region within the tolerance radius, whereas the background label in the row represents missed ground-truth objects (false negatives).
Figure 7. Confusion matrices for the 130-pixel (left) and 0-pixel (right) spatial tolerance thresholds (the different shades of blue show the scale of the numbers, as in the previous figure).
Figure 8. Images analyzed by Llama-4 Maverick for which the found objects were marked with a 200-pixel tolerance distance (green labels and circles show a correct detection, while red shows a false detection).
Figure 9. Values of precision, recall and accuracy for the whole set of images when considering each distance-tolerance level.
Figure 10. Heatmaps for precision by category and tolerance (top-left), recall by category and tolerance (top-right), and accuracy (bottom).
Figure 11. On the left, the labels assigned by Llama-4 (green circles show true positives, and red circles show false positives), and on the right, the manually labeled image showing more occurrences of objects and a larger area covered by such objects (the colors indicate the type of object).
Figure 12. On the left, the results obtained with a tolerance of 130 pixels; on the right are the results obtained with a tolerance of 200 pixels (green labels and circles show a true value, while red shows a false detection).
Figure 13. Examples of images containing incorrect counts (above) or hallucinatory phenomena (below).
Table 1. The set of colors identifying each category of objects, as assigned by means of Roboflow, and the number of occurrences of objects in each category, for a total of 2594 objects.
Color | Category | Occurrences
[swatch] | Citrus groves | 687
[swatch] | Trees | 407
[swatch] | Houses | 406
[swatch] | Wells | 68
[swatch] | Roads | 377
[swatch] | Fields | 257
[swatch] | Olive groves | 239
[swatch] | Meadows | 153
Table 2. Categories detected by Llama-4.
Category | Variants
groves or trees | grove, citrus grove, citrus groves, citrus_grove, citrus_groves, olive grove, olive groves, olive_grove, olive_groves, tree, trees
houses | building, house, houses
fields | field, fields, soil, land, lands
roads | road, roads
wells | well, wells, pond, water
meadows | meadow, meadows, grassland, grass
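The grouping in Table 2 can be read as a lookup from the label variants returned by the model to the six evaluation categories. The sketch below is one possible way to encode it, not the authors' implementation: the function and variable names are hypothetical, and the lowercasing and whitespace trimming are assumptions about how raw labels might be cleaned.

```python
# Variants listed in Table 2, grouped by evaluation category.
CATEGORY_VARIANTS = {
    "groves or trees": [
        "grove", "citrus grove", "citrus groves", "citrus_grove",
        "citrus_groves", "olive grove", "olive groves", "olive_grove",
        "olive_groves", "tree", "trees",
    ],
    "houses": ["building", "house", "houses"],
    "fields": ["field", "fields", "soil", "land", "lands"],
    "roads": ["road", "roads"],
    "wells": ["well", "wells", "pond", "water"],
    "meadows": ["meadow", "meadows", "grassland", "grass"],
}

# Reverse index: variant -> category.
VARIANT_TO_CATEGORY = {
    variant: category
    for category, variants in CATEGORY_VARIANTS.items()
    for variant in variants
}


def normalize_label(raw_label):
    """Map a label returned by the model to its evaluation category.

    Returns None when the label is not one of the listed variants.
    """
    return VARIANT_TO_CATEGORY.get(raw_label.strip().lower())


normalized = [normalize_label(label)
              for label in ("citrus_groves", "Pond", "building", "car")]
```

Labels outside the table, such as the hypothetical "car" above, map to None and would need a separate decision on whether to ignore them or count them as errors.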
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Calcagno, S.; Scaletta, E.; Tramontana, E.; Verga, G. Aerial Image Analysis: When LLMs Assist (And When Not). Future Internet 2026, 18, 77. https://doi.org/10.3390/fi18020077