1. Introduction
Large language models (LLMs) have been widely used to assist developers in several tasks, e.g., assessing the right level of security and privacy [1], analyzing cybersecurity posts [2], quickly implementing code [3], summarizing documents, translating short texts, image captioning [4], education and healthcare assistance [5], and annotating images [6,7]. Although LLMs exhibit remarkable proficiency in mimicking some human tasks related to the production of natural language, their application in remote sensing and satellite image interpretation has just begun. Current LLM-based systems are still unable to fully extract or comprehend information from aerial images and may generate inaccurate results. This is likely due to the lack of domain-specific knowledge, which limits their usability in this context [8]. In fact, some man-made objects seen in aerial images, e.g., airplanes and trucks, can be detected thanks to their characteristic contours and their prominence in the data used to train LLMs. Conversely, other man-made objects and especially natural objects are much more difficult to detect [9,10,11]. Previous approaches have detected man-made objects, such as cars or airplanes, in aerial images using multi-layer neural networks [12,13]. Although these approaches proposed extensive datasets and trained neural models, their experiments focused on man-made objects, whereas ours aims primarily at detecting natural objects, such as trees and uncultivated land. In another experiment, deforestation was detected using a deep learning model and images that were annotated according to forest loss [14].
Unlike previous research, the objective of our experiments is to test the ability of an out-of-the-box LLM to detect several natural objects and some man-made objects in aerial images. This can be valuable support for achieving an advanced level of automation when large areas need to be monitored for security or maintenance purposes and to discover unusual conditions that may require prompt intervention. In the proposed scenario, automation is achieved by a set of drones that scan a large area and capture images; an LLM then helps identify the condition of the objects of interest in the captured images, so that a team can be sent to a specific location for maintenance when needed.
This paper aims to assess the ability of LLMs to detect natural objects in aerial images and label these objects according to given categories. To this end, we propose an approach that provides an LLM with real-world aerial imagery and asks it to provide detailed information about detected objects for the provision of assistance services, such as territorial monitoring. The extracted information concerns large uncultivated lands, private properties (e.g., land plots) or infrastructure in the vicinity of some residence (e.g., roads and adjacent spaces). The main functionalities of remote sensing in this context include (i) detection of objects present in the area; (ii) assessment of visibility conditions; and (iii) position of objects in the area. In a real-world use case, the assistance service is supported by a network of drones responsible for acquiring images in real time to monitor the territory. For our study, we used Llama-4 since it offers state-of-the-art capabilities and is the latest version available for the Llama series (it was released in April 2025). We have asked Llama-4 to label aerial images that predominantly have green areas, such as orchards and secondary roads.
The main contribution of this paper is a thorough evaluation of an available off-the-shelf LLM, i.e., Llama-4 Maverick, against the problem of identifying the categories of vegetation and distinguishing it from uncultivated areas or roads from aerial RGB images. For this to be achieved, the following steps were needed, whose outcomes are by themselves useful for further studies: (i) the composition and evaluation of a new dataset, together with the results of the analysis provided by Llama-4; (ii) a benchmark consisting of a complex set of labels for images, given that the annotated objects involve intricate land-cover categories, which were not explored in previous studies; (iii) a set of captioning labels for the above image set; and (iv) a comprehensive analysis that aims at object recognition, counting objects, and spatial position analysis of objects in aerial images.
Llama-4 delivered encouraging results in a coarse analysis. However, it exhibited marked limitations in (i) distinguishing similar terrain types, (ii) pinpointing object locations, and (iii) accurately counting object occurrences. Although it was somewhat reliable for a preliminary assessment, many results were inconsistent and often incorrect when finer-grained details were expected. This finding underscores the significant shortcomings that Llama-4 still faces in detailed remote-sensing tasks.
The remainder of the paper is structured as follows:
Section 2 discusses related work.
Section 3 shows a practical usage scenario.
Section 4 presents the dataset, the labeling of images via Roboflow, and the coloring of the areas.
Section 5 analyzes the dataset using Llama-4 Maverick.
Section 6 illustrates the results achieved.
Section 7 describes the strengths and weaknesses of the model.
Section 8 draws our conclusions.
2. Related Works
Recently, multimodal large language models (MLLMs) such as Llama [15], Large Language and Vision Assistant (LLaVA) [16], Otter [17], and PandaGPT [18] have shown strong capabilities in processing and interpreting images via structured text-based queries, achieving considerable success in general-purpose visual language tasks. These models analyze terrestrial images, i.e., images acquired by cameras or mobile phones, while the analysis of aerial (or satellite) images represents a decidedly more complex and distinct area.
Satellite image analysis has been extensively studied [19,20,21,22,23,24,25,26], yet it remains an active field of research due to ongoing challenges and newly emerging requirements. This area demonstrates significant potential in multiple application domains, including traffic monitoring and management [21], land use and land cover change detection [23], and real-time disaster identification and response [24,25]. However, readily available satellite image collections focus on urban areas and are typically used to analyze, e.g., vehicles or houses [27,28,29]; hence, vegetation analysis on such images is minimal. As a result, the usefulness of satellite images is limited to applications where it is crucial to detect traffic jams or the occurrence of fire, as none of the available datasets focuses on green areas or aims to distinguish between categories of trees or other types of vegetation.
Some datasets oriented toward aerial imagery have been previously curated and include green areas such as grasslands, trees, or agricultural fields; however, the corresponding research studies pursue different goals, and these areas are not associated with predefined categories. Two extensive datasets, i.e., xView [12] and DIOR [13], provide labels for man-made objects, and their authors built and trained deep neural network models to detect such objects. In other experiments, the ForestNet dataset was proposed to detect forest loss, and for this goal, aerial images were annotated accordingly [14]. Unlike such previous experiments, our goal is to detect natural objects in aerial images and to evaluate an out-of-the-box LLM on this task. Since the existing datasets were not appropriate for our experiments, we built an ad hoc dataset of images.
The first dataset created to focus on satellite images of green areas was described in [30]. The authors accumulated a set of images centered on green zones and trained an artificial intelligence algorithm to recognize objects in complex imagery. This demonstrated that a well-trained AI system can perform effectively even on multi-label, complex, real-world images; however, it requires an extensive and detailed image collection and annotation process. This dataset was partially used in our work, and an additional coloring step was applied to the images to validate our data.
One major limitation of current LLMs is their inability to reason directly over a visual input, which often leads to inaccurate results. As discussed, e.g., by Du et al. [8], LLMs struggle to learn effectively from data involving forest imagery. A core challenge when processing a visual input is object detection. Although modern detectors perform well on conventional remote-sensing benchmarks, their reliance on a fixed set of predefined classes limits their applicability in real-world settings. For example, a model trained to detect aircraft could be unsuitable for recognizing ground vehicles [11]. Ju et al. found that while vision-oriented language models can identify single objects with high accuracy, their performance drops sharply in multi-object scenarios: accuracy falls from nearly 100% to below 15% in complex scenes [31]. This is a critical limitation in real-world image analysis, where scenes often contain a wide variety of objects (i.e., multi-label imagery).
Another limitation is related to the generation of semantically rich vision-based explanations. He et al. [9] introduced an automatic captioning pipeline that guides LLMs to describe object annotations in remote-sensing images, thus showing the value of captions for rapid visual interpretation. However, their study was limited to a narrow range of objects (primarily houses and vehicles) and did not address more diverse scene elements. Our work extends this line of research by designing prompts that generate captions for a broader variety of objects and spatial patterns, thus providing a more comprehensive visual summary. Recently, Komurcu et al. [10] evaluated several LLMs for satellite image interpretation tasks, reporting an average accuracy of 43%. Their study underscores the difficulties arising when analyzing multi-label scenes, and it has two key limitations: (i) the results are not broken down by class or scene type, and (ii) their most complex dataset includes images of only 512 × 512 pixels. In contrast, our experiments use 1024 × 1024 pixel imagery, introducing denser scenes with a higher number of labels per image; hence, they more accurately reflect real-world complexity.
As an alternative to LLMs, pixel-based approaches are commonly employed for the analysis of satellite imagery [32,33]. However, these approaches present two main limitations: (i) they cannot incorporate additional sources of information, such as textual descriptions or contextual metadata, which limits their semantic understanding; and (ii) remote sensing images often exhibit high variability in resolution, scale, and spectral characteristics, which makes it difficult for traditional methods to generalize across different datasets [34]. Still, pixel-based techniques continue to be studied and adopted due to the current challenges and limitations associated with the use of LLMs. According to [35], current LLMs struggle with the spectral data typical of satellite images, while being better suited for RGB aerial images. That work showed that, when spectral data are used, ad hoc training of a model is necessary to obtain a high level of accuracy. Our work instead focuses on evaluating an out-of-the-box LLM, without further training.
Multimodal LLMs can perceive multiple modalities, including text, images, and audio, and generate free-form textual responses in zero-shot and/or few-shot settings. BLIP-2 employed a trained bridging module while keeping the backbone networks frozen [6], whereas KOSMOS-2 introduced explicit grounding through location tokens and web-scale training with spatial supervision [7]. Both approaches demonstrated that the integration of LLMs enables zero-shot text-driven image understanding, allowing object identification and, in the case of KOSMOS-2, spatial localization. In contrast, our contribution focuses on evaluating the accuracy of results; it does not rely on pre-training or task-specific training. Moreover, we test the LLM on aerial imagery, a domain that is often underrepresented during model training, and perform evaluations based on pixel-wise ground truth and spatial tolerance criteria. We then provide a detailed analysis of the results, including semantic ambiguities and hallucinations. Overall, our work highlights when an LLM is useful and when it is not, providing an example of its application in real-world scenarios such as land monitoring.
3. A Drone Network for Image Analysis Assistance
The proposed solution can be used when a large collection of aerial images must be analyzed to detect specific objects, a task for which LLM-based tools, in particular Llama, can be helpful. Once the areas of interest are known (for example, citrus orchards), these can be further documented using drones [36,37], thereby enriching the available information with greater spatial detail and real-time data acquisition. This integration enables accurate territorial documentation that can be shared among users, providing intelligent and adaptive feedback on the monitored regions. In this section, we describe the integration of our approach with a drone network and how such integration allows the system to acquire new data that can be used to generate and disseminate feedback to interested users. Each drone is equipped with a high-resolution camera and an onboard computing unit, i.e., a Raspberry Pi [38]. Dedicated scripts running on the drone store the captured images locally and subsequently upload them to a remote server whenever a WiFi connection becomes available. This design enables each drone to autonomously capture multiple images of the surveyed area and transmit them efficiently once connectivity is restored. The server-side processing pipeline involves querying Llama to analyze uploaded images, identify objects, classify them according to predefined categories, and infer their conditions. The LLM then generates a description and contextual feedback regarding the observed objects or environmental features.
The overall workflow, illustrated in Figure 1, is as follows: (i) satellite images are analyzed through LLM queries to detect areas and objects belonging to the studied categories; (ii) the coordinates of the identified objects and areas of interest are transmitted to a drone network, which performs targeted image acquisition; (iii) captured images are stored locally on the drones (on SD cards) and compressed into a single archive; (iv) once the acquisition is complete, drones return to a base where WiFi connectivity allows images to be transferred to the server using the Secure Copy Protocol (SCP) [39]; (v) the server analyzes the images to detect unfavorable or anomalous conditions by re-engaging the LLM (e.g., evaluating citrus orchard health, identifying abandoned or uncultivated areas, assessing the state of roads or wells); and (vi) according to the previous result, the server generates informational alerts (positive or negative) that are sent to authorized users via email or a notification, including the related image as evidence.
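Steps (iii) and (iv) of the workflow can be sketched as follows. This is a minimal illustration under stated assumptions, not the scripts actually deployed on the drones: the function names `archive_images` and `build_scp_command`, the `.jpg` extension, the gzip-compressed tar format, and the remote target string are all hypothetical.

```python
import tarfile
from pathlib import Path


def archive_images(image_dir, archive_path):
    """Pack every .jpg file found in image_dir into one gzip-compressed tar archive.

    Returns the list of file names stored in the archive.
    """
    names = []
    with tarfile.open(str(archive_path), "w:gz") as archive:
        for image in sorted(Path(image_dir).glob("*.jpg")):
            # Store only the file name so the archive unpacks flat on the server.
            archive.add(str(image), arcname=image.name)
            names.append(image.name)
    return names


def build_scp_command(archive_path, remote):
    """Return the scp argument list for uploading the archive.

    `remote` is a hypothetical target such as "user@server:/data/incoming/".
    The command is only built here; it could be executed with
    subprocess.run(command, check=True) once WiFi is available.
    """
    return ["scp", str(archive_path), remote]
```

Building the command separately from running it keeps the upload step easy to retry when connectivity is intermittent.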
When step (v) above was executed, to analyze the captured images, the prompt used was the following:
| “Analyze this satellite image and describe in detail the condition of the crops present in it. Indicate: which crops are present in each plot; what growth stage or phenological cycle they are in; whether there are visible signs of water or nutritional stress (e.g., yellowing leaves, dry patches); any areas affected by pests or diseases; abandoned or uncultivated areas; traces of recent fires or burns; conditions of access roads to fields; conditions of wells.” |
Figure 2 shows an example of a picture given to the LLM, and the following are some results presented by the LLM as a reply to the above prompt:
Visible Signs of Stress: “Some plots show signs of stress, such as yellowing or dry leaves, particularly in the lower-left section of the image. This could indicate water or nutritional stress.”
Abandoned or Uncultivated Areas: “One plot in the lower-left section appears uncultivated or abandoned, as it lacks the uniform rows of trees observed in other plots.”
Access Roads to Fields: “The access roads to the fields are clearly visible, suggesting that they are reasonably maintained.”
Conditions of Wells: “A well is visible near the center. A small dark blue or black patch in the bottom-right area may indicate stagnant water.”
It should be noted that such feedback may be positive or negative, depending on the conditions identified in the areas analyzed. Therefore, the overall system shows how the combination of satellite imagery, drone-based data acquisition, and LLM reasoning can enhance the monitoring and management of territorial resources through adaptive, data-driven feedback mechanisms.
4. Dataset Creation and Annotation
A specific dataset was created to collect aerial images showing several categories of vegetation and man-made objects. The dataset was built by extracting images from Google Maps in selected areas that satisfied the desired requirements in terms of categories and varieties. The images are well curated; hence, there is no need to denoise blurry images [40]. The created dataset was then used to query the model. The dataset consists of 100 real-world images, a subset of which was taken from the dataset in [30], while the others were obtained directly from Google Maps. Each image was analyzed by domain experts, who annotated man-made objects and natural areas to accurately locate and distinguish each occurrence. The total number of objects in the whole set of images is 2594, and the number of objects in each category is shown in Table 1. The selected images have two key characteristics: (i) they are high-resolution, and (ii) each image contains at least four different objects, so at least four distinct labels were assigned to it. Therefore, each image is multi-label.
The pipeline consists of the following steps: image extraction, image preprocessing, image annotation, and coloring. First, images were extracted from Google Maps. This source was chosen because its images have a high resolution, useful for visual analysis, and because it offers easy access to images of large areas. The images were taken at zoom level 21 (the maximum possible zoom) to capture as many details as possible. The size of the downloaded images was 10,240 × 7680 pixels, and each pixel corresponds to approximately 0.076 m on the ground. The geographical area of interest was Italy. Second, the images were divided into four portions, and each portion was 1024 × 1024 pixels. In this way, we obtained several detailed images, each smaller than 1.5 MB (a limit imposed by the Llama model used in a subsequent step). Each image was then encoded in base64 (as needed to send it to Llama-4). Listing 1 shows a Python 3.13 code snippet that converts an image to base64 format.
Listing 1. Python snippet for encoding images in base64.

```python
import base64


def encode_image(image_path):
    # Read the image as raw bytes and return its base64 text representation.
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")
```
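The size limit mentioned above could be enforced before encoding, as in the following sketch. It assumes that "1.5 MB" means 1.5 × 10^6 bytes and that the limit applies to the raw image file rather than to its base64 form; neither detail is stated in the text. The name `encode_image_checked` is hypothetical.

```python
import base64
from pathlib import Path

# Assumption: the 1.5 MB limit is read as 1.5 * 10**6 bytes of raw file content.
MAX_IMAGE_BYTES = 1_500_000


def encode_image_checked(image_path, max_bytes=MAX_IMAGE_BYTES):
    """Return the base64 text of an image, refusing files at or above max_bytes."""
    raw = Path(image_path).read_bytes()
    if len(raw) >= max_bytes:
        raise ValueError(
            f"{image_path} has {len(raw)} bytes; expected fewer than {max_bytes}"
        )
    return base64.b64encode(raw).decode("utf-8")
```

Note that base64 encoding enlarges the payload by roughly one third, so a stricter threshold may be needed if the limit applies to the encoded request.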
Finally, images were labeled with Roboflow (https://roboflow.com/, accessed on 19 January 2026) by tracing polygons that follow the contours of each object occurrence. Roboflow [41] is a cloud-based platform designed to simplify the development of computer vision models, especially by streamlining dataset management and model training for several tasks, such as object detection, image classification, and segmentation. It provides support for manual and semi-automatic image annotation, greatly easing the labeling process [30,42].
Roboflow makes it possible to manually select the contours of an object or a natural area and assign a label to it. In our experiment, the labeled objects belong to one of eight different categories. Each category was given a color, as listed in Table 1.
Figure 3 shows an example of several natural areas (on the left), which were labeled by marking their contours. The labels given are citrus groves (red), wells (yellow), roads (purple), land (cyan) and trees (green).
A color-coding scheme was used to assign a specific color to each area based on its semantic category. Hence, for validation purposes, the category of each pixel can be unambiguously determined from its color. This paves the way for the validation of the outputs generated by the Llama model. For example, RGB (128, 0, 128) represents a road, RGB (255, 255, 0) a well, and RGB (173, 216, 230) an olive grove.
Figure 3 shows on the right the colored pixels for the identified areas (citrus groves, wells, roads, fields and trees).
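The color-to-category lookup described above can be sketched as follows. Only the three RGB triples quoted in the text are included; the colors of the remaining categories are listed in Table 1 and are not reproduced here. The segmentation map is assumed to be a plain nested list of RGB tuples indexed as `segmentation[row][column]`, and the names `COLOR_TO_CATEGORY` and `category_at` are hypothetical.

```python
# Colors quoted in the text; the other categories would be added from Table 1.
COLOR_TO_CATEGORY = {
    (128, 0, 128): "road",
    (255, 255, 0): "well",
    (173, 216, 230): "olive grove",
}

# Label returned for pixels whose color matches no annotated category.
BACKGROUND = "background"


def category_at(segmentation, x, y):
    """Return the category of pixel (x, y) in a color-coded segmentation map.

    segmentation[y][x] is expected to be an (R, G, B) triple.
    """
    color = tuple(segmentation[y][x])
    return COLOR_TO_CATEGORY.get(color, BACKGROUND)
```

Exact color matching works only if the colored image is stored losslessly; a lossy format could shift pixel values slightly and would require a nearest-color lookup instead.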
5. Dataset Processing by Means of Llama-4
To identify and locate objects in aerial images, we queried Llama-4 Maverick, a mixture-of-experts model with 17 billion active parameters and 128 experts, which is reported to offer strong performance on multimodal tasks such as image recognition. The results were analyzed to determine how Llama-4 Maverick performs.
Figure 4 illustrates the workflow adopted in this study. For each image, the model was tasked with the identification of each distinct area by means of a label (or a short caption) and the coordinates of the bounded area. The output was then compared with the color-coded image that represents the ground truth, and this step let us determine the precision of the model.
To run Llama-4, we used Groq, a model provider that offers fast inference for open-source models (https://groq.com/llama-4-now-live-on-groq-build-fast-at-the-lowest-cost-without-compromise/, accessed on 19 January 2026). Replies from models served by Groq are nearly instantaneous, which was very convenient for our analysis. Listing 2 shows a Python code snippet that queries Llama-4 through Groq, giving it an image to analyze.
Listing 2. Python snippet to query an LLM given an image as input.

```python
from groq import Groq

# API_KEY, prompt, and base64_image are assumed to be defined beforehand.
client = Groq(api_key=API_KEY)
chat_completion = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {
                 "url": f"data:image/jpeg;base64,{base64_image}",
             },
            },
        ],
    }],
    model="meta-llama/llama-4-maverick-17b-128e-instruct",
)
# The model reply is the text content of the first choice.
response = chat_completion.choices[0].message.content
```
The said model was run and queried using the Groq cloud provider, hence using Groq APIs. The default settings were applied, i.e., the sampling temperature was set to 1.0 and the nucleus sampling parameter (top_p) to 1.0, according to the provider’s documentation. This choice avoids forcing some outputs and allows the model to generate responses with natural variability, without restricting the probability mass of candidate tokens.
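To make the sampling settings explicit rather than relying on provider defaults, the request could be assembled as in the sketch below. The helper `build_request` is hypothetical; it only builds a dictionary of keyword arguments, using the `temperature` and `top_p` parameters of the chat completion call with the values reported above.

```python
def build_request(prompt, base64_image, temperature=1.0, top_p=1.0):
    """Build keyword arguments for a chat completion request with one image.

    The defaults mirror the settings reported in the text
    (temperature 1.0, top_p 1.0).
    """
    return {
        "model": "meta-llama/llama-4-maverick-17b-128e-instruct",
        "temperature": temperature,
        "top_p": top_p,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
            ],
        }],
    }
```

With a client created as in Listing 2, the request would be sent with `client.chat.completions.create(**build_request(prompt, base64_image))`. Recording these values alongside each reply makes later runs easier to compare, since provider defaults may change over time.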
The prompt given to the model consisted of the following text:
| “Detect all objects and areas in the image, including citrus groves, olive groves, houses, roads, wells, meadows, fields, and trees. For each object, specify its type and the coordinates of its center in normalized format, expressed as [x_center, y_center], where both values range from 0 to 1. Output a CSV file only, without any explanation, markdown, or code.” |
The construction of the prompt was carried out iteratively, by means of exploratory tests conducted on a small subset of images, with the objective of identifying a formulation that was reproducible and suitable for subsequent quantitative analysis. In an initial phase, the prompt consisted of asking the LLM to describe the image and identify the presence of some categories of interest (citrus groves, olive groves, roads, wells, etc.). Although this formulation produced semantically rich descriptions, the output was unstructured and difficult to analyze automatically.
Then, the prompt was progressively refined by introducing an explicit request for spatial localization of objects and an output in CSV format, without explanations or markup, to reduce lexical variability, simplify the automatic extraction of results, and make category normalization more robust. In particular, the model was asked to provide, for each recognized object or area, an approximate position in the form of normalized coordinates [x_center, y_center] relative to the provided image, with values ranging between 0 and 1, a choice consistent with the need to compare predictions with reference annotations.
Other tests were performed to ask for the generation of bounding boxes for the detected objects; however, a preliminary check showed that the coordinates of the object center were closer to the target object (hence more accurate) than the bounding boxes. Since the objective of the analysis was to check the presence of at least one point within the correct area, the coordinates proved to be more consistent with the adopted evaluation protocol. The final prompt adopted, therefore, requested the model to identify all objects or areas belonging to a closed set of categories of interest and to return, for each instance, the category name and the center coordinates in normalized format in a CSV file (the final prompt is shown in the box above).
Llama-4 Maverick answered our requests and created a CSV file for each image. The file consists of data having the format [label, x, y], where label equals the detected category and x and y correspond to a point in the area or object. Using Maverick, the resulting CSV file was well structured and defined. Upon analyzing the results, we observed that the model frequently responded using synonyms for identified categories or slight variations in the expression (e.g., singular vs. plural forms). For this reason, we performed a further step consisting of normalizing the output by grouping several equivalent text expressions, enabling a consistent analysis of the results.
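A reply in the [label, x, y] format can be parsed as in the following sketch, which also converts the normalized coordinates into pixel indices. It assumes a 1024 × 1024 image and that header rows or malformed rows may occasionally appear and should be skipped; the function name `parse_detections` is hypothetical.

```python
import csv
import io


def parse_detections(csv_text, image_size=1024):
    """Parse 'label,x,y' rows with normalized coordinates into pixel positions.

    Rows that do not have three fields, whose coordinates are not numbers
    (e.g., a header line), or whose coordinates fall outside [0, 1] are skipped.
    """
    detections = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue
        label = row[0].strip()
        try:
            x, y = float(row[1]), float(row[2])
        except ValueError:
            continue  # typically the header line
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            continue
        # Clamp so that a coordinate of exactly 1.0 maps to the last pixel.
        px = min(int(x * image_size), image_size - 1)
        py = min(int(y * image_size), image_size - 1)
        detections.append((label, px, py))
    return detections
```

For example, the row `road,0.5,0.25` becomes `("road", 512, 256)` for a 1024 × 1024 image.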
Table 2 shows the category normalization schema adopted, where for each category we give its recognized text expression variants. The macro category “groves or tree” includes the three categories that were initially assigned to images with labels using Roboflow: citrus grove, olive grove, and tree. This grouping was necessary due to the model’s tendency to use these terms interchangeably, without distinguishing between them; hence, the output precision observed was lower than expected.
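The normalization step can be illustrated with the sketch below. The grouping of citrus grove, olive grove, and tree into "groves or tree" follows the text; the handling of case, underscores, and plural forms is an assumption, and the full list of variants in Table 2 is not reproduced. The names `MACRO_CATEGORY`, `OTHER_CATEGORIES`, and `normalize_label` are hypothetical.

```python
# Categories merged into one macro category, as described in the text.
MACRO_CATEGORY = {
    "citrus grove": "groves or tree",
    "olive grove": "groves or tree",
    "tree": "groves or tree",
}

# Remaining categories named in the prompt, in singular form.
OTHER_CATEGORIES = {"house", "road", "well", "meadow", "field"}


def normalize_label(label):
    """Map a raw model label to its normalized category, or None if unknown."""
    text = " ".join(label.lower().replace("_", " ").split())
    # Naive singularization (assumption): drop one trailing "s".
    if text.endswith("s"):
        text = text[:-1]
    if text in MACRO_CATEGORY:
        return MACRO_CATEGORY[text]
    if text in OTHER_CATEGORIES:
        return text
    return None
```

Returning `None` for unrecognized labels lets the analysis count them separately instead of silently assigning them to a category.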
Figure 5 shows an example in which the model correctly identified a road and two houses; however, it did not find a citrus grove, but only trees.
The model outputs were systematically analyzed to assess the validity of the results. For each response to the prompt, each detected object (or area) was analyzed. An object (or region) was deemed correctly detected if its spatial coordinates fall within an area of the segmentation map labeled with the color corresponding to the object's assigned category. For example, suppose that the model has detected a house at coordinates (x, y); for the result to be considered correct, these coordinates must fall within a blue region of the reference segmentation map (generated via Roboflow), where blue corresponds to the category house in the predefined color-to-class mapping. This evaluation step is crucial to quantitatively and qualitatively assess the model's performance.
After checking the category and position of each detected object by comparing it with the color map of the ground truth, a code snippet was developed to visualize the results by adding to each image a circle and a label that identify each detected object. Each circle was centered at the coordinates (x, y) suggested by the model, with a radius ranging from 0 to 200 pixels (see Section 6 for details). The circle was colored green if the category was detected correctly and red otherwise.
6. Results
For our tests, 100 images, which were manually labeled for validation purposes, were given to Llama-4 Maverick with the task of finding each object occurrence. The images show aerial scenes with green areas, typically cultivated landscapes. For each image, the set of objects found by prompt-based queries (see Section 5) was assessed according to its correspondence with the color-based annotations (see Section 4).
Initial tests revealed that the predicted points were not always precisely located within the target objects but were sometimes positioned near their boundaries. To address this, multiple validation strategies were devised to provide a more accurate assessment of the model behavior: three different validation tests were performed, each with a varying tolerance threshold for spatial accuracy, reflecting the fact that coordinates are often approximate rather than exact. The tolerance threshold ranges from 0 to 200 pixels; beyond this threshold, which is approximately 15 m, the detected points were considered incorrect.
The three evaluation scenarios are defined as follows:
1. The 0-pixel tolerance point verification: the coordinates of the object found by the model lie within the area indicated by the ground truth, i.e., the color of the pixel at those coordinates corresponds to the suggested category. This is the strictest and most precise case.
2. The 130-pixel tolerance point verification: the point detected by the model is close to the reference category area, though not necessarily within it. The distance between the suggested coordinates and the labeled object is at most approximately 10 m, which corresponds to 130 pixels. This case is less precise, but still considered valid.
3. The 200-pixel tolerance point verification: the point detected by the model is at most approximately 15 m (200 pixels) away from the object indicated in the ground truth.
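The three scenarios can be expressed with a single check parameterized by the tolerance radius, as in the sketch below. It assumes the ground truth is available as a nested list of category names indexed as `label_map[row][column]` and uses the ground resolution of 0.076 m per pixel reported in Section 4; the function names are hypothetical, and the brute-force scan is chosen for clarity rather than speed.

```python
# Ground resolution reported in Section 4 (meters per pixel).
METERS_PER_PIXEL = 0.076


def pixels_to_meters(pixels):
    """Convert a pixel distance into an approximate ground distance in meters."""
    return pixels * METERS_PER_PIXEL


def matches_within(label_map, x, y, category, tolerance):
    """Return True if a pixel of `category` lies within `tolerance` pixels of (x, y).

    With tolerance 0, only the pixel at (x, y) itself is inspected.
    """
    height, width = len(label_map), len(label_map[0])
    for row in range(max(0, y - tolerance), min(height, y + tolerance + 1)):
        for col in range(max(0, x - tolerance), min(width, x + tolerance + 1)):
            inside_circle = (col - x) ** 2 + (row - y) ** 2 <= tolerance ** 2
            if inside_circle and label_map[row][col] == category:
                return True
    return False
```

With this resolution, 130 pixels correspond to about 9.9 m and 200 pixels to about 15.2 m, which matches the approximate distances of 10 m and 15 m used in the scenarios.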
Across the three tolerance settings (0, 130, and 200 pixels), the confusion matrices (see Figure 6 and Figure 7) provide a consistent view of the model's behavior under increasing spatial slack. As the tolerance radius increases, the values on the diagonal increase, indicating that a larger fraction of predictions falls within (or close to) the correct ground-truth regions, while the background column decreases, meaning fewer predictions are completely unrelated to any labeled area. Conversely, the background row (false negatives) remains substantial even when the tolerance is set to 200 pixels, highlighting that many annotated regions are not covered by any predicted point within the allowed radius. Overall, these trends confirm that the main limitation is imprecise spatial localization (the LLM gives one coordinate for the identified object, not its area).
To clarify the numbers in the results, it should be considered that the LLM may return coordinates and labels several times for a large area that is labeled as a single object in the ground truth (see, e.g., Figure 8, second row, right column, where the label groves or trees is given several times for a large area). Under this condition, the number of objects classified as true positives (TPs), which appear on the diagonal of the confusion matrix, does not equal the number of ground-truth polygons. Multiple LLM predictions may fall within the same ground-truth polygon and are therefore all counted as correct detections if they satisfy the spatial and semantic criteria. This choice reflects the objective of our study, which focuses on assessing whether the model can identify the presence and approximate location of relevant objects or areas, rather than performing precise instance-level detection.
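The counting rule described above, in which every prediction that lands in a matching region counts as a true positive, can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code used in the study: `ground_truth_at` is a hypothetical callable that returns the annotated category at a point (or "background"), and the sketch covers only the prediction side, so ground-truth regions that receive no prediction (false negatives) are not counted here.

```python
from collections import Counter


def confusion_counts(predictions, ground_truth_at):
    """Count (ground-truth category, predicted category) pairs.

    predictions: iterable of (label, x, y) tuples.
    ground_truth_at: callable (x, y) -> category name or "background".
    Several predictions inside the same region each add to the same cell.
    """
    counts = Counter()
    for label, x, y in predictions:
        counts[(ground_truth_at(x, y), label)] += 1
    return counts


def precision(counts, category):
    """Fraction of predictions of `category` that landed on that category."""
    predicted = sum(n for (_, pred), n in counts.items() if pred == category)
    if predicted == 0:
        return 0.0
    return counts[(category, category)] / predicted
```

For instance, three "groves or tree" predictions falling inside one large grove polygon yield a diagonal count of 3 for that category, even though the ground truth contains a single polygon.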
To evaluate the accuracy of the model, the coordinates corresponding to the detected objects were analyzed and validated against previously labeled and color-annotated images. This analysis generates visual outputs in which detected objects/areas are highlighted by circles with colored borders. These circles, which include the name of the detected category, serve as visual captions for quick interpretation of the results. The colors of the circles are defined as follows: a red-colored circle for the detected area that does not match the area previously labeled using Roboflow, and the point is considered incorrectly detected; a green-colored circle for the detected area that does match the area previously labeled using Roboflow, and the point is considered correctly detected.
Figure 8 presents some results that qualitatively illustrate the behavior of the model in different cases. Each image shown is only a portion of the full input image, cropped to highlight the meaningful detail. The tolerance applied is the maximum allowed, 200 pixels (the third scenario above). The first image (top left) shows regions labeled groves and trees that were detected correctly (three occurrences), while the categories roads and meadow were not accurately identified. The second image (top right) shows the labels groves and trees and wells, which were not correctly associated. Note that the wells category is not actually present in the image.
The third image (second row on the left) exhibits a high error rate: while a field object was identified correctly, there are four false positives, i.e., objects that are not present in the image. The three houses and the road detected by the model do not actually appear and would remain incorrect even under a relaxed spatial tolerance. This phenomenon is known as “hallucination” [
43,
44], which refers to the erroneous detection of objects that are not present in the input image. The fourth image (second row on the right) shows some objects that were detected correctly, such as groves or house; however, three road occurrences were incorrectly detected. This result leads to two notable observations: (i) there is only a single road in the image, yet the model identifies three of them, resulting in an overestimation of objects; (ii) all three predicted roads are located at an unreasonable distance from the actual road. In fact, despite the evaluation circle having a radius of 200 pixels, none of the predicted positions intersect the boundary of the real road.
The fifth image (third row on the left) shows that almost all the objects were detected correctly, yielding good results from the model. For the sixth image (third row on the right), the same terrain category was identified as two different objects, namely groves or trees and fields, although the visual inspection shows a uniform scene for all of them.
The seventh and eighth images (the last row) show similar examples of detection with many object occurrences that were identified correctly.
Based on the validation procedure described, precision, recall and accuracy metrics were computed for each test scenario.
Figure 9 shows the metric values and provides a quantitative overview of the model performance. The first scenario, with a distance tolerance threshold of 0 pixels, exhibits notably the lowest precision, recall and accuracy. In contrast, the third scenario achieves relatively high precision (approaching 80%), while recall (56.1%) and accuracy (58.6%) remain limited, though (as expected) they obtain the highest values among the three tolerance levels. The observed trend in the metrics indicates that the model is generally capable of detecting a substantial number of objects correctly within each image; however, as previously discussed, the predicted positions tend to be spatially inaccurate.
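For reference, the conventional definitions of the three metrics can be sketched as below. The formulas are the standard ones; how true negatives are counted in this evaluation is not restated in this section, so the `tn` argument and the example counts are assumptions made only for illustration.

```python
def detection_metrics(tp, fp, fn, tn=0):
    """Return (precision, recall, accuracy) from raw counts.

    Conventional definitions; the treatment of true negatives (tn) is an
    assumption and defaults to zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy

# Hypothetical counts, not taken from the experiments.
precision, recall, accuracy = detection_metrics(tp=80, fp=20, fn=60)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

Raising the tolerance turns some false positives into true positives, which is why all three values increase from the first scenario to the third.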
Figure 10 presents three heatmaps that show the precision, recall and accuracy metrics for each object category, evaluated in the three test scenarios. The precision values show that, in some cases, the increase in precision across tolerance levels is modest; e.g., the meadow category shows only a 55% improvement from tolerance 0 to tolerance 200. In contrast, the road category exhibits a substantial precision gain of 357.14%. This highlights the limited ability of the LLM to provide accurate coordinates of an object relative to the origin point of the given image. Overall, the highest precision is achieved in the “groves or trees” macro-category, which includes multiple object types (e.g., olive groves, citrus groves, and trees). These objects are more loosely shaped than others and have less defined contours, and such characteristics may have contributed to the resulting high values.
Figure 10 on the right shows the recall values, which were computed for the same categories and test scenarios. We note that the values improved as the tolerance increased; e.g., in the road category, recall significantly improved from 18.8% to 60.5%, representing a 236.11% increase. However, in the third scenario, recall across all categories is between 49.4% and 60.5%, indicating a consistent but limited ability.
Figure 10 on the bottom shows the accuracy values. Similarly to the other metrics, the values improve when the tolerance is higher. Moreover, the categories having the highest values (groves and houses) are the ones for which precision was highest. Accuracy should better convey the ability of the LLM to recognize true values. While for the groves category the accuracy reaches 70.6%, which can be considered quite good, the second-best value is just 41.4% for the houses category, which is not acceptable.
7. Discussion
Although multimodal LLMs such as Llama-4 have shown promising capabilities in analyzing satellite images, several limitations persist that reduce their reliability in real-world greenfield settings. The main critical issues are discussed below.
Undetected objects: one of the most common limitations is the inability to identify objects present in images. This problem occurs with objects that are partially obscured by other elements (e.g., roads) or that have unusual colors or shapes. As
Figure 11 shows, the model did not label some objects, such as a well, a house, and a plot of land. Similarly,
Figure 8 (bottom-right image) shows that the model did not label the road. Such situations lead to a significant loss of information and suggest a reduced sensitivity for the model. This is confirmed by the recall and accuracy values that are remarkably low (shown in the diagrams in
Figure 9 and
Figure 10).
Inaccuracy in spatial localization (coordinates error): Llama-4 gave an imprecise location for many identified objects, with discrepancies compared to their actual coordinates in the image. These errors arise from the difficulty of the model in correctly associating visual objects with their positions. This is confirmed by the precision values that improved as distance tolerance increased.
Figure 12 shows an image marked using the two distance-tolerance values (130 and 200 pixels) and highlights that the wells are correctly detected only in the second case, since the coordinates that the model returned for the wells are not close enough to the labeled positions.
Ambiguities in area classification: frequent classification errors were observed between areas with similar visual characteristics. E.g., the model confused trees with agricultural land, or olive groves with citrus groves. These semantic ambiguities indicate a limited ability of the model to understand the environmental context.
Indeterminacy in object counting: another critical issue detected is the inability of the model to accurately determine the number of objects present in a given area. Llama-4 tends to overestimate or underestimate the actual number, compromising quantitative analysis. E.g., the fourth image in
Figure 8 shows that four fields were detected; however, it can be observed that there are actually only two large fields. In another image, the last in
Figure 8, only two fields were identified by the model, whereas at least five fields are visible. The two images at the top of
Figure 13 illustrate examples of incorrect counts. In the first case, only one-third of the houses are identified (underestimation), while in the second case, a single grove is mistaken for four separate ones (overestimation).
Phenomenon of hallucination: hallucination in satellite imagery refers to the behavior of artificial intelligence systems that detect, describe, or invent objects, structures, or patterns that are not actually present in the analyzed satellite image. During the experiments, several images were labeled with nonexistent objects. E.g., for the image in the second row and the left column in
Figure 8, houses and roads were detected, even though they are not present in any part of the image. Moreover, the two images at the bottom of
Figure 13 illustrate examples of hallucinatory phenomena. In the first case, a well is detected, and in the second, a field is identified; in both instances, these objects do not appear anywhere in the image. These errors arise from incorrect spatial relationships or from confusion caused by noise, which leads the LLM to produce erroneous semantic interpretations of the image. Out of 100 images, approximately 20% exhibit hallucination phenomena, and the object category most frequently detected in these cases is wells. To mitigate this phenomenon, a filter could be put in place that, taking into account the known characteristics of the area, removes from the LLM’s answer the objects that cannot be in it.
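The mitigation suggested above could take the form of a simple post-processing step, sketched here under stated assumptions: the LLM answer has already been parsed into a list of detections with a label and pixel coordinates, and the categories that can plausibly occur in the surveyed area are known in advance. The function name, the detection format and the category list are hypothetical.

```python
def filter_implausible(detections, allowed_categories):
    """Split detections into those whose label is plausible for the area and the rest."""
    kept, removed = [], []
    for det in detections:
        if det["label"] in allowed_categories:
            kept.append(det)
        else:
            removed.append(det)
    return kept, removed

# Hypothetical area profile: an uninhabited zone with no wells or houses.
allowed = {"groves or trees", "field", "meadow", "road"}
detections = [
    {"label": "field", "x": 410, "y": 220},
    {"label": "well", "x": 900, "y": 130},
    {"label": "house", "x": 150, "y": 640},
]
kept, removed = filter_implausible(detections, allowed)
print([d["label"] for d in kept])
print([d["label"] for d in removed])
```

Such a filter only removes categories that are impossible for the area; it cannot catch a hallucinated object whose category is plausible there, such as a nonexistent field in farmland.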
8. Conclusions
The proposed solution analyzes aerial images and can possibly guide the deployment of a set of drones to collect better images of an area of interest. Automation in image analysis can be achieved by resorting to an LLM. This study presented a detailed analysis of Llama-4 Maverick when asked to detect objects in aerial images, with a particular focus on green areas. The images, sourced from Google Maps, depict real-world scenarios and, due to their large size, often contain multiple object categories within a single frame. Although the dataset used in this work consisted of only around 100 images, a relatively small sample, it contains 2594 objects and proved sufficient for an initial low-cost evaluation of the model. In general, it can be considered representative of a larger set of real-world aerial images.
It is remarkable that an LLM, having undergone extensive training and being useful for so many different tasks, can be applied to a field as specific as the detection of objects in aerial images, with somewhat useful results. More specifically, the results highlighted that several challenges need to be overcome to accurately identify objects and regions, and that some issues are related to the incorrect location of detected targets. Furthermore, the findings underscore key weaknesses of Llama-4 Maverick, which most likely did not receive sufficient training on such aerial images and on the objects we are interested in. Among the limitations that we have revealed, the suggestion of objects not present in the image is particularly noteworthy. This limitation could be mitigated by a filter assessing whether the named object is typical of the examined area.