Article

Visual Foundation Models for Archaeological Remote Sensing: A Zero-Shot Approach

1 Landauer AI Research, 71642 Ludwigsburg, Germany
2 Archaeology Centre, University of Toronto, Toronto, ON M5S 2S2, Canada
3 Institute of Behavioral Science, Center for Collaborative Synthesis in Archaeology, University of Colorado Boulder, Boulder, CO 80309, USA
* Author to whom correspondence should be addressed.
Geomatics 2025, 5(4), 52; https://doi.org/10.3390/geomatics5040052
Submission received: 11 August 2025 / Revised: 2 October 2025 / Accepted: 4 October 2025 / Published: 7 October 2025

Abstract

We investigate the applicability of visual foundation models, a recent advancement in artificial intelligence, for archaeological remote sensing. In contrast to earlier approaches, we employ a strictly zero-shot methodology, testing the hypothesis that such models can perform archaeological feature detection without any fine-tuning or other adaptation for the remote sensing domain. Across five experiments using satellite imagery, aerial LiDAR, and drone video data, we assess the models’ ability to detect archaeological features. Our results demonstrate that such foundation models can achieve detection performance comparable to that of human experts and established automated methods. A key advantage lies in the substantial reduction of required human effort and the elimination of the need for training data. To support reproducibility and future experimentation, we provide open-source scripts and datasets and suggest a novel workflow for remote sensing projects. If current trends persist, foundation models may offer a scalable and accessible alternative to conventional archaeological prospection.

1. Introduction: Foundation Models in Archaeological Remote Sensing

The term Foundation Model was popularized in 2021 by an article by Bommasani et al. [1] from the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Foundation models are characterized by their training on massive, broad datasets, often consisting of billions of elements, and their ability to generalize across a wide range of downstream tasks. In other words, these models are not limited to the purposes for which they were originally trained, but can adapt to new, unseen applications.
The most widely known examples of such models are large language models (LLMs), such as ChatGPT developed by OpenAI [2]. These models accept text as input and generate meaningful textual output. More recently, the field has seen the development of so-called multimodal foundation models, which extend input and output modalities to include combinations of text, images, video, and audio. This paper focuses specifically on a subset of multimodal foundation models that process imagery or video as input, referred to here as visual large language models (vLLMs).
This category has emerged only recently, made possible through advances in algorithms such as CLIP (Contrastive Language–Image Pretraining) by Radford et al. [3]. During training, CLIP connects images with written descriptions by analyzing hundreds of millions of captioned images. This allows the models to “understand” both pictures and text in a shared way. State-of-the-art systems such as GPT-4.1 [2] or Google Gemini [4] exemplify this new generation of multimodal models.
Importantly, these models are not limited to correlating images and text but are also capable of a degree of spatial reasoning—that is, the ability to identify and reason about the location of objects within images. They can detect, describe, and in some cases localize objects, thereby enabling new applications in domains where spatial context is critical.
It should also be noted that not all vision foundation models are text-conditioned. For instance, the Segment Anything Model (SAM) developed by Meta AI [5] operates on images and image coordinates instead.

1.1. Vision Foundation Models in Remote Sensing

Interest in foundation models within the remote sensing research community has increased significantly since around 2023, following the release of models such as Prithvi by IBM and NASA [6] and RemoteCLIP [7]. Since then, numerous foundation models for the remote sensing domain have been developed, although detailing them lies beyond the scope of this article. Readers are instead referred to recent survey articles such as those by Mai et al. [8], Xiao et al. [9] and especially Lu et al. [10], Huo et al. [11], and Tao et al. [12], which provide the most up-to-date perspectives.
A key observation across the majority of projects discussed in these surveys is the reliance on large-scale remote sensing datasets sourced from the domain of interest. These datasets are typically used either to train a foundation model from scratch or to fine-tune an existing model for domain-specific applications. The rationale behind this approach stems from the fact that general-purpose foundation models, such as ChatGPT and Gemini, are primarily trained on image data taken from the internet that include only a very limited number of remote sensing images—particularly with respect to multispectral or hyperspectral imagery. Early empirical results supported the view that such models, without adaptation, are not well suited to remote sensing tasks.

1.2. Applications in Archaeological Remote Sensing

Huo et al. [11] articulate this prevailing sentiment in the remote sensing community by stating that “fine-tuning is essential for foundation models adapting to specific domains…” (p. 5). This, however, presents particular challenges for sub-domains such as archaeological remote sensing. First, archaeological datasets are typically small, often consisting of only a few dozen identified sites, which is insufficient for training foundation models at scale. Second, the process of training or fine-tuning AI models requires significant technical expertise and computational resources—both of which are often unavailable to archaeological research teams.
These limitations may explain the relatively small number of published studies applying foundation models to archaeological remote sensing. Nonetheless, several efforts have begun to explore this intersection. Within the GeoPACHA project, for instance, a domain-specific model named DeepAndes was developed based on DINOv2. This model was trained on over three million multispectral satellite images to support archaeological investigations in the Andean region [13,14,15]. The work by Canedo et al. [16] is also notable, although the authors do not explicitly describe their model as a foundation model: their study used a large dataset of LiDAR data and orthophotos to train a vision transformer model capable of detecting ancient burial mounds.
However, some other researchers have challenged the assumption that training or fine-tuning is a prerequisite for effective application of foundation models in this field. Motivated perhaps by the lack of training data, they instead employ foundation models in a so-called “zero-shot” [17] setting, where no training at all is applied.
For example, Abate et al. [18] successfully utilized Sentinel satellite imagery in combination with ChatGPT, which was tasked with generating scripts for downstream analysis of archaeological features. Similarly, Ciccone [19] employed ChatGPT to develop a software assistant for detecting archaeological anomalies in multispectral orthophotos. Arnold and Tilton [20] examined the use of various multimodal foundation models, including ChatGPT and Google Gemini, to analyze image collections related to cultural heritage, with remote sensing data forming a subset of the imagery used in their study.

2. Foundation Models Used in This Study

We began by identifying a broad selection of visual foundation models based on selection criteria such as support for zero-shot inference, public availability, and recency. The initial shortlist included GPT-4.1 by OpenAI [2], Gemini 2.0-flash by Google [4], PaliGemma 2 [21], Qwen-VL 2.5 [22], DeepSeek-VL 2 [23], Microsoft Florence-2 [24], YOLO-World [25], and SAM 2.1 [5]. Preliminary experiments indicated that GPT-4.1, Gemini 2.0-flash, and SAM 2.1 consistently outperformed the other models in terms of detection quality. Consequently, to maintain clarity and focus, we restrict the presentation of results in this study to these three models.

3. Experiments

The first three experiments evaluate the efficacy of GPT-4.1 and Gemini 2.0-flash in detecting archaeological sites using satellite imagery and aerial LiDAR data. The final two experiments examine the applicability of SAM 2.1 to aerial LiDAR and drone-acquired video data.
Although all five experiments operate on image data, their designs differ substantially. Experiments 1–3 address the question “Given a certain image with defined dimensions, what type of object is shown?”, whereas the SAM experiments focus on “What shapes do the objects in this image have, regardless of their type?”.
Our goal with this diverse set of experiments is to highlight the wide range of foundation models and their potential application domains. In particular, we aim to demonstrate both the diversity of these models and their adaptability across different types of remote sensing data.
Readers are also encouraged to consult the accompanying supplementary materials (see Section 5) for a more comprehensive understanding of the results, particularly with respect to video output presented in Experiment 5.

3.1. Experiment 1: Detection of Castles in Bavaria, Germany in Satellite Imagery

3.1.1. Methodology

This experiment focused on detecting medieval castles and ruins in the German state of Bavaria. A dataset comprising 379 known castle coordinates was obtained from the Bavarian state heritage authority (Bayerische Schlösser- und Seenverwaltung). Satellite imagery was downloaded from Microsoft Bing, which offers image resolutions of up to 30 cm per pixel at little or no cost [26]. One of its limitations is that the actual resolution for any given location is not indicated, owing to Bing’s use of unified global imagery from undisclosed sources [27]. Visual inspection of the data suggested an effective resolution of approximately 50 cm per pixel, which was deemed sufficient for this study.
Each of the 379 castle locations was downloaded as a GeoTIFF file with dimensions of 768 × 768 pixels (approximately 150 × 150 m2), using the highest available resolution. This ensured that all castles lay completely within the image with a suitable margin of 15 to 30 percent. To establish a control dataset for negative cases, 1000 random coordinates across Bavaria were similarly sampled and processed. While this yields a 379:1000 ratio, it must be noted that castles, like the majority of archaeological feature types, occupy far less than one percent of the entire landscape; the distribution therefore only partly reflects the actual spatial prevalence of archaeological sites. The size of the negative sample was instead determined by the feasibility of manual inspection.
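To illustrate this tiling step, the sketch below converts a centre coordinate into an approximately 150 × 150 m bounding box that can then be requested from an imagery service at 768 × 768 pixels. The function name and the equirectangular approximation are illustrative and not taken verbatim from the project scripts.
```python
import math

def tile_bbox(lat: float, lon: float, size_m: float = 150.0):
    """Return (min_lon, min_lat, max_lon, max_lat) for a square tile centred on (lat, lon).

    Uses a simple equirectangular approximation, which is accurate to well under
    a metre at the ~150 m tile sizes used here.
    """
    half = size_m / 2.0
    m_per_deg_lat = 111_320.0                                   # metres per degree of latitude
    m_per_deg_lon = 111_320.0 * math.cos(math.radians(lat))     # shrinks with latitude
    dlat = half / m_per_deg_lat
    dlon = half / m_per_deg_lon
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)

# Example: a 150 m tile around an arbitrary coordinate in Bavaria
print(tile_bbox(49.0134, 12.1016))
```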
We then developed a set of scripts for Google Colab, a popular data science platform, that pair each of these images with a standardized text prompt and submit it to the selected foundation model. Finding an optimal prompt requires extensive experimentation; however, we empirically determined that the version in Box 1 yielded satisfactory results.
Box 1. Gemini 2.0-flash prompt for Bavarian castle detection, text highlights by authors
“You are analyzing a satellite image that may contain archaeological features from Germany, such as castles, ruins, or other ancient man-made structures.
Important Note: The majority of images will likely contain no archaeological features at all. Therefore avoid making detections unless there is evidence.
For each distinct object or feature only if confidently detected, return the following in JSON format:
- Object Type—classify the object (e.g., enclosure, hillfort, natural formation).
- Confidence Score—your estimated probability (%) that the classification is correct.
- Bounding Box—provide coordinates in the format [x_min, y_min, x_max, y_max]. Make sure the bounding box tightly encompasses the object.
- Reason—textually explain why you think the given object is found.
If no archaeological features are confidently detected, return an empty list”
This prompt emphasizes the avoidance of false positives by instructing the model to withhold detections unless supported by sufficient evidence. In particular, we observed that the use of terms such as “confidently” led the model to return only results with confidence scores of 80% or higher. It also specifies the expected output fields—object type, bounding-box coordinates, and classification confidence—formatted in JSON, a widely used data interchange format. This allows for scalable, automated post-processing across large datasets.
Requiring a justification (“Reason”) for each detection promotes explicit, stepwise reasoning from the model and often improves the overall quality—a technique commonly known as chain-of-thought prompting [28]. Beyond enhancing quality, these justifications also provide valuable insights into the model’s internal reasoning, which in turn informed our iterative prompt refinement. Examples of such outputs are presented below.
The model outputs were automatically processed to visualize detections by overlaying bounding boxes and the associated classification metadata on each image. This approach allowed for clear visual inspection of detected objects and facilitated subsequent analyses.
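A condensed sketch of this pairing-and-submission step is given below. It assumes the openai Python client, tiles converted to PNG and encoded as base64, and the prompt from Box 1; it is an illustration of the workflow rather than a verbatim excerpt from the released Colab scripts.
```python
import base64
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def detect_features(image_path: str, prompt: str, model: str = "gpt-4.1") -> list[dict]:
    """Send one image plus the standardized prompt and return the parsed JSON detections."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")

    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    text = response.choices[0].message.content.strip()
    # The prompt asks for a JSON list; remove optional Markdown code fences before parsing.
    text = text.strip("`").removeprefix("json").strip()
    return json.loads(text) if text else []

# detections = detect_features("castle_0001.png", CASTLE_PROMPT)  # CASTLE_PROMPT holds the Box 1 text
# Each returned item carries: Object Type, Confidence Score, Bounding Box, Reason.
```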

3.1.2. Quantitative Results

Initial experiments revealed that the predicted bounding boxes were relatively inaccurate, as discussed in more detail in the following section. Consequently, we employed a streamlined evaluation scheme to quantify model performance. All output fields except for the object type were disregarded, reducing the task to a binary image classification problem. Images were automatically categorized according to the following criteria: images from the positive dataset containing at least one detection were labeled as true positives (TP), whereas images with no detections were labeled as false negatives (FN). Conversely, in the negative dataset, images with detections were classified as false positives (FP), and those without detections as true negatives (TN).
Regarding object types, it is worth noting that expanding the prompt with semantically related terms (e.g., ‘ruin’, ‘enclosure’, ‘hillfort’) in addition to ‘castle’ generally improved results. The foundation models did not treat these terms as a fixed set, but rather as illustrative examples. Consequently, they produced not only the prompted keywords but also closely related variations, such as ‘castle tower’, as shown in the first example below. For evaluation purposes our software mapped all outputs containing any of these keywords to the ‘castle’ category.
Table 1 reports the statistical evaluation for each of the foundation models tested. In addition to precision and recall, we report the F1 score, the harmonic mean of precision and recall, instead of accuracy. The F1 score is particularly appropriate in the context of archaeological remote sensing [29,30], as it balances two critical concerns: false positives (i.e., low precision), which waste time and resources during the necessary manual validation process; and false negatives (i.e., low recall), which risk the loss of potentially irreplaceable cultural heritage.
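The evaluation therefore reduces to counting the four outcomes defined above (TP, FN, FP, TN) and computing precision, recall, and F1, including the keyword mapping to the ‘castle’ category; a minimal sketch is given below. The keyword list and function names are illustrative.
```python
CASTLE_KEYWORDS = ("castle", "ruin", "enclosure", "hillfort", "fort")  # illustrative set

def is_castle_detection(detections: list[dict]) -> bool:
    """True if at least one detection's object type contains a castle-related keyword."""
    return any(
        any(k in str(d.get("Object Type", "")).lower() for k in CASTLE_KEYWORDS)
        for d in detections
    )

def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 (the harmonic mean of precision and recall)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```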
Before interpreting the results, it is important to establish a performance baseline by considering the hypothetical outcomes of a human specialist. The Bing satellite imagery used in this study was obviously captured during the summer, when many castle sites are partially or entirely obscured by tree cover and dense vegetation. Furthermore, the dataset includes sites that are so heavily deteriorated that identifying them from a satellite perspective is extremely challenging (see example images below). Based on visual inspection, we estimate that even experienced human analysts would be able to detect only approximately two-thirds of the castles in the dataset (recall ≈66%). Detection rates would likely vary depending on the individual analyst’s expertise in interpreting satellite imagery.
For the control dataset of random landscape patches, it is reasonable to assume that optimal human performance would yield fewer than 1% false positives (i.e., precision ≈99%). However, this rate may rise with analyst fatigue, particularly when processing large areas over extended periods. Accounting for this, the hypothetical maximum F1 score achievable by a specialist is estimated at 80%.
The results for both models given in Table 1 are therefore very promising and indicate that they are indeed capable of identifying a large percentage of castles or castle ruins in satellite imagery without any prior training or configuration. GPT leads with a recall rate of 64%, which is very close to a hypothetical human analyst, while Gemini achieves a significantly lower result of only 38%.
The two models also differ significantly in their false positive rates. GPT shows a rate of around 10%, which is much higher than the expected 1% rate of human analysts. In contrast, Gemini’s false positive rate is only 0.2%, which more closely aligns with the human baseline. However, it is important to interpret this result with caution: the absolute number of false positives produced by Gemini is low (only two cases) which limits the statistical robustness of this finding. Further experiments with larger datasets would be necessary to confirm the reliability of this trend.

3.1.3. Qualitative Results and Preliminary Assessment

Table 2 and Table 3 present selected example outputs for GPT-4.1 and Gemini 2.0-flash, respectively. Both models exhibit certain shared characteristics. For instance, it appears that detections often do not encompass the entire structure of a castle but instead focus on distinctive architectural elements, such as a tower or keep (e.g., Burgruine Thanstein, Burgstall Wildstein, Burg Hartenstein). The precision of the generated bounding boxes varies considerably, ranging from relatively accurate delineations (e.g., Ruine Botenlauben) to significantly misaligned or imprecise boxes (e.g., Burg Rothenfels), and in some cases, the bounding boxes do not intersect with the actual castle remains at all (e.g., Burgruine Thanstein, Burgruine Natternberg).
Notably, the models on two occasions succeeded in detecting features partially obscured by vegetation (Burgstall Dietrichstein and Burgstall Tännesberg below). We cannot, however, be sure whether this suggests a degree of robustness to challenging image conditions or is merely coincidental, given the limitations in bounding box precision mentioned above.
Regarding the generated “Reason” texts, many appear to be drawn from stylistic patterns typical of archaeological descriptions and often capture the visual context with a reasonable degree of accuracy. However, further research with larger and more diverse datasets is necessary to identify consistent patterns and draw more definitive conclusions about model behavior and reliability in archaeological remote sensing tasks.
In the case of false positives, GPT-4.1 exhibited clear instances of so-called hallucinations (Figure 1). This phenomenon, common among foundation models [31], refers to evidently incorrect or implausible outputs such as detections in locations devoid of any archaeological features, whose underlying causes are not yet fully understood. For the two false positives produced by Gemini 2.0-flash (Figure 2), a more interpretable rationale can be proposed: it is plausible that the model was misled by modern housing structures in the imagery, which may share superficial spatial or geometric characteristics with archaeological remains.
We only present the result images here and omit object types, confidence scores, and reason texts due to their hallucinatory quality. Interested readers can, however, review all outputs in detail in the dataset repository described in Section 5 below.

3.2. Experiment 2: Detection of Angkorian Temples in Satellite Imagery

3.2.1. Methodology

To assess the generalizability of the models, we conducted a second benchmark using a thematically and geographically distinct dataset: Angkorian temple structures in Cambodia. This dataset includes 100 temple coordinates, sourced from previous archaeological studies [32,33]. Unlike the castles used in the previous experiment, the majority of these temples are not widely documented online or in academic literature in ways that would be accessible to foundation models during training. We therefore hypothesize that most of them can be considered “previously unseen” by the models.
In addition, visual inspection revealed that fewer than 10% of them were listed on Google Maps, which may have served as a data source for foundation model training.
As before, each location was processed into a 480 × 480-pixel image (approximately 140 × 140 m2) using the same Bing download method. Again, a negative control sample of 1000 random coordinates from across Cambodia was also created and processed identically. The same scripts were used for model evaluation but the prompts were modified, as we found empirically that this yielded better results. The prompt used here is given in Box 2 below.
It is important to note that we made significant modifications to the prompt regarding the objects to be detected. Specifically, we included not only temple but also reservoir and moat. This adjustment was based on empirical observations showing that the models often detected not the temple itself, but associated structures commonly found with Angkorian temples, namely temple moats and reservoirs. Aside from these changes, both the prompt structure and the evaluation strategy remain consistent with those described in Section 3.1.
Box 2. GPT-4.1 prompt for Angkorian temple detection, text highlights by authors
“You are analyzing a satellite image that may contain archaeological features from Cambodia, such as ancient Buddhist temples or other ancient man-made structures.
For each distinct object or feature, return the following in JSON format:
- Object Type—classify the object (e.g., temple, reservoir, moat, unknown).
- Confidence Score—your estimated probability (%) that the classification is correct.
- Bounding Box—provide coordinates in the format [x_min, y_min, x_max, y_max]. Make sure the bounding box tightly encompasses the object.
- Reason—textually explain why you think the given object is found.”

3.2.2. Quantitative Results

Table 4 shows that GPT-4.1 again emerged as the best-performing model, successfully detecting 57% of the temple sites. However, this rate was accompanied by a considerable number of false positives (9.8%). In contrast, Gemini 2.0-flash identified only approximately one third of the temples (32%), but it achieved a substantially lower false positive rate of just 2.3%. Overall, these results confirm a trend similar to that observed in the castle detection experiment discussed earlier.
In general, when scanning unknown areas for previously undocumented temples, archaeologists face a trade-off between higher detection rates and increased false positives. This dilemma is particularly evident here with GPT-4.1, whose false positive rate of 9.8% raises concerns about its practical applicability in large-scale surveys, for which Gemini 2.0-flash appears better suited.
Additionally, the F1 scores in this experiment were generally lower than those in experiment 1, ranging between 41 and 45 percent. This reduction in performance is possibly due to the more degraded and fragmented condition of the Angkorian temples (in comparison with the well-preserved Bavarian castles). These structures, having endured the harsh tropical climate and recent military conflicts, present an exceptionally challenging context for automated detection.
This also explains why we consider the obtained F1 scores—although all below 50%—to represent strong results. As with any automated approach, they must be interpreted relative to a human expert baseline. In our earlier research [34], conducted in a similar context, we found that experienced researchers achieved F1 scores between 56% and 88%, while students reached approximately 38%. In this light, both models already perform within the range of student analysts, but without the susceptibility to fatigue effects, as discussed above.

3.2.3. Qualitative Results and Preliminary Assessment

In addition to the statistical evaluation presented above, we report several observations based on a visual inspection of the AI-generated results (see Table 5 and Table 6 below). As in the previous section, these tables provide examples of true positives and false negatives for both models.
As above, it is evident that the predicted bounding boxes are frequently imprecise—often either too small or only partially overlapping with the detected object. However, GPT-4.1 tended to produce more accurate bounding boxes overall. When comparing the correctly identified temples, we observed that the two models often detected different true positives; there was considerable variation in both directions, with temples identified by one model sometimes missed by the other. Similarly, the labeling of detected features differed between the models. Gemini classified 51 features as “reservoir” and 32 as “temple,” whereas GPT labeled 24 as “temple,” 21 as “moat,” and 12 as “reservoir.” These labels are consistent with the reason texts produced by each model. As with the castle detections, the vocabulary used appears to originate from scholarly literature on Angkorian archaeology. Notably, while the term “moat” appears in Gemini’s reason texts, the model did not label any detection as a moat.
A useful perspective is that archaeological datasets are inherently incomplete. Archaeologists can never assume that their survey or analytical results are exhaustive, given the fragmentary nature of the material record. From this standpoint, even if an AI model fails to detect a substantial portion of known features, for instance, missing 43% of temples, the identification of any previously unrecorded structures still constitutes a significant contribution. The value lies not necessarily in comprehensive detection, but in expanding the known corpus and demonstrating new methodological possibilities.
In addition to detection, the interpretive rationale provided by the AI models can offer valuable support to archaeological analysis. While these explanations are not always accurate, they help guide the researcher’s eye and prompt critical evaluation of identified features. In this sense, the model functions much like a human collaborator: even when its reasoning is flawed, it contributes to the interpretive process by surfacing new hypotheses and framing points of discussion. This mirrors the way archaeologists often debate the significance of features in the field or lab, and suggests that AI can play a productive role in that ongoing interpretive dialogue.
Regarding the false positives (or “hallucinations”) produced by the two models (see Figure 3 and Figure 4 below), we were unable to identify consistent reasons or patterns that might explain the misclassifications. One exception involves a set of relatively small detections by Gemini 2.0-flash, with diameters of less than 10 to 15 m on the ground. These are generally too small to represent Angkorian temples and could potentially be filtered out during post-processing, which would likely improve overall detection quality.
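Such a size-based filter is straightforward to implement once the ground sampling distance of a tile is known; the sketch below drops detections whose bounding box is smaller than a chosen ground extent. The 15 m threshold follows the observation above, while the resolution value and key names are assumptions matching the prompt format used in this study.
```python
def filter_small_detections(detections: list[dict],
                            metres_per_pixel: float = 0.3,
                            min_extent_m: float = 15.0) -> list[dict]:
    """Remove detections whose bounding box is smaller than min_extent_m on both axes."""
    kept = []
    for d in detections:
        x_min, y_min, x_max, y_max = d["Bounding Box"]
        width_m = (x_max - x_min) * metres_per_pixel
        height_m = (y_max - y_min) * metres_per_pixel
        if max(width_m, height_m) >= min_extent_m:
            kept.append(d)
    return kept
```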
In conclusion, our findings demonstrate that foundation models are capable of detecting Angkorian archaeological structures in satellite imagery. The performance observed here is broadly consistent with our earlier research on identifying Angkorian temple reservoirs [34], where F1 scores ranged from 15% to 45%, depending on the region. A key distinction, however, lies in the level of human effort required: whereas the earlier study took several months to complete, the present experiment produced comparable results within a matter of days—representing a reduction in manual effort by several orders of magnitude.

3.3. Experiment 3: Finding English Hillforts in LiDAR Imagery

3.3.1. Methodology

Aerial LiDAR has become an increasingly important tool in archaeological research, particularly for identifying features in forested or otherwise visually obscured landscapes.
LiDAR data is typically provided either as raw point clouds or in raster formats such as Digital Elevation Models (DEMs). When working with visual foundation models, which are typically trained on large collections of online images, it is advantageous for input imagery to resemble the visual domain these models were exposed to during training. In this context, hillshade visualizations appear to be the most prevalent form of LiDAR-derived imagery in both academic literature and online platforms.
Accordingly, we used the RVT library [35,36] to generate hillshade visualizations from the raw LiDAR data in our experiments. Future research may investigate whether alternative visualization techniques can yield improved detection results when employing foundation models.
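For readers without access to RVT, a basic hillshade can also be computed directly from a DEM array; the sketch below implements the common hillshade illumination formula with NumPy (default sun azimuth 315°, elevation 45°). It is a generic substitute for the RVT visualizations used in our experiments, not the RVT code itself, and exact conventions differ slightly between tools.
```python
import numpy as np

def hillshade(dem: np.ndarray, cellsize: float = 1.0,
              azimuth_deg: float = 315.0, altitude_deg: float = 45.0) -> np.ndarray:
    """Simple hillshade of a DEM, returned as an 8-bit greyscale array."""
    az = np.radians(360.0 - azimuth_deg + 90.0)   # convert compass azimuth to math angle
    alt = np.radians(altitude_deg)
    dy, dx = np.gradient(dem, cellsize)           # elevation gradients per metre
    slope = np.arctan(np.hypot(dx, dy))
    aspect = np.arctan2(-dx, dy)
    shaded = (np.sin(alt) * np.cos(slope)
              + np.cos(alt) * np.sin(slope) * np.cos(az - aspect))
    return (255 * np.clip(shaded, 0.0, 1.0)).astype(np.uint8)
```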
The LiDAR dataset of English hillforts used in this experiment was derived from an earlier project [37], which in turn was based on the coordinates available in the Atlas of Hillforts compiled by Lock and Ralston [38]. From the 1136 English hillforts with LiDAR coverage documented in the Atlas, a random subset of 300 sites was selected for further analysis.
In addition to the hillfort sites, 1000 random land patches from England were also sampled to serve as negative examples. LiDAR data with 1 m resolution for both hillfort and non-hillfort locations was obtained from the UK Department for Environment, Food and Rural Affairs (DEFRA) website (see Section 5 below). For each location, hillshade visualizations were generated at a resolution of 768 × 768 pixels (or 768 × 768 m2).
The same data processing scripts as mentioned above were employed in this study. The prompt text used for model inference was slightly modified, as shown in Box 3.
Box 3. GPT-4.1 and Gemini 2.0-flash prompt for hillfort detection in LiDAR, text highlights by authors
“You are analyzing a LiDAR image that may contain archaeological features from England, such as enclosures, hillforts, or other ancient man-made structures.
For each distinct object or feature only if confidently detected, return the following in JSON format:
- Object Type—classify the object (e.g., enclosure, hillfort, natural formation, unknown).
- Confidence Score—your estimated probability (%) that the classification is correct.
- Bounding Box—provide coordinates in the format [x_min, y_min, x_max, y_max]. Make sure the bounding box tightly encompasses the object.
- Reason—textually explain why you think the given object is found.
If no archaeological features are confidently detected, return an empty list.”

3.3.2. Quantitative Results

The evaluation procedure remained unchanged, and Table 7 shows that, in principle, both models were capable of detecting hillforts in LiDAR imagery. GPT-4.1 achieved a notably high detection rate, identifying 95% of the sampled hillforts. However, this high sensitivity came at the cost of a very large number of false positives—813 out of 1000 random land patches were incorrectly classified as hillforts. This result implies an implausibly high prevalence of hillforts across the English landscape, rendering GPT-4.1 currently unsuitable for practical applications in hillfort detection.
In contrast, Gemini correctly identified approximately 50% of the hillforts in the sample. While its detection rate was lower, it exhibited a significantly better false positive rate, misclassifying only 66 out of 1000 random patches, which again makes Gemini more appropriate for use in real-world archaeological workflows, particularly when validating features across broader landscape units such as districts or entire regions. We speculate that Gemini may have been exposed to a greater volume of LiDAR data during its training than GPT, which could account for its comparatively restrained and more precise performance.
As noted in the previous sections, the reported F1 scores should be regarded as strong results, especially when compared to realistic human expert baselines, which we estimate here at F1 values below 65%. This discrepancy largely arises because the original dataset from the Atlas of Hillforts includes many hillforts that are undetectable in LiDAR data, either due to advanced decay or because they have been built over by modern structures [37].

3.3.3. Qualitative Results and Preliminary Assessment

The examples presented in Table 8 and Table 9 below illustrate that bounding boxes are frequently imprecise, consistent with the results of the previous experiments. Both models classified their detections as either “hillfort” or “enclosure”, two types of defensive sites that can be difficult to distinguish, even for experts in the field. GPT identified 121 enclosures and 165 hillforts, while Gemini detected 96 enclosures and 53 hillforts.
Although the reasoning texts provided by the models did not offer any clear justification for these classifications, they occasionally included surprisingly accurate image descriptions. Notably, both GPT-4.1 and Gemini 2.0-flash repeatedly referenced concentric wall structures. A particularly illustrative example is site EN1597, which corresponds to the Barwick-in-Elmet hillfort in Yorkshire. GPT-4.1 reported two detections at this location, and indeed, the Atlas of Hillforts confirms that this Iron Age site later became the location of a medieval motte-and-bailey castle—clearly visible in the southern portion of the LiDAR image. In contrast, the bounding box detections continue to appear misaligned and do not correspond well with the known locations of the two sites.
As with the previous experiments, the false positives (see examples in Figure 5 and Figure 6) do not exhibit any consistent or readily identifiable characteristics. Further investigation is necessary to better understand the underlying causes of these misclassifications and to determine whether they stem from systematic model behavior, data-related artifacts, or contextual ambiguities in the imagery.

3.4. Experiment 4: Delineating the Dimensions of Archaeological Sites in LiDAR

3.4.1. Methodology

The remaining two experiments employ a different foundation model than the previous ones—specifically, the Segment Anything Model (SAM) in its version v2.1 [5]. Unlike the earlier experiments, which focused primarily on identifying rough bounding boxes for archaeological features, the aim here is to obtain precise outlines or boundaries of the features of interest. SAM is designed to segment or delineate all elements within an image without requiring prior training or semantic understanding of the image content.
For this experiment we used all 1136 images of hillforts in England from our earlier project [37], again at a 768 × 768-pixel resolution (or 768 × 768 m2). As in the previous experiment, we generated hillshade visualizations from the LiDAR data and presented these images to SAM.
SAM does not accept textual prompts; instead, it operates using so-called point prompts: one or more coordinates within an image that guide the model in performing segmentation. In this case, we used the central coordinate of each image, which approximately corresponds to the centroid of the hillfort as recorded in the Atlas of Hillforts.
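A minimal sketch of this point-prompted segmentation is given below. It assumes the sam2 Python package and its SAM2ImagePredictor interface; the checkpoint identifier and file name are placeholders rather than values taken from our scripts.
```python
import numpy as np
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load a pretrained SAM 2.1 image predictor (checkpoint name is a placeholder).
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2.1-hiera-large")

image = np.array(Image.open("EN0001_hillshade.png").convert("RGB"))
predictor.set_image(image)

# Single foreground point prompt at the image centre, i.e. the recorded hillfort centroid.
h, w = image.shape[:2]
masks, scores, _ = predictor.predict(
    point_coords=np.array([[w // 2, h // 2]]),
    point_labels=np.array([1]),          # 1 marks a foreground point
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]     # keep the highest-scoring candidate mask
```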

3.4.2. Results and Preliminary Assessment

Figure 7 below shows that the results of this experiment were very promising. We estimate that approximately 60% of the 1136 hillforts were delineated correctly. This is a notable outcome, especially considering that even human experts often struggle to delineate hillforts accurately. Such difficulties arise due to the extent of erosion, destruction, or modern construction over the archaeological features.
Moreover, there exists a degree of ambiguity in archaeological interpretation, with the precise boundaries of many hillforts still a matter of scholarly debate. It should also be acknowledged that SAM occasionally produces errors. Further analysis, beyond the scope of this initial study, is required to understand the causes of these inaccuracies. Common issues include failure to detect outer ring walls (EN0140, for example) or the generation of irregularly shaped segmentations (examples are EN0019 and EN0034).
Despite these limitations, SAM demonstrates significant potential for archaeological applications. It may be employed for the semi-automatic documentation of archaeological sites, or as a valuable tool for generating training data for more conventional neural networks, thereby substantially reducing the effort needed for manual annotation in remote sensing projects. For a full list of results see the accompanying material (Section 5 below).

3.5. Experiment 5: Finding Potsherds in Drone (UAV) Imagery

In this section we advance to the use of drone imagery, a technique increasingly employed in archaeological research due to its ability to reveal surface-level features across broad areas. Potsherds are commonly interpreted as indicators of past human activity, owing to their durability and ubiquity in archaeological contexts. Zones with high concentrations of sherds on the ground indicate higher human activity and can be prioritized for further archaeological prospection or excavation.
Automation of the detection process has long been a desirable goal, especially for large archaeological sites where manual fieldwalking is impractical due to the scale involved. Recent efforts in this area include those by Orengo and García-Molsosa [30], as well as Agapiou et al. [39], who have explored the application of convolutional neural networks (CNNs). However, such approaches face several notable limitations. Chief among these is the labor-intensive requirement to curate large annotated datasets (often comprising thousands of training examples of sherds), as well as the limited generalizability of CNN-based models across different sites due to variations in vegetation cover or illumination conditions.
Here we explore the potential of the Segment Anything Model (SAM 2.1) as an alternative approach to overcome these limitations. As a foundation model, SAM has been trained on a huge number of images and hence promises robustness to variations in image characteristics such as shadows, white balance, and illumination. Our aim is to evaluate to what extent this model can reliably detect potsherds in drone imagery.

3.5.1. Methodology

We acquired drone video footage in AVI format from an archaeological site in Greece (Figure 8, site location not disclosed for site protection reasons). The footage was decomposed into individual frames, which were subsequently processed using SAM. The model generates segmentations of virtually all visible items in a given scene, not only potsherds (Figure 9, left). Therefore, additional post-processing steps were required to isolate potsherds from other surface objects.
First, we applied a filtering step based on object area, with thresholds empirically determined for the specific site. Objects smaller than 1 cm2 or larger than 500 cm2 were excluded, effectively removing most non-relevant features.
Second, based on the color distribution of potsherds, we filtered objects with deviating color ranges. To achieve this, the images were first converted into the HSV (hue, saturation, value) color space, which has been shown to be more effective for color filtering than Red-Green-Blue images [40]. Objects with hue values falling outside an empirically defined range for potsherds were subsequently removed.
Third, we applied non-maximum suppression to remove overlapping segmentations, retaining only the segment with the maximum area within each overlapping group. The combined effect of these steps yielded high-precision delineation of potsherds, as illustrated in Figure 9 above (right) and Figure 10 below.
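The three post-processing steps can be condensed into a short filtering routine, sketched below. It assumes SAM’s automatic mask output (a list of dictionaries with “segmentation” and “area” entries) and OpenCV for the HSV conversion; the area and hue thresholds stand in for the site-specific values described above and would need recalibration for other sites.
```python
import cv2
import numpy as np

def filter_sherd_masks(frame_bgr: np.ndarray, masks: list[dict], cm2_per_px: float,
                       hue_range=(5, 25), min_area_cm2=1.0, max_area_cm2=500.0,
                       iou_thresh=0.5) -> list[dict]:
    """Keep masks that look like potsherds: plausible area, sherd-like hue, non-overlapping."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    kept = []
    for m in masks:
        area_cm2 = m["area"] * cm2_per_px
        if not (min_area_cm2 <= area_cm2 <= max_area_cm2):
            continue                                   # step 1: area filter
        mean_hue = hsv[..., 0][m["segmentation"]].mean()
        if not (hue_range[0] <= mean_hue <= hue_range[1]):
            continue                                   # step 2: colour filter in HSV space
        kept.append(m)

    # Step 3: greedy non-maximum suppression, keeping the largest mask per overlap group.
    kept.sort(key=lambda m: m["area"], reverse=True)
    final = []
    for m in kept:
        seg = m["segmentation"]
        overlaps = any(
            np.logical_and(seg, f["segmentation"]).sum()
            / np.logical_or(seg, f["segmentation"]).sum() > iou_thresh
            for f in final
        )
        if not overlaps:
            final.append(m)
    return final
```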

3.5.2. Results and Preliminary Assessment

A quantitative evaluation of performance is challenging here due to the extraordinarily large number of potsherds at the test site. Nevertheless, based on visual inspection, we estimate that approximately 80% of the sherds were successfully detected, with very few false positives (Figure 10). Annotated video material supporting this estimate is available in the accompanying material to this article (Section 5) and we encourage readers to validate our findings or test the workflow on their own datasets, including drone footage or orthophotos.
In summary, SAM demonstrated effectiveness in delineating and quantifying potsherds from drone imagery, reducing the manual effort typically required from weeks to merely hours compared with the projects cited above. To process footage from a given site, users need only define appropriate filter parameters—such as the hue range corresponding to the pottery types present at the site. In our case, this calibration process was completed in under two hours. Nevertheless, SAM remains computationally demanding; processing a 10-minute video may require up to five hours. A promising alternative is FastSAM [41], which seeks to substantially reduce processing time while maintaining comparable segmentation performance.

4. Discussion and Conclusions

In this study, we examined the efficacy of visual foundation models for detecting archaeological features in remote sensing data using a zero-shot approach—that is, without any fine-tuning specific to the research domain. Our key finding is that, across five experiments involving different types of remote sensing data and archaeological site types, the models demonstrated detection performance comparable to that of human experts and previously established automated detection methods.
For archaeologists, a major advantage of this approach lies in the elimination of the need for training data, resulting in a substantially lower entry barrier. In addition, the required human effort is measured in hours rather than months, making the method highly accessible. While our results were consistent across three foundation models and three types of remote sensing data, it remains to be seen whether this trend generalizes to other modalities or archaeological contexts. We therefore encourage the research community to replicate and expand upon our work. To support such efforts, we provide a suite of scripts designed to simplify access to the tested foundation models and streamline the experimental workflow.
If future experiments confirm the observed trend, this approach has the potential to democratize archaeological remote sensing. Researchers without a background in artificial intelligence or software development could effectively leverage foundation models by following the workflow we propose: First, a small sample can be tested using versions of ChatGPT or Gemini available on the internet along with prompt variations as demonstrated in this study. Second, if initial results appear promising, we recommend extending the study to a larger dataset that includes both positive and negative cases to evaluate false positive rates. This can be facilitated using the scripts provided in this article. Third, for landscape-scale analyses, the remote sensing data can be partitioned into a spatial grid, with each grid cell processed automatically using the same scripts.
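For the third step, partitioning a study region into grid cells requires only a few lines of code; the sketch below tiles a geographic bounding box into cells of a chosen ground size so that each cell can be downloaded and submitted with the same scripts. The cell size and helper name are illustrative.
```python
import math

def grid_cells(min_lon, min_lat, max_lon, max_lat, cell_m: float = 150.0):
    """Yield (min_lon, min_lat, max_lon, max_lat) tuples covering the area in square cells."""
    m_per_deg_lat = 111_320.0
    lat = min_lat
    while lat < max_lat:
        m_per_deg_lon = 111_320.0 * math.cos(math.radians(lat))
        dlat, dlon = cell_m / m_per_deg_lat, cell_m / m_per_deg_lon
        lon = min_lon
        while lon < max_lon:
            yield (lon, lat, min(lon + dlon, max_lon), min(lat + dlat, max_lat))
            lon += dlon
        lat += dlat
```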
Only if this workflow fails to produce satisfactory results should researchers consider fine-tuning foundation models or resorting to other automated detection methods, both of which entail substantially greater effort and expertise.
The foundation models evaluated in this study are subject to frequent updates, often on a monthly or quarterly basis, which incorporate revised training data and model improvements. One key observation from our experiments is that the recency of a model’s release appears to be a critical determinant of performance. For example, earlier versions of Gemini were unable to produce usable results for LiDAR data, whereas version 2.0 (released in February 2025) demonstrated markedly improved outcomes. As the training data used for these models are typically undisclosed, we can only speculate that recent versions may have incorporated a greater volume of remote sensing data—particularly LiDAR—into their training corpora. This could also plausibly explain the significant performance differences observed between GPT-4.1 and Gemini 2.0-flash in experiment 3 above.
Despite the promising results presented in this study, several open issues remain that require further investigation. First, foundation models typically demand substantial memory and computational resources, which are generally only available via cloud-based environments rather than at the user’s local site. This introduces external costs, although these are expected to decrease over time due to growing competition in the cloud services sector. In our project, the total cloud computing expenditure amounted to approximately USD 100. Furthermore, uploading data to the cloud raises privacy concerns, as many service providers reserve the right to reuse submitted data, which may not align with research ethics or data protection regulations.
Second, there are unresolved challenges intrinsic to the models themselves. As demonstrated, the quality of bounding box outputs is often inadequate—objects may be correctly identified, but precise spatial localization remains elusive. Additionally, the phenomenon of hallucinations, wherein the model produces spurious or implausible detections, underscores the necessity for manual verification, at least on a sampling basis, to ensure result reliability.
In addition to these limitations, a further critical direction for future research concerns the applicability of foundation models to non-visual remote sensing modalities such as multispectral, thermal, or seismic data. In our experiments, meaningful results were only achieved when the data was either natively within the visual color spectrum or could be translated into an equivalent visual representation such as greyscale hillshades derived from LiDAR. This represents a significant constraint on current applicability. Addressing this limitation will be essential to broaden the usefulness of foundation models across the full spectrum of remote sensing data types.

5. Data Availability

The source code developed for this project, along with download links for the datasets used and outputs generated, is publicly available on GitHub at https://github.com/juergenlandauer/FoundationModelsArchaeology accessed on 3 October 2025. Detailed usage instructions are provided within the repository. The source code is released under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). The sources of the datasets used in this study are listed in Table 10 below.

Author Contributions

Conceptualization: J.L. and S.K.; Data curation: J.L. and S.K.; Formal analysis: J.L. and S.K.; Methodology: J.L. and S.K.; Project administration: J.L.; Resources: J.L. and S.K.; Software: J.L.; Validation: J.L. and S.K.; Visualization: J.L.; Roles/Writing—original draft: J.L.; and Writing—review & editing: J.L. and S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; Arx, S.V.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E. On the Opportunities and Risks of Foundation Models. arxiv 2021, arXiv:2108.07258. [Google Scholar] [CrossRef]
  2. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 Technical Report. arxiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  3. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Austria, Vienna, 18–24 July 2021; Volume 139. [Google Scholar] [CrossRef]
  4. Google DeepMind. Gemini Technical Overview. 2025. Available online: https://deepmind.google/technologies/gemini/ (accessed on 6 August 2025).
  5. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 3992–4003. [Google Scholar] [CrossRef]
  6. Li, W.; Lee, H.; Wang, S.; Hsu, C.; Arundel, S.T. Assessment of a new GeoAI foundation model for flood inundation mapping. In Proceedings of the 6th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery, Hamburg, Germany, 13 November 2023. [Google Scholar] [CrossRef]
  7. Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Zhou, J. RemoteCLIP: A Vision Language Foundation Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2023, 62, 1–16. [Google Scholar] [CrossRef]
  8. Mai, G.; Huang, W.; Sun, J.; Song, S.; Mishra, D.; Liu, N.; Gao, S.; Liu, T.; Cong, G.; Hu, Y.; et al. On the Opportunities and Challenges of Foundation Models for Geospatial Artificial Intelligence. arxiv 2023, arXiv:2304.06798. [Google Scholar] [CrossRef]
  9. Xiao, A.; Xuan, W.; Wang, J.; Huang, J.; Tao, D.; Lu, S.; Yokoya, N. Foundation models for remote sensing and earth observation: A survey. IEEE Geosci. Remote Sens. Mag. early access. 2025. [Google Scholar] [CrossRef]
  10. Lu, S.; Guo, J.; Zimmer-Dauphinee, J.R.; Nieusma, J.M.; Wang, X.; Wernke, S.A.; Huo, Y. Vision foundation models in remote sensing: A survey. arxiv 2025. [Google Scholar] [CrossRef]
  11. Huo, C.; Chen, K.; Zhang, S.; Wang, Z.; Yan, H.; Shen, J.; Hong, Y.; Qi, G.; Fang, H.; Wang, Z. When Remote Sensing Meets Foundation Model: A Survey and Beyond. Remote Sens. 2025, 17, 179. [Google Scholar] [CrossRef]
  12. Tao, L.; Zhang, H.; Jing, H.; Liu, Y.; Yan, D.; Wei, G.; Xue, X. Advancements in Vision–Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques. Remote Sens. 2025, 17, 162. [Google Scholar] [CrossRef]
  13. Guo, J.; Zimmer-Dauphinee, J.; Nieusma, J.M.; Lu, S.; Liu, Q.; Deng, R.; Cui, C.; Yue, J.; Lin, Y.; Yao, T.; et al. DeepAndes: A Self-Supervised Vision Foundation Model for Multi-Spectral Remote Sensing Imagery of the Andes. arxiv 2025, arXiv:2504.20303. [Google Scholar] [CrossRef]
  14. Wernke, S.A.; Van Valkenburgh, P.; Zimmer-Dauphinee, J.; Whitlock, B.; Spence Morrow, G.; Smith, R.; Smit, D.; Ortega, G.R.; Jara, K.R.; Plekhov, D.; et al. Large-scale, collaborative imagery survey in archaeology: The Geospatial Platform for Andean Culture, History and Archaeology (GeoPA-CHA). Antiquity 2024, 98, 155–171. [Google Scholar] [CrossRef]
  15. Zimmer-Dauphinee, J.; VanValkenburgh, P.; Wernke, S.A. Eyes of the machine: AI-assisted satellite archaeological survey in the Andes. Antiquity 2024, 98, 245–259. [Google Scholar] [CrossRef]
  16. Canedo, D.; Hipólito, J.; Fonte, J.; Dias, R.; Pereiro, T.D.; Georgieva, P.; Gonçalves-Seco, L.; Vázquez, M.; Pires, N.; Fábrega-Álvarez, P.; et al. The Synergy between Artificial Intelligence, Remote Sensing, and Archaeological Fieldwork Validation. Remote Sens. 2024, 16, 1933. [Google Scholar] [CrossRef]
  17. Palatucci, M.; Pomerleau, D.; Hinton, G.; Mitchell, T. Zero-shot Learning with Semantic Output Codes. In Proceedings of the Advances in Neural Information Processing Systems 22—Proceedings of the 2009 Conference, Vancouver, BC, Canada, 7–10 December 2009; Volume 22, pp. 1410–1418. [Google Scholar]
  18. Abate, N.; Visone, F.; Sileo, M.; Danese, M.; Minervino Amodio, A.; Lasaponara, R.; Masini, N. Potential Impact of Using ChatGPT-3.5 in the Theoretical and Practical Multi-Level Approach to Open-Source Remote Sensing Archaeology, Preliminary Considerations. Heritage 2023, 6, 7640–7659. [Google Scholar] [CrossRef]
  19. Ciccone, G. ChatGPT as a Digital Assistant for Archaeology: Insights from the Smart Anomaly Detection Assistant Development. Heritage 2024, 7, 5428–5445. [Google Scholar] [CrossRef]
  20. Arnold, T.B.; Tilton, L. Explainable Search and Discovery of Visual Cultural Heritage Collections with Multimodal Large Language Models. arxiv 2024, arXiv:2411.04663. [Google Scholar] [CrossRef]
  21. Steiner, A.; Pinto, A.S.; Tschannen, M.; Keysers, D.; Wang, X.; Bitton, Y.; Gritsenko, A.; Minderer, M.; Sherbondy, A.; Long, S.; et al. PaliGemma 2: A family of versatile VLMs for transfer. arxiv 2024, arXiv:2412.03555. [Google Scholar] [CrossRef]
  22. Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5 VL Technical Report. arxiv 2025, arXiv:2502.13923. [Google Scholar] [CrossRef]
  23. Wu, Z.; Chen, X.; Pan, Z.; Liu, X.; Liu, W.; Dai, D.; Gao, H.; Ma, Y.; Wu, C.; Wang, B.; et al. DeepSeek VL2: Mixture of Experts vision language models for advanced multimodal understanding. arxiv 2024, arXiv:2412.10302. [Google Scholar] [CrossRef]
  24. Xiao, B.; Wu, H.; Xu, W.; Dai, X.; Hu, H.; Lu, Y.; Zeng, M.; Liu, C.; Yuan, L. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 4818–4829. [Google Scholar]
  25. Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. YOLO-World: Real-Time Open-Vocabulary Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16901–16911. [Google Scholar] [CrossRef]
  26. Microsoft. Bing Satellite Imagery. Bing Maps. 2025. Available online: https://www.bing.com/maps (accessed on 6 October 2025).
  27. Lesiv, M.; See, L.; Laso Bayas, J.C.; Sturn, T.; Schepaschenko, D.; Karner, M.; Moorthy, I.; McCallum, I.; Fritz, S. Characterizing the spatial and temporal availability of very high resolution satellite imagery in Google Earth and Microsoft Bing Maps as a source of reference data. Land 2018, 7, 118. [Google Scholar] [CrossRef]
  28. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar] [CrossRef]
  29. Cowley, D.C. (Ed.) Remote sensing for archaeology and heritage management—Site discovery, interpretation and registration. In Remote Sensing for Archaeological Heritage Management; Archaeolingua: Budapest, Hungary, 2011. [Google Scholar]
  30. Orengo, H.A.; Garcia-Molsosa, A. A brave new world for archaeological survey: Automated machine learning-based potsherd detection using high-resolution drone imagery. J. Archaeol. Sci. 2021, 132, 105432. [Google Scholar] [CrossRef]
  31. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2023, 43, 1–55. [Google Scholar] [CrossRef]
  32. Klassen, S.; Weed, J.; Evans, D. Semi-supervised machine learning approaches for predicting the chronology of archaeological sites: A case study of temples from medieval Angkor, Cambodia. PLoS ONE 2018, 13, e0205649. [Google Scholar] [CrossRef]
  33. Klassen, S.; Carter, A.K.; Evans, D.H.; Ortman, S.; Stark, M.T.; Loyless, A.A.; Polkinghorne, M.; Heng, P.; Hill, M.; Wijker, P.; et al. Diachronic modeling of the population within the medieval Greater Angkor Region settlement complex. Sci. Adv. 2021, 7, eabf8441. [Google Scholar] [CrossRef]
  34. Landauer, J.; Klassen, S.; Wijker, A.P.; van der Kroon, J.; Jaszkowski, A.; der Vaart, W.B.V.-V. Beyond the Greater Angkor Region: Automatic large-scale mapping of Angkorian-period reservoirs in satellite imagery using deep learning. PLoS ONE 2025, 20, e0320452. [Google Scholar] [CrossRef]
  35. Kokalj, Ž.; Hesse, R. Airborne Laser Scanning Raster Data Visualization—A Guide to Good Practice; ZRC SAZU: Ljubljana, Slovenia, 2017; Available online: https://iaps.zrc-sazu.si/en/publikacije/airborne-laser-scanning-raster-data-visualization-1#v (accessed on 9 May 2025).
  36. Štular, B.; Kokalj, Ž.; Oštir, K.; Nuninger, L. Visualisation of LiDAR-derived relief models for detection of archaeological features. J. Archaeol. Sci. 2012, 39, 3354–3360. [Google Scholar] [CrossRef]
  37. Landauer, J.; Maddison, S.; Fontana, G.; Posluschny, A.G. Archaeological site detection: Latest results from a deep learning-based Europe-wide hillfort search. J. Comput. Appl. Archaeol. 2025, 8, 42–58. [Google Scholar] [CrossRef]
  38. Lock, G.; Ralston, I. Atlas of Hillforts of Britain and Ireland. 2024. Available online: https://hillforts.arch.ox.ac.uk (accessed on 1 August 2025).
  39. Agapiou, A.; Vionis, A.; Papantoniou, G. Detection of Archaeological Surface Ceramics Using Deep Learning Image-Based Methods and Very High-Resolution UAV Imageries. Land 2021, 10, 1365. [Google Scholar] [CrossRef]
  40. Gonzalez, R.C.; Woods, R.E. Digital Image Processing, 2nd ed.; Prentice Hall: Upper Saddle River, NJ, USA, 2002. [Google Scholar]
  41. Zhao, X.; Ding, W.; An, Y.; Du, Y.; Yu, T.; Li, M.; Tang, M.; Wang, J. Fast Segment Anything. arXiv 2023, arXiv:2306.12156. [Google Scholar] [CrossRef]
Figure 1. False positives from GPT-4.1 (bounding boxes in yellow).
Figure 2. False positives from Gemini 2.0-flash (bounding boxes in yellow).
Figure 3. False positives from GPT-4.1 (bounding boxes in yellow).
Figure 4. False positives from Gemini 2.0-flash (bounding boxes in yellow).
Figure 5. False positives from GPT-4.1 (bounding boxes in yellow).
Figure 6. False positives from Gemini 2.0-flash (bounding boxes in yellow).
Figure 7. Examples of English hillforts in LiDAR; yellow overlay produced by SAM 2.1. All images have a ground size of 768 m × 768 m and are north-facing. LiDAR © DEFRA; ID numbers starting with EN are taken from the Atlas of Hillforts.
Figure 8. Example drone image: a challenging scene with a high number of potsherds and shadows. © all images in this section: authors.
Figure 9. (Left) SAM 2.1 output in blue; all items on the ground are segmented. (Right) Only potsherds remain after filtering.
Figure 10. Potsherd detections shown with bounding boxes (blue). Occasional false positives occur, e.g., large stones at the top edge of the image, and some potsherds were missed, for example in the bottom-right corner.
Table 1. Statistics for Bavarian castle experiment.
Model | TP | FN | TN | FP | Precision | Recall | F1
GPT-4.1 | 244 | 135 | 899 | 101 | 71% | 64% | 67%
Gemini 2.0-flash | 144 | 235 | 998 | 2 | 99% | 38% | 55%
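The precision, recall, and F1 values reported in Tables 1, 4, and 7 follow directly from the confusion-matrix counts (TP, FN, TN, FP). As a convenience for readers checking the arithmetic, the short Python sketch below recomputes them; it is an illustrative helper only, not part of the evaluation code released with the article.

```python
# Illustrative helper (not the article's released evaluation code):
# recompute precision, recall, and F1 from the confusion-matrix counts
# reported in Tables 1, 4, and 7. TN is accepted for completeness but
# is not needed for these three metrics.
def detection_metrics(tp, fn, tn, fp):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: GPT-4.1 in the Bavarian castle experiment (Table 1, first row)
p, r, f1 = detection_metrics(tp=244, fn=135, tn=899, fp=101)
print(f"precision={p:.0%}  recall={r:.0%}  F1={f1:.0%}")  # ~71%, 64%, 67%
```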
Table 2. True positives detected by GPT-4.1 (bounding boxes in yellow; detection type and reason text given underneath each image). All images in this section have a ground size of 150 m × 150 m and are north-facing; © Microsoft Bing.

Burgruine Thanstein — Detection: “Castle tower”
“The circular stone structure with a central open area and surrounding buildings resembles a castle tower or keep, typical of medieval fortifications in Germany.”

Burgstall Dietrichstein — Detection: “Enclosure”
“The image shows a faint, roughly rectangular shape with straight edges and corners, which is unusual for natural formations. The pattern suggests the remains of an ancient man-made enclosure, possibly stone or earthworks, partially obscured by forest cover.”

Burg Hartenstein — Detection: “Castle”
“The image shows a complex of buildings with red roofs and surrounding walls, typical of a castle structure. The layout and fortification-like walls suggest an ancient man-made defensive structure, consistent with castles found in Germany.”

Burgruine Waldeck — Detection: “Ruins”
“The image shows a complex of stone structures with partial walls and foundations, typical of archaeological ruins. The layout and construction style suggest an ancient man-made site rather than natural formations or modern buildings.”

Burgruine Natternberg — Detection: “Castle”
“The structure has a distinct roof and walls visible from above, located in a clearing surrounded by dense forest, which is typical for castles or fortified buildings in Germany.”
Table 3. True positives detected by Gemini 2.0-flash (bounding boxes in yellow; detection type and reason text given underneath each image).

Ruine Botenlauben — Detection: “Castle”
“The structure has towers, walls, and what appears to be a keep, all consistent with the layout of a castle.”

Schloss Gosheim — Detections: “Castle” and “Enclosure”
“The image shows a curved wall-like structure surrounding an area, resembling a historic enclosure.” (reason for detection to the right)
“The structure appears to be a castle complex with multiple buildings and a distinct architectural style associated with historic castles.” (reason for detection to the left)

Burgstall Tännesberg — Detection: “Enclosure”
“The circular feature with a bank-like structure suggests a possible archaeological enclosure.”

Burgstall Wildstein — Detections: “Ruins” (2×)
“There appear to be structural ruins in the image. They are not very clear but seem to consist of stones arranged to create a structure.” (for detection to the left)
“A stone ruin in the form of a circular structure is visible next to the small ruins. The structure is composed of large stone blocks or rocks, which might have been the base of an ancient building or an enclosure.” (for detection to the right)

Burg Rothenfels — Detection: “Castle”
“The structure has the layout of a castle, with distinct architectural features such as towers, walls, and a large building complex. It is well-preserved and exhibits a complex layout, indicating it’s likely a castle.”
Table 4. Statistics for Angkorian temples experiment.
Model | TP | FN | TN | FP | Precision | Recall | F1
GPT-4.1 | 57 | 43 | 902 | 98 | 37% | 57% | 45%
Gemini 2.0-flash | 32 | 68 | 977 | 23 | 58% | 32% | 41%
Table 5. True positives detected by GPT-4.1 (bounding boxes in yellow; detection type and reason text given underneath each image). All images in this section have a ground size of 140 m × 140 m and are north-facing; © Microsoft Bing.

Detection: “Temple”
“The central structure has a rectangular shape with surrounding features that resemble an enclosure or platform, typical of ancient temple layouts in Cambodia. The presence of a distinct central mound and the surrounding cleared area suggest a man-made archaeological site rather than natural formations.”

Detection: “Temple”
“The central square structure with a distinct raised platform and symmetrical layout is characteristic of ancient Buddhist temples in Cambodia. The surrounding cleared area and presence of trees suggest a man-made site rather than a natural formation.”

Detection: “Moat”
“The feature is a rectangular water-filled ditch surrounding a central area, typical of moats found around ancient temples or settlements in Cambodia, indicating it may have been used for protection or water management.”

Detection: “Reservoir”
“The central dark green rectangular feature resembles a water body or reservoir, which is common in ancient Cambodian landscapes for irrigation or ritual purposes.”
Table 6. True positives detected by Gemini 2.0-flash (bounding boxes in yellow; detection type and reason text given underneath each image).

Detection: “Temple”
“The structure has a distinct square shape with what appears to be a central building surrounded by a moat, which is characteristic of Khmer temple architecture.”

Detection: “Temple”
“The structure appears to be a moated temple site common in Khmer architecture. The rectangular shape with a possible causeway and surrounding moat is characteristic of such sites. There appear to be multiple enclosures or buildings within the complex.”

Detection: “Temple”
“The image shows a rectangular structure with a central elevated area surrounded by what appears to be a moat or embankment. The shape and layout are consistent with known Khmer temple complexes, although heavily overgrown.”
Table 7. Statistics for hillforts in LiDAR experiment.
Model | TP | FN | TN | FP | Precision | Recall | F1
GPT-4.1 | 286 | 14 | 187 | 813 | 26% | 95% | 42%
Gemini 2.0-flash | 149 | 151 | 934 | 66 | 69% | 50% | 58%
Table 8. True positives detected by GPT-4.1 (bounding boxes in yellow; detection type and reason text given underneath each image). All images in this section have a ground size of 768 m × 768 m and are north-facing. LiDAR © DEFRA; ID numbers starting with EN are taken from the Atlas of Hillforts.

EN0140: Blewburton Hill, Berkshire — Detection: “Hillfort”
“The image shows a large, elongated mound with multiple concentric earthwork banks and ditches typical of hillfort defensive structures. The shape and terracing suggest man-made fortifications rather than natural formations.”

EN0476: Camp Hill, Northumberland — Detection: “Hillfort”
“The feature is an enclosed elevated area with distinct embankments and possible internal structures, typical of hillforts in England, showing clear man-made defensive earthworks.”

EN3145: Nadbury Camp, Warwickshire — Detection: “Enclosure”
“The feature in the center of the image shows a distinct, roughly rectangular raised area with clear boundaries, which is characteristic of ancient enclosures. The surrounding terrain and field boundaries also support this interpretation as a man-made structure rather than a natural formation.”

EN1616: Castle Hill, Almondbury, Yorkshire — Detection: “Hillfort”
“The image shows a large, roughly oval-shaped earthwork with multiple concentric banks and ditches, typical of hillfort defensive structures. The internal divisions and the elevated position relative to the surrounding terrain further support this classification.”
Table 9. True positives detected by Gemini 2.0-flash (bounding boxes in yellow).

EN3836: Bury Hill, Upper Clatford — Detection: “Hillfort”
“The image shows a clear double-ditched circular enclosure, a classic feature of hillforts in England. The concentric rings suggest a defensive structure, and the elevated position (implied by the LiDAR data) further supports this identification.”

EN3590: Dudsbury Camp, Dorset — Detection: “Enclosure”
“A clearly defined circular earthwork is visible, indicating a high probability of an enclosure.”

EN1076: Chatton Law Camp, Northumberland — Detection: “Enclosure”
“A circular earthwork enclosure is visible. There are two concentric rings with what appears to be a clear ditch between them.”

EN1597: Barwick-in-Elmet, Yorkshire — Detections: “Enclosure” (2×)
“A circular earthwork enclosure is visible in the LiDAR data. The raised rim and depressed interior are consistent with archaeological enclosures.”
“A large semi-circular enclosure or possible earlier stage hillfort is visible in the LiDAR data. The raised rim suggests an earthwork construction.”

EN0406: Old Sarum, Wiltshire — Detection: “Hillfort”
“This is a clear example of a hillfort with multiple concentric rings of defenses. The LiDAR data clearly shows the raised banks and ditches characteristic of these structures. The central raised area also suggests a defended interior.”
Table 10. Data sources used in this article. All URLs last accessed on 6 August 2025.

Experiment 1 (Bavarian castles):
Coordinates: Bayerische Schlösser- und Seenverwaltung/Bayerische Vermessungsverwaltung; license: Creative Commons (CC BY-ND); available from https://gdk.gdi-de.org/geonetwork/srv/api/records/b1c27b44-f60d-497f-a8cf-b555033db245 (accessed on 3 October 2025).
Imagery: Microsoft Bing Satellite; license: Microsoft Bing Maps terms of use, available at https://www.microsoft.com/en-us/maps/product/print-rights (accessed on 3 October 2025); available from the Microsoft Bing Maps API (see https://learn.microsoft.com/en-us/bingmaps/rest-services/, accessed on 3 October 2025).

Experiment 2 (Cambodian temples):
Not publicly available due to the ethics of archaeological site protection.

Experiments 3 + 4 (English hillforts):
Coordinates: Atlas of Hillforts in Britain and Ireland [38]; license: Creative Commons (CC BY-SA 4.0); available from https://hillforts.arch.ox.ac.uk (accessed on 3 October 2025).
Imagery: British Environment Agency National LIDAR Programme; license: Open Government License (see https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/, accessed on 3 October 2025); available from https://environment.data.gov.uk/dataset/2e8d0733-4f43-48b4-9e51-631c25d1b0a9 (accessed on 3 October 2025).

Experiment 5 (Potsherds in drone video):
Video: Author J.L.; license: Creative Commons (CC BY-SA 4.0); available from the GitHub repository of this article (see above).
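For readers who wish to reproduce satellite-image chips similar to those used in Experiments 1 and 2, the sketch below shows one possible way to request a single aerial chip through the Bing Maps REST imagery service listed above. This is a hedged illustration rather than the workflow used in this article: the endpoint, parameters, and licensing terms should be verified against the Microsoft documentation linked in Table 10, and BING_MAPS_KEY is a placeholder for the reader's own API key.

```python
# Minimal sketch, assuming the Bing Maps REST static imagery endpoint as
# described in the documentation linked in Table 10. Not the authors' pipeline.
import requests

BING_MAPS_KEY = "YOUR_KEY_HERE"  # placeholder; obtain a key via the Bing Maps portal

def fetch_aerial_chip(lat, lon, zoom=18, size=(512, 512), out_path="chip.jpg"):
    """Download an aerial image centred on (lat, lon) at the given zoom level."""
    url = f"https://dev.virtualearth.net/REST/v1/Imagery/Map/Aerial/{lat},{lon}/{zoom}"
    params = {"mapSize": f"{size[0]},{size[1]}", "key": BING_MAPS_KEY}
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()  # fail loudly on key or quota problems
    with open(out_path, "wb") as f:
        f.write(resp.content)
    return out_path

# Example call; the ground footprint of a 512 px chip depends on zoom level
# and latitude (roughly 100-200 m per side at zoom 18-19 in mid-latitudes).
# fetch_aerial_chip(49.0, 12.0, zoom=18)
```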