Interactive, Shallow Machine Learning-Based Semantic Segmentation of 2D and 3D Geophysical Data from Archaeological Sites

Verdonck, Lieven; Dabas, Michel; Bui, Marc

doi:10.3390/rs17173092


by Lieven Verdonck ¹,²,³,*, Michel Dabas ² and Marc Bui ⁴

¹ Faculty of Classics, University of Cambridge, Cambridge CB3 9DA, UK
² Archéologie et Philologie d’Orient et d’Occident, UMR 8546 CNRS-ENS-EPHE (PSL), École Normale Supérieure, 75230 Paris CEDEX 05, France
³ Department of Archaeology, Ghent University, 9000 Ghent, Belgium
⁴ Archéologie et Philologie d’Orient et d’Occident, UMR 8546 CNRS-ENS-EPHE (PSL), École Pratique des Hautes Études, 75230 Paris CEDEX 05, France
* Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(17), 3092; https://doi.org/10.3390/rs17173092

Submission received: 15 May 2025 / Revised: 21 August 2025 / Accepted: 29 August 2025 / Published: 4 September 2025

(This article belongs to the Special Issue Multiscale and Multitemporal High Resolution Remote Sensing for Archaeology and Heritage: From Research to Preservation)


Abstract

In recent decades, technological developments in archaeological geophysics have led to growing data volumes, so that an important bottleneck is now at the stage of data interpretation. The manual delineation and classification of anomalies are time-consuming, and different methods for (semi-)automatic image segmentation have been proposed, based on explicitly formulated rulesets or deep convolutional neural networks (DCNNs). So far, these have not been used widely in archaeological geophysics because of the complexity of the segmentation task (due to the low contrast between archaeological structures and background and the low predictability of the targets). Techniques based on shallow machine learning (e.g., random forests, RFs) have been explored very little in archaeological geophysics, although they are less case-specific than most rule-based methods, do not require the large training sets that DCNNs need, and can easily handle 3D data. In this paper, we show their potential for geophysical data analysis. For the classification on the pixel level, we use ilastik, an open-source segmentation tool developed in medical imaging. Algorithms for object classification, manual reclassification, post-processing, vectorisation, and georeferencing were brought together in a Jupyter Notebook (version 7.3.2), available on GitHub. To assess the accuracy of the RF classification applied to geophysical datasets, we compare it with manual interpretation. A quantitative evaluation using the mean intersection over union metric results in scores of ~60%, which increase only slightly after the manual correction of the RF classification results. Remarkably, a similar score results from the comparison between independent manual interpretations. This observation illustrates that quantitative metrics are not a panacea for evaluating machine-generated geophysical data interpretation in archaeology, which is characterised by a significant degree of uncertainty. It also raises the question of how the semantic segmentation of geophysical data (whether carried out manually or with the aid of machine learning) can best be evaluated.

Keywords: geophysics; archaeology; semantic segmentation; shallow machine learning; random forest; geophysical data interpretation; semi-automated interpretation; archaeological prospection; ground-penetrating radar survey; magnetometry

1. Introduction

Geophysical methods record variations in subsurface properties which may point to archaeological traces. This occurs by means of sensors at or near the surface. In this paper, data acquired with three geophysical techniques are analysed. Magnetometer survey (the most frequently used method) is based on the measurement of local differences in the Earth’s magnetic field. These differences are caused by heating above 600 °C (e.g., kilns) or by the enhancement of the magnetic properties of iron oxides (e.g., by fires or bacteria in organic waste) [1,2]. Ground-penetrating radar (GPR) is based on the transmission of electromagnetic waves (microwaves) into the soil. When the wave meets a transition between layers or structures with contrasting electrical properties (mainly related to the moisture content, e.g., a wall foundation contrasting with the surrounding soil), part of its energy is reflected. By recording reflections originating from different depths, we can map buried structures in 3D [3,4]. Earth resistance survey inserts an electrical current into the ground by means of two metal probes and measures the resulting voltage difference with two additional probes. By dividing the voltage by the current, the resistance encountered by the current can be determined. The resistance again strongly depends on the moisture content: a wall or floor structure retains little moisture and causes a contrast compared to the surrounding soil [5]. A fourth widely used technique, frequency-domain electromagnetic prospection (FDEM), simultaneously measures magnetic and electrical soil properties [6].

The scale, sample density, sensitivity, and speed of geophysical measurements in archaeology have in recent decades been enhanced by technological developments such as the use of vehicle-towed sensor arrays, faster electronics, and centimetre-precise positioning. This has led to growing data volumes which can be processed and visualised efficiently by means of increasingly powerful computers and software. Consequently, an important bottleneck is now at the stage of data interpretation. To transform geophysical data into archaeological information, it is usually assumed that some form of interpretative mapping is necessary [7]. Based on this, questions regarding the character, chronology, and evolution of the human activity at the investigated site or landscape can be answered by juxtaposing geophysical data with other sources of information, such as remote sensing data, historical sources, or excavations. The first step in the mapping process is the delineation of the significant geophysical anomalies. In a second step, the anomalies can be classified, which can take many forms. Most often, they are classified as basic archaeological structures (e.g., pits, ditches, roads, walls, pavements, kilns, drains, postholes, areas with debris), geological structures, and recent disturbances (e.g., [8]). In some studies, classes are based on the strength and polarity of the anomalies [9,10]. As part of a thorough archaeological interpretation, the function and chronology of buildings or settlement parts can be analysed (e.g., [11]). When performed manually, delineation and classification are time-consuming, and different methods for the automatic or semi-automatic segmentation of geophysical datasets have been proposed. 
These can broadly be divided into two categories: ‘hand-engineered’ algorithms, where the user formulates the rules to detect the relevant structures, and machine learning-based techniques, where an algorithm is trained on the basis of examples, so that it develops a model which allows it to analyse new datasets without further human intervention.

In some papers belonging to the first category (employing rule-based algorithms), the geophysical map in raster format is first converted to a binary image by applying a user-defined threshold. Regions exceeding the threshold are converted to vector polygons and can be further processed, for example, to eliminate small objects or merge neighbouring polygons into a single one [12]. After vectorisation, the properties of the polygons, mostly their size and shape (e.g., roundness), are analysed to classify them into archaeological object types, for example walls or pits [13]. In Pregesbauer et al. [14], the magnetogram was segmented by merging pixels with similar amplitudes. When the resulting raster objects reached a certain predefined scale, they were classified as magnetic structures or background based on their size and magnetic value, and vectorisation occurred only as the final step. Hinterleitner et al. [15] first classified iron litter and pit structures based on the amplitude difference and the distance between neighbouring magnetic maxima and minima and subsequently vectorised these anomalies. After transforming magnetometer results to pseudo-gravity data, Linford and Linford [16] created polygons by connecting local maxima in the magnitude of the horizontal gradient. Verdonck [17] matched rectangular templates to GPR depth slices and applied a threshold to the resulting correlation image. The regions of interest were converted to 2D polygons and, after vertical extrusion, to 3D volumes. A full 3D approach to detect wall structures was proposed by Leckebusch et al. [18], who calculated the 3D gradient of GPR data volumes and fitted planar surfaces to the anomalies using least median of squares regression and analytical regression. The resulting planes were combined, and the corner points defined 3D vector objects.

At the other end of the spectrum, deep convolutional neural networks (DCNNs) [19] are powerful machine learning algorithms capable of achieving great accuracy and speed when performing the semantic segmentation of images containing urban or natural scenes in a fully automatic way [20,21]. A drawback is that they rely on very large, manually labelled training sets [22]. An experiment with a DCNN for the semantic segmentation of archaeological GPR depth slices was carried out by Küçükdemirci and Sarris [23]. They employed U-Net, a commonly used network developed in biomedical imaging, and trained it with 2000 samples in which they manually labelled two classes (structures of archaeological interest and the background where there is an absence of structures). Although the results are promising and DCNNs will most probably play an important role in geophysical data interpretation in the future, they still fail to extract some structures identified by human experts, mainly due to the absence of large quantities of annotated data covering all possible anomaly classes in archaeological geophysics [24].

So far, none of the above methods have been widely used in archaeological geophysics, for several reasons. The most obvious one is the complexity of the segmentation task. Compared to, e.g., medical images, geophysical datasets are often characterised by small amplitude differences between the archaeological structures and the background without significant anomalies, and the contrast is weakened by the presence of noise (e.g., instrument noise, variations caused by surface unevenness or external disturbances [25]). In many cases where the classification seems relatively straightforward for a human interpreter, there is a large amplitude range which could belong to either class (Figure 1). This makes the distinction difficult for a computer algorithm. Moreover, in geophysics the relative spatial resolution (considering the target dimension) is usually low, and the shape and dimensions of the targets are not very predictable. More widespread adoption may also be hindered by the fact that software is not publicly available, that the algorithms are difficult to implement without experience in programming, or that they rely on powerful computational resources. Given the difficulty of extracting archaeological information with high accuracy in a completely automated way or using a widely applicable rule set, solutions allowing the human expert to easily interact with the algorithm and manually complement and modify the results seem to be the way forward.

In recent years, several free and open-source tools for medical image segmentation and classification have been developed, which can at least partially overcome these limitations. They are based on machine learning but require only a few user annotations, are interactive, have an intuitive graphical user interface (GUI), can handle 2D and 3D data nearly in real time, and do not require programming expertise. To our knowledge, these shallow machine learning techniques have not been used for the semantic segmentation of geophysical maps in archaeology. In the present paper, we describe how such algorithms can be used and adapted to interpret near-surface geophysical data. We show their potential by applying them to archaeological datasets collected with the three abovementioned geophysical methods and discuss the limitations and possible improvements.

2. Interactive Image Segmentation Based on Shallow Machine Learning

To perform computer vision tasks such as image classification, object detection, or semantic segmentation, machine learning algorithms first calculate certain properties or ‘features’ of the input data (e.g., a geophysical image). Deep neural networks create these features automatically: they apply simple modules that transform the raw data into an internal representation, which becomes increasingly abstract at higher levels of representation [26]. This makes them powerful but comes at the cost of requiring large quantities of densely labelled training data. In contrast, in shallow machine learning, the classifier learns from a set of predefined features, which emphasise different aspects of the image that are useful for the discrimination. Because shallow machine learning techniques require only sparse user annotations, they often represent an optimal compromise between training speed and prediction accuracy (i.e., how well the learnt model can generalise to the segmentation of new, unseen data). In medical imaging, shallow machine learning-based semantic segmentation has been implemented in free software tools such as ilastik [27], Microscopy Image Browser [28], Trainable Weka Segmentation (TWS) [29], and LABKIT [30]; the last two are plugins for the Fiji image processing toolbox [31]. These tools include filters to extract a range of generic features belonging to different categories: noise reduction filters such as Gaussian smoothing or the bilateral filter [32], edge detectors (e.g., the Laplacian filter, the gradient magnitude, the Gabor filter, or the difference of Gaussians), and filters extracting texture information (e.g., the calculation of the eigenvalues of the Hessian matrix and of the structure tensor). Before the application of a filter, Gaussian smoothing at different scales is performed.
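As an illustration of what such a predefined feature set can look like, the following minimal sketch computes a stack of generic per-pixel features at several Gaussian scales with SciPy. It is not the implementation used by ilastik or the other tools: the function names, the scale values, the 1.6 ratio in the difference of Gaussians, and the outer smoothing scale of the structure tensor are assumptions chosen for illustration, and the input is a synthetic 2D array.

```python
import numpy as np
from scipy import ndimage as ndi


def symmetric_2x2_eigenvalues(a, b, c):
    """Eigenvalues (largest first) of [[a, b], [b, c]], evaluated per pixel."""
    mean = 0.5 * (a + c)
    half_gap = np.sqrt((0.5 * (a - c)) ** 2 + b ** 2)
    return mean + half_gap, mean - half_gap


def pixel_features(image, sigmas=(1.0, 2.0, 4.0)):
    """Return an array (rows, cols, 8 * len(sigmas)) of generic pixel features.

    The scales in `sigmas` are illustrative assumptions, not ilastik defaults.
    """
    image = np.asarray(image, dtype=float)
    features = []
    for s in sigmas:
        smooth = ndi.gaussian_filter(image, s)
        features.append(smooth)                                     # Gaussian smoothing
        features.append(ndi.gaussian_laplace(image, s))             # Laplacian of Gaussian
        features.append(ndi.gaussian_gradient_magnitude(image, s))  # gradient magnitude
        features.append(smooth - ndi.gaussian_filter(image, 1.6 * s))  # difference of Gaussians

        # Hessian matrix from second-order Gaussian derivatives, then its eigenvalues.
        h_rr = ndi.gaussian_filter(image, s, order=(2, 0))
        h_rc = ndi.gaussian_filter(image, s, order=(1, 1))
        h_cc = ndi.gaussian_filter(image, s, order=(0, 2))
        features.extend(symmetric_2x2_eigenvalues(h_rr, h_rc, h_cc))

        # Structure tensor: locally averaged products of first derivatives.
        g_r = ndi.gaussian_filter(image, s, order=(1, 0))
        g_c = ndi.gaussian_filter(image, s, order=(0, 1))
        outer = 0.5 * s  # assumed outer (averaging) scale
        t_rr = ndi.gaussian_filter(g_r * g_r, outer)
        t_rc = ndi.gaussian_filter(g_r * g_c, outer)
        t_cc = ndi.gaussian_filter(g_c * g_c, outer)
        features.extend(symmetric_2x2_eigenvalues(t_rr, t_rc, t_cc))
    return np.stack(features, axis=-1)


# Synthetic stand-in for a geophysical map: noise plus one linear anomaly.
rng = np.random.default_rng(0)
demo_map = rng.normal(size=(32, 40))
demo_map[10:14, 5:35] += 3.0
feature_stack = pixel_features(demo_map, sigmas=(1.0, 2.0))
```

Each pixel is then described by one feature vector (here 16 values), which is the input a shallow classifier learns from.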

The user provides training data by drawing sparse manual annotations (brush strokes) on pixels representing two classes (archaeological structures and background; Figure 2a), although more classes can be defined. The classifier is trained almost in real time and predicts the probability that each pixel in the image belongs to a certain class (Figure 2b). Consequently, the user receives immediate feedback and can improve the classification by adding more labels, in an iterative way, preferably where the classification was wrong or at transitions between different classes, where uncertainty is high (Figure 2c) [27]. This interactivity also plays an important role in the segmentation algorithm presented by Urschler et al. [33], which was applied to archaeological 3D GPR data by Bornik and Neubauer [34]. In their approach, user annotations form seed regions, and a weighted total variation formulation attempts to create homogenous regions by minimising the total variation (the total gradient magnitude of the image) and only allowing changes near strong edges in the original image. By contrast, the method presented here describes each pixel of the input data with different features. From a few observations of pixel features and the corresponding labels, provided in the training data, the learning algorithm derives a function or decision surface which allows it to discriminate between different classes and then to predict the class of unobserved pixels from their feature set.

The default classifier in the abovementioned shallow machine learning tools is the random forest (RF), which derives models that can easily be generalised to unseen data without overfitting the training set and is computationally efficient so that it can deal with large datasets. The RF is usually composed of 100 to 200 decision trees. A decision tree consists of subsequent splits (tree nodes), which recursively partition the training set by applying thresholds to its different features, based on the annotations provided by the user. In this way, homogeneous subsets, corresponding to the user-defined classes, are created. To classify a new pixel with known feature values, it is propagated through the tree [35]. Classification accuracy can be improved by using a collection of decision trees (an RF). To create a weakly correlated, strongly randomised forest, the trees are grown from a bootstrap sample (a random sample with replacement, i.e., a training pixel can be used to grow more than one tree) drawn from the complete training set. Moreover, to select the best split at each node, a random selection of features is used instead of all features [36,37]. By aggregating the votes of every tree in the forest, the probability that a pixel belongs to a particular class is obtained. Thresholding the probability map results in the segmentation of the image into regions of archaeological relevance and background (Figure 2d). In archaeological remote sensing, RFs have been used, for example, by Orengo et al. [38] to detect settlement mounds in multisource and multitemporal satellite data.
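The training and prediction steps described above can be sketched with scikit-learn's `RandomForestClassifier`. This is a minimal illustration under stated assumptions, not the VIGRA-based forest used in ilastik: the image is synthetic, the three features (raw value and two Gaussian-smoothed versions) are a deliberately small stand-in for a full feature set, and the label codes and helper name `train_and_predict` are hypothetical.

```python
import numpy as np
from scipy import ndimage as ndi
from sklearn.ensemble import RandomForestClassifier

BACKGROUND, STRUCTURE = 1, 2  # hypothetical label codes; 0 marks unlabelled pixels


def train_and_predict(features, annotations, n_trees=100, seed=0):
    """Fit an RF on the sparsely annotated pixels and predict class probabilities everywhere."""
    labelled = annotations > 0
    forest = RandomForestClassifier(
        n_estimators=n_trees,   # 100 trees, the lower end of the range quoted in the text
        max_features="sqrt",    # random subset of features tried at each split
        bootstrap=True,         # each tree grown from a sample drawn with replacement
        random_state=seed,
    )
    forest.fit(features[labelled], annotations[labelled])
    flat = features.reshape(-1, features.shape[-1])
    proba = forest.predict_proba(flat).reshape(annotations.shape + (-1,))
    return forest, proba


# Synthetic map: a bright square "structure" on a noisy background.
rng = np.random.default_rng(0)
image = np.zeros((40, 40))
image[10:30, 10:30] = 1.0
image += rng.normal(scale=0.1, size=image.shape)

features = np.stack(
    [image, ndi.gaussian_filter(image, 1.0), ndi.gaussian_filter(image, 2.0)], axis=-1
)

# Sparse "brush strokes": two background strokes and one structure stroke.
annotations = np.zeros(image.shape, dtype=int)
annotations[2:4, 2:38] = BACKGROUND
annotations[36:38, 2:38] = BACKGROUND
annotations[19:21, 12:28] = STRUCTURE

forest, proba = train_and_predict(features, annotations)
structure_probability = proba[..., 1]  # column order follows forest.classes_ = [1, 2]
```

The per-pixel probability returned by `predict_proba` is the fraction of the forest supporting each class (averaged over trees), which corresponds to the probability map that is subsequently thresholded.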

After segmentation at the pixel level, objects can be created by merging segmented pixels that are connected horizontally or vertically (1-connectivity) or also diagonally (2-connectivity) [39]. Features of these objects (for example the size, mean intensity, or convexity) can be calculated to facilitate classification. This resembles most rule-based object classification approaches in archaeological geophysics, except that it is entirely machine learning-based: the operator selects the features to be employed for object classification, but the classification is generated by an RF, trained with sparse labels drawn by the user on objects belonging to different classes.
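The effect of the connectivity choice can be seen in a small sketch using `skimage.measure.label`. The binary mask below is a made-up example in which two square regions touch only at a corner.

```python
import numpy as np
from skimage.measure import label, regionprops

# Two 2 x 2 blocks touching diagonally, plus one isolated pixel.
mask = np.array(
    [
        [1, 1, 0, 0, 1],
        [1, 1, 0, 0, 0],
        [0, 0, 1, 1, 0],
        [0, 0, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ],
    dtype=bool,
)

# 1-connectivity: only horizontal and vertical neighbours are merged.
labels_1, n_objects_1 = label(mask, connectivity=1, return_num=True)

# 2-connectivity: diagonal neighbours are merged as well.
labels_2, n_objects_2 = label(mask, connectivity=2, return_num=True)

# Object size (in pixels) as a first simple object feature.
areas_2 = sorted(int(region.area) for region in regionprops(labels_2))
```

With 1-connectivity the two blocks remain separate objects, whereas with 2-connectivity they are merged into a single object of eight pixels.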

3. Interpreting Geophysical Data with Interactive Segmentation Tools

In this section, we propose a strategy for the computer-aided interpretation of geophysical data, based on shallow machine learning methods. For the pixel classification in 2D and 3D (Section 3.1 and Section 3.5, respectively), the workflow in ilastik was followed [40,41]. Similar results were obtained with the other toolkits for bio-image segmentation described in Section 2, which are also based on an RF classifier (TWS, LABKIT, and Microscopy Image Browser). Figure S1a,b in the Supplementary Data shows the pixel classification of the data shown in Figure 2a (this article) using TWS and LABKIT, respectively, carried out using the same manual annotations as in Figure 2a and the same features as those used in ilastik (for a list of the features, see the Supplementary Data, Section A.1). The result is virtually identical to that obtained with ilastik (Figure 2d). Also, when the full range of features available in TWS is used (including, for example, the bilateral filter [32], the Kuwahara filter [42], and a range of Gabor filters [43] for improved edge detection), the result is very similar (Figure S1c). The decision to use ilastik was based on the fact that it offers the possibility to apply advanced thresholding of the probability maps and to create individual objects from the resulting segmentation. Moreover, it provides a fast computation of the pixel features (also in 3D) and RF classification (see Section 3.1).

The rest of the workflow in 2D (Section 3.2, Section 3.3 and Section 3.4) and 3D (Section 3.5) was implemented as a Jupyter Notebook (version 7.3.2; see the Supplementary Data and https://github.com/lrverdon/Shallow-Machine-Learning-for-Archaeological-Geophysics, accessed on 28 August 2025), making use of the Python libraries scikit-learn (version 1.6.1) and scikit-image (version 0.24.0) for machine learning and image processing, respectively [39,44], and of the napari viewer for the visualisation and annotation (version 0.4.18) [45].

3.1. Pixel Classification and Creation of Objects

The ilastik software supports a number of image formats; for our experiments we imported geophysical maps in TIFF, PNG, JPG, and GeoTIFF formats. In the software, six generic image features can be calculated, at different scales. To select the relevant features and scales resulting in an accurate segmentation, we compared different techniques [46,47,48,49], available in ilastik (Supplementary Data, Section A.2). Because the accuracy was highest when selecting all available features and scales, and this did not considerably slow down the segmentation, we used the complete feature set for all case studies presented in this paper. We tried classifiers other than the RF (e.g., k-nearest neighbours and support vector machines), whose implementation in ilastik is based on scikit-learn, but the training was considerably slower compared to the VIGRA parallelised RF implementation [50].

We trained the RF by drawing manual annotations belonging to the ‘archaeological structures’ and ‘background’ classes (Figure 2a). The RF classifier almost instantaneously produces class probability maps for the portion of the geophysical map that is displayed (Figure 2b), if that portion is small enough (not more than a few thousand pixels in the x- and y-direction). By zooming in on different image portions and adding labels, a good compromise between accurately detecting the archaeological structures and minimising the number of false positive pixels was usually reached in less than ten iterations. Attempts to distinguish between more than two classes in the pixel classification phase were not successful. Better results were obtained by further refining the classification at the object level (Section 3.2).

The probability maps predicted by the RF were converted into a segmentation by applying hysteresis thresholding [51] to the ‘archaeological structures’ class (Figure 3a). Hysteresis thresholding selects pixels with probability values above a low threshold if they are connected to at least one pixel with a probability above a high threshold. This allows regions with low probability to be eliminated, while low-probability margins of objects with high probability are preserved. Of the objects created in that way (Figure 3b), only those larger than five pixels or ten voxels were kept (size threshold applied to magnetometer and 3D GPR data, respectively).
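A minimal sketch of this step, using `skimage.filters.apply_hysteresis_threshold` followed by a size filter, is given below. The low and high thresholds (0.3 and 0.7) and the function name `segment_probability` are assumptions for illustration; the five-pixel size limit follows the text, and the probability map is synthetic.

```python
import numpy as np
from skimage.filters import apply_hysteresis_threshold
from skimage.measure import label


def segment_probability(prob, low=0.3, high=0.7, min_size=5, connectivity=2):
    """Hysteresis-threshold a probability map and keep objects larger than min_size pixels."""
    mask = apply_hysteresis_threshold(prob, low, high)
    labels = label(mask, connectivity=connectivity)
    sizes = np.bincount(labels.ravel())
    keep = sizes > min_size   # "larger than five pixels"
    keep[0] = False           # label 0 is the background
    return keep[labels]


prob = np.zeros((12, 12))
prob[1:5, 1:5] = 0.5    # object A: low-probability margin ...
prob[2:4, 2:4] = 0.9    # ... around a high-probability core
prob[7:11, 1:5] = 0.5   # object B: low probability only
prob[1, 8:10] = 0.9     # object C: high probability but only two pixels

segmentation = segment_probability(prob)
```

In this example object A is kept together with its low-probability margin, object B is discarded because it never exceeds the high threshold, and object C is discarded by the size filter.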

We also implemented the pixel classification in the Jupyter Notebook, using napari, which equally supports a large number of input image formats (including GeoTIFFs). The results were nearly identical to the ones obtained with ilastik (compare Figure 2d with Figure S1d in the Supplementary Data), but the computation time was much longer; as such it is recommended to perform the pixel classification of large 2D images or 3D data in ilastik and export the binary segmentation image, for example, in HDF5, a format for storing and managing large amounts of data and metadata in a single file using a hierarchical system [52]. The processing steps below, which are not available in ilastik or have been adapted to geophysical data analysis, can then be performed in the Jupyter Notebook.

3.2. Object Classification

After hysteresis thresholding, the created objects are classified. The approach taken here assists in the classification of basic structures (e.g., pits, walls, ditches, non-anthropogenic structures, recent anomalies), for which features such as the size or shape can be calculated. A more profound archaeological classification (e.g., assigning building functions) can only be carried out by a human interpreter.

The objects created through the pixel classification and thresholding may encompass different archaeological classes (e.g., a wall and a floor are connected as a single object), which complicates the classification. To divide objects into smaller, more homogeneous parts, watershed segmentation can be performed (Figure 4). First, for each pixel in the object, the distance to the background is calculated. From this distance transformation, the opposite is taken, and the local minima are used for constructing catchment basins. The object is then separated along the lines where the basins meet, similarly to a geological watershed [53]. The only parameter to be set by the user is the footprint (the region within which the local minimum is searched for), which allows for adjusting the size of the final objects. The objects after the watershed segmentation should not be too small, so that they preserve their discriminative characteristics (e.g., low circularity of a ditch, as opposed to a pit).
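The watershed splitting described above can be sketched as follows, closely following the standard scikit-image recipe (distance transform, local maxima as markers, watershed on the negated distance). The two overlapping discs are a synthetic stand-in for a merged object, and the footprint size of 7 pixels and the function name `split_object` are assumptions for illustration.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed


def split_object(mask, footprint_size=7):
    """Split a binary object into parts along the watershed lines of its distance transform."""
    # Distance of every object pixel to the background.
    distance = ndi.distance_transform_edt(mask)
    # Local maxima of the distance (= local minima of its opposite) seed the basins.
    footprint = np.ones((footprint_size,) * mask.ndim)
    peaks = peak_local_max(
        distance, footprint=footprint, labels=mask.astype(int), exclude_border=False
    )
    markers = np.zeros(mask.shape, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    # Flood the negated distance map from the markers, restricted to the object.
    return watershed(-distance, markers, mask=mask)


# Two overlapping discs forming a single connected object.
rows, cols = np.mgrid[0:40, 0:52]
disc_1 = (rows - 20) ** 2 + (cols - 16) ** 2 <= 12 ** 2
disc_2 = (rows - 20) ** 2 + (cols - 36) ** 2 <= 12 ** 2
blob = disc_1 | disc_2

segments = split_object(blob, footprint_size=7)
```

A larger footprint suppresses nearby local maxima and therefore yields fewer, larger parts; a smaller footprint has the opposite effect.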

The Jupyter Notebook makes use of the collection of object features available in scikit-image, which is more comprehensive than in ilastik. Furthermore, other features, not included in scikit-image, can be calculated. For example, in Figure 3c,d the objects were classified using the standard features ‘area’, ‘eccentricity’, and ‘solidity’, and an additional feature was created to discriminate small magnetic dipoles. It subtracts the object’s minimum intensity from its maximum intensity and divides the difference by the area of the object. The RF is then trained by drawing a few annotations on objects belonging to different classes (Figure 3c), and the classification is improved in an iterative way until the final result is obtained (Figure 3d).
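The following sketch illustrates how such an object classification could be set up with scikit-image and scikit-learn. It is not the code of the Jupyter Notebook: the rectangular objects, their intensities, the class codes, and the helper name `object_features` are made up for illustration. The additional feature is computed as described in the text (maximum minus minimum intensity, divided by the area).

```python
import numpy as np
from skimage.measure import label, regionprops
from sklearn.ensemble import RandomForestClassifier

PIT, DITCH, DIPOLE = 1, 2, 3  # hypothetical class codes


def object_features(region, data):
    """Standard features plus the intensity-range-per-area feature described in the text."""
    rr, cc = region.coords.T
    values = data[rr, cc]
    contrast = (values.max() - values.min()) / region.area
    return [region.area, region.eccentricity, region.solidity, contrast]


# Synthetic objects: (row, col, height, width, class, annotated by the user?)
objects = [
    (2, 2, 5, 5, PIT, True), (2, 12, 7, 7, PIT, True), (2, 24, 6, 6, PIT, False),
    (14, 2, 3, 30, DITCH, True), (20, 2, 4, 24, DITCH, True), (27, 2, 3, 26, DITCH, False),
    (34, 2, 3, 3, DIPOLE, True), (34, 10, 3, 4, DIPOLE, True), (34, 20, 3, 3, DIPOLE, False),
]
data = np.zeros((40, 40))
truth = np.zeros((40, 40), dtype=int)
annotated = np.zeros((40, 40), dtype=bool)
for r, c, h, w, cls, is_annotated in objects:
    data[r:r + h, c:c + w] = 5.0
    if cls == DIPOLE:                      # strong negative and positive lobes
        data[r:r + h, c] = -20.0
        data[r:r + h, c + w - 1] = 20.0
    truth[r:r + h, c:c + w] = cls
    annotated[r:r + h, c:c + w] = is_annotated

regions = regionprops(label(truth > 0, connectivity=2))
X = np.array([object_features(region, data) for region in regions])
y = np.array([truth[tuple(region.coords[0])] for region in regions])
has_label = np.array([annotated[tuple(region.coords[0])] for region in regions])

# Train on the annotated objects only, then classify every object.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X[has_label], y[has_label])
predicted = forest.predict(X)
```

In an interactive setting, objects that are still misclassified after a first round would receive additional annotations and the forest would be retrained.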

3.3. Manual Classification and Correction of Magnetic Anomaly Boundaries

The pixel classification by the RF (Section 3.1) may result in an imperfect binary image, and objects may be classified wrongly because the classes are not perfectly separated by the object features (Section 3.2). In our workflow, these errors can be curated manually, using the same tools as for the annotation. First, in the Jupyter Notebook, entire objects can be reclassified by labelling them (Figure 5a,b). Secondly, the correction can occur at the pixel level, reclassifying parts of existing objects, changing their extent, or creating entirely new objects. The arrows in Figure 5d point to examples of faint structures not detected by the pixel classifier and manually classified as ditches.

Tools such as ilastik and scikit-learn allow for a fast, interactive image segmentation based on visual analysis. However, geophysical maps are not always true representations of the subsoil due to, for instance, anomalies extending beyond the real size of the buried objects. To obtain an optimal interpretation, existing methods taking into account the geophysical characteristics of the anomalies, allowing a more accurate reconstruction of the objects’ dimensions, should be included in the workflow [54]. For example, to determine the edges of objects in magnetometer data, magnetic anomalies can be converted to pseudogravity anomalies. The magnitude of the horizontal gradient derived from these pseudogravity anomalies can be calculated, and local maxima can be identified and connected, to find the edges of the buried structures [16]. Alternatively, the full width at half maximum of the magnetic anomalies can be calculated, which is an indicator of the structures’ boundaries [55]. In this study, we determined the boundaries of the anomalies by calculating the horizontal gradient magnitude of the magnetometer data (the square root of the sum of the two squared horizontal derivatives [56,57]) without a pseudogravity transformation. The horizontal gradient magnitude shows the edges more clearly than the magnitude of the analytic signal (the vector addition of the two horizontal derivatives and the vertical derivative of the magnetic field [58]). To obtain the corrected boundaries of objects in magnetometer data, we used a heuristic approach, which consisted of selecting within each object the largest values of the horizontal gradient. Those values were used to calculate the threshold for the delineation of the anomaly. This made the edges of the anomalies correspond well with the maxima in the horizontal gradient map (Figure 5c; see the Supplementary Data, Section A.3, for more details and an illustration).
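The horizontal gradient magnitude itself is straightforward to compute on a regular grid. The sketch below uses finite differences from NumPy; it covers only the gradient calculation, not the heuristic threshold selection detailed in the Supplementary Data. The function name is hypothetical, the 0.25 m default spacing is taken from the grid used for the Boviolles data, and both test fields (a plane and a smoothed square) are synthetic and not physically modelled magnetic anomalies.

```python
import numpy as np
from scipy import ndimage as ndi


def horizontal_gradient_magnitude(data, dx=0.25, dy=0.25):
    """sqrt((dT/dx)^2 + (dT/dy)^2) on a regular grid with spacings dx, dy (in metres)."""
    d_dy, d_dx = np.gradient(np.asarray(data, dtype=float), dy, dx)
    return np.hypot(d_dx, d_dy)


# Check on a plane T = 2x + 3y, whose gradient magnitude is sqrt(13) everywhere.
rows, cols = np.mgrid[0:20, 0:30]
plane = 2.0 * (cols * 0.25) + 3.0 * (rows * 0.25)
hgm_plane = horizontal_gradient_magnitude(plane)

# Smoothed square anomaly: the gradient magnitude peaks along its edges.
anomaly = np.zeros((41, 41))
anomaly[10:31, 10:31] = 1.0
anomaly = ndi.gaussian_filter(anomaly, 2.0)
hgm_anomaly = horizontal_gradient_magnitude(anomaly)
edge_column = int(np.argmax(hgm_anomaly[20]))  # strongest gradient along the central row
```

Along the central row of the synthetic anomaly, the maximum of the gradient magnitude falls on one of the two vertical edges of the square, which is the property exploited for boundary correction.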

3.4. Post-Processing and Vectorisation

In Python libraries such as scikit-image, utilities operating on raster images are available for refining the results obtained in the previous steps, e.g., morphological closing to fill gaps within regions classified as relevant or morphological opening to remove small objects resulting from noise and simplify the shape of the anomalies. For example, in Figure 5c, morphological closing was performed on the ditches to give them a more continuous appearance. The final step is to convert the objects to vector data (polygons). We extracted contours and converted them to polygons using the opencv (version 4.10.0) and shapely packages (version 2.0.7), respectively. Tools developed for semantic segmentation in medical imaging such as ilastik cannot store real-world coordinates. To enable exports as georeferenced vector files, the Jupyter Notebook reads the coordinate reference system and coordinates from the input image if this is a GeoTIFF and also provides the possibility to enter these manually. Using this information, vector files suitable for visualisation and analysis in a geographical information system (GIS) were created, in SHP format, by means of the geopandas library (version 1.0.1).
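The post-processing and vectorisation steps can be sketched as follows. To keep the example light on dependencies, it swaps in `scipy.ndimage` for the morphological operations and `skimage.measure.find_contours` plus plain NumPy for the contour extraction and coordinate transform, instead of the opencv, shapely, and geopandas packages used in the workflow. The function names, the origin coordinates, and the assumption of a north-up raster whose origin is the upper-left corner of the upper-left pixel are illustrative.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.measure import find_contours


def clean_mask(mask, size=3):
    """Morphological closing (fill small gaps) followed by opening (remove small objects)."""
    structure = np.ones((size, size), dtype=bool)
    closed = ndi.binary_closing(mask, structure=structure)
    return ndi.binary_opening(closed, structure=structure)


def mask_to_world_polygons(mask, x_origin, y_origin, pixel_size):
    """Trace object outlines and convert (row, col) vertices to real-world (x, y)."""
    polygons = []
    for contour in find_contours(mask.astype(float), 0.5):
        rows, cols = contour[:, 0], contour[:, 1]
        x = x_origin + (cols + 0.5) * pixel_size   # pixel centres sit at +0.5
        y = y_origin - (rows + 0.5) * pixel_size   # rows increase southwards
        polygons.append(np.column_stack([x, y]))
    return polygons


def polygon_area(xy):
    """Shoelace formula; coordinates are centred first to limit rounding errors."""
    x = xy[:, 0] - xy[:, 0].mean()
    y = xy[:, 1] - xy[:, 1].mean()
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))


# Synthetic segmentation: a 10 x 10 object with a small gap, plus one noise pixel.
raw = np.zeros((20, 20), dtype=bool)
raw[5:15, 5:15] = True
raw[9:11, 9:11] = False
raw[2, 17] = True

cleaned = clean_mask(raw)
polygons = mask_to_world_polygons(cleaned, x_origin=1000.0, y_origin=2000.0, pixel_size=0.25)
area_m2 = polygon_area(polygons[0])
```

The resulting vertex arrays could then be passed to a polygon constructor and written to a georeferenced vector file together with the coordinate reference system read from the input GeoTIFF.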

3.5. Three-Dimensional Workflow

The machine learning tools presented in this paper are capable of handling 3D geophysical datasets as volumes, instead of treating them as separate 2D images. The 3D workflow is similar to the one for the 2D case. A 3D data volume can be imported as a single file (e.g., a 3D TIFF) or as a stack of 2D images. All pixel features can be calculated in 3D. To obtain a good classification, annotations need to be made across the entire volume. In ilastik, class probability maps are predicted nearly in real time also when analysing 3D data, so that the close interaction between the user and the algorithm is maintained. The quality of the prediction can be verified in three orthogonal slicing planes. In scikit-image, most object features are defined in 2D and 3D, and this also holds true for all morphological operations, although the computation time can be longer. The result of the segmentation can be visualised in 3D in the napari viewer. Generating triangular surface meshes from the segmentation volume (one mesh for each class or one for all classes merged together) is not possible in napari but can be done in software packages such as ParaView. The mesh can be saved, for example, in the OBJ format for processing (e.g., mesh decimation), analysis, or visualisation in other software. To generate a mesh in real-world coordinates, the local coordinates used in napari need to be transformed. Care should be taken when performing a translation to large coordinates, as most free software (e.g., ParaView and Blender) uses single-precision floating-point numbers. These cannot store sufficient significant digits to guarantee the necessary precision for large coordinates. Consequently, object details are no longer correctly represented (e.g., coordinates larger than 10⁷ m are rounded off to 0.5 m). This problem can be avoided by using commercial software packages that work with double-precision numbers, such as Rhino. 
A workaround is to perform the translation manually by adding an offset to the coordinates of the vertices in the OBJ file (for a more in-depth discussion, see [59]).
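The precision problem and the idea behind the offset workaround can be illustrated with NumPy. The coordinates below are hypothetical; the sketch keeps the vertices in small local coordinates (safe in single precision) and applies the large offset in double precision, which is one way of reading the workaround rather than a description of any particular software.

```python
import numpy as np

# Near 1e7 m, adjacent single-precision values are about a metre apart,
# so a stored coordinate can be off by up to about half a metre.
spacing_single = float(np.spacing(np.float32(1e7)))
spacing_double = float(np.spacing(np.float64(1e7)))

easting = 10_000_000.37                      # hypothetical coordinate in metres
error_single = abs(float(np.float32(easting)) - easting)

# Hypothetical mesh vertices (x, y, z) in real-world coordinates.
vertices = np.array(
    [
        [10_000_000.37, 5_000_000.12, 1.50],
        [10_000_012.81, 5_000_003.64, 1.25],
        [10_000_007.05, 5_000_009.98, 0.75],
    ],
    dtype=np.float64,
)

# Naive conversion: the detail of the mesh is lost.
naive = vertices.astype(np.float32).astype(np.float64)
naive_error = np.abs(naive - vertices).max()

# Workaround: store small local coordinates in single precision and keep the
# offset separately, adding it back in double precision when needed.
offset = np.floor(vertices.min(axis=0))
local = (vertices - offset).astype(np.float32)
restored = local.astype(np.float64) + offset
restored_error = np.abs(restored - vertices).max()
```

With the offset applied separately, the remaining error is far below the centimetre level, whereas the naive single-precision conversion distorts the vertices by decimetres.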

4. Results

Below, we illustrate the potential of the proposed workflow by means of three datasets, collected with different geophysical methods at sites from different regions and periods. These datasets and the results presented in this paper are available at https://zenodo.org/records/15422270, accessed on 28 August 2025.

4.1. Boviolles

The first dataset was collected at the oppidum of Boviolles (Meuse, France), occupied from the end of the second until the end of the first century BCE, when the occupation shifted to a nearby, lower-lying Roman town. Its defensive structures enclose an area of ~67 ha and include a 7 m wide and 3 to 5 m high murus gallicus (a rampart with a grille of timber beams inside and a stone wall as its façade; Figure 6a, no. 1) [60]. Between 1999 and 2012, magnetometer prospections by Geocarta covered 53 ha of the fortification. They were conducted with different instruments: a Geometrics G858 caesium gradiometer (1999–2005) measuring the total field and motorised arrays of Foerster FEREX (2004–2008) and Grad-01-1000L Bartington fluxgate gradiometers (2009–2012) measuring the vertical component of the magnetic field, with vertical distances between the sensors of 0.8, 0.65, and 1 m, respectively. The sample interval along the lines varied between 0.02 and 0.10 m; the distance between the lines was 0.5 m. Notwithstanding these differences in equipment and field methods, the surveys were seamlessly integrated into a single map. Processing included subtracting the median from each line and interpolating the data onto a regular 0.25 m × 0.25 m grid. The magnetometer survey revealed a ~1200 m long and ~7 m wide ditch, which forms the northern boundary of the most elevated part of the fortification (Figure 6a, no. 2). It also detected smaller ditches, some of which may border streets (no. 3), and two parcel boundaries of more recent origins (no. 4). In the NW part, strong bipolar anomalies may originate in a recent fence. Numerous pits produce a clear contrast with the magnetically homogeneous Tithonian limestone geology [61]. They indicate cellars, wells, extraction pits, and artisanal structures and suggest a dense habitation which may have been divided into insulae (blocks) by the streets. 
Pits of different sizes and shapes and a few post holes were also brought to light by excavations, but no masonry structures were found, and no house plans were observed. The finds point to diverse artisanal practices (e.g., weaving, bone working, iron and copper metallurgy), and Italian amphorae are evidence of trade with the Mediterranean [60].
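The line-wise median subtraction mentioned in the processing description can be sketched as follows. This is a minimal illustration, not the processing code used for the survey; it assumes that each survey line is stored as one row of a 2D array (in nT), and the synthetic data and the function name `subtract_line_median` are hypothetical.

```python
import numpy as np

def subtract_line_median(data):
    """Subtract from each survey line (row) its own median, ignoring NaN gaps."""
    data = np.asarray(data, dtype=float)
    return data - np.nanmedian(data, axis=1, keepdims=True)

# Synthetic example: four lines with different constant offsets (striping).
rng = np.random.default_rng(0)
offsets = np.array([[0.0], [2.0], [-1.5], [0.7]])
lines = rng.normal(0.0, 0.5, size=(4, 200)) + offsets
destriped = subtract_line_median(lines)
print(np.median(destriped, axis=1))
```

Because the median is robust against a small number of strong anomalies on a line, the offset between lines is removed without much affecting the anomalies themselves, provided that they occupy only a minor part of each line.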

To perform the semantic segmentation, a TIFF file with the gradiometer measurements (in nT) was imported, and the methodology described in Section 3 was followed. The time needed to train the pixel and object classifiers is given in Table 1. For the 250 m × 250 m area indicated by red square no. 5 (Figure 6a), the workflow is illustrated by Figure 2, Figure 3 and Figure 5. The result after hysteresis thresholding shows that the ‘archaeological structures’ class includes all but the very subtle traces, which are hardly distinguishable from the background (Figure 3b). Details of archaeological interest, such as the interruption in the ditch indicating the passage of a road (Figure 6a, no. 6), are clearly represented (Figure 6b).

In Figure 7, we zoom in on two areas, ~1 ha each (for their location, see Figure 6a, nos. 7–8), to compare the proposed method for pixel classification (Figure 7b,e) with the application of a simple threshold (Figure 7c,f). As can be seen in Figure 7c, when setting the threshold at 1.75 nT, several pit structures detected by the RF classifier are not, or only partially, identified (see the arrows in Figure 7b). Slightly lowering the threshold to 1.5 nT leads to the selection of small noisy regions caused by striping artefacts (see the arrows in Figure 7f) that remain excluded in the results generated by our method (Figure 7e). This demonstrates the ability of the RF classifier in combination with hysteresis thresholding to select the relevant structures while also eliminating the noise. Employing a simple threshold would require additional processing steps to achieve the same result.

To assess the accuracy of the RF pixel and object classification, we zoom in on a 250 m × 250 m area (Figure 6a, no. 9; Figure 8a) and compare the result generated by the classifier (Figure 8b) with a detailed manual segmentation (Figure 8d), excluding the areas used for training the RF from the comparison. For this quantitative evaluation, we defined three classes (circular anomalies, linear anomalies, and background; bipolar anomalies were considered as background). We used the intersection over union (IoU) metric, defined for each class as IoU = |M ∩ A| / |M ∪ A|, where M is the set of pixels assigned to that class in the manual annotation, and A is the set of pixels assigned to it by the algorithm. The mean IoU (the average of the IoU metrics for each class) is 56.1%, but there are large differences between the scores of the three classes (see Table 2 and the confusion matrix in Figure 9). Whereas the IoU of the circular anomalies (pits) is 62.5%, the score of the ‘linear anomaly’ class (ditches) is only 6.5%. In the test area, almost no manually identified linear structures were detected by the machine classifier (Figure 8b,d) because the amplitude of these anomalies hardly exceeds the background. Moreover, during training, only clear examples of ditches were provided to the classifier.
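As an illustration of the metric, the per-class and mean IoU of two label images can be computed with a few lines of NumPy. The sketch below is not the evaluation code used for Table 2; the function name, the optional `valid` mask (standing in for the exclusion of training areas), and the toy label images are assumptions made for the example.

```python
import numpy as np

def iou_per_class(manual, algorithm, n_classes, valid=None):
    """Return the IoU of each class between two integer label images.

    `valid` is an optional boolean mask; pixels where it is False
    (e.g., areas used for training) are left out of the comparison.
    """
    manual = np.asarray(manual)
    algorithm = np.asarray(algorithm)
    if valid is None:
        valid = np.ones(manual.shape, dtype=bool)
    scores = []
    for c in range(n_classes):
        m = (manual == c) & valid
        a = (algorithm == c) & valid
        union = np.logical_or(m, a).sum()
        inter = np.logical_and(m, a).sum()
        scores.append(inter / union if union else np.nan)
    return np.array(scores)

# Toy example: 0 = background, 1 = circular anomaly, 2 = linear anomaly.
manual = np.array([[0, 0, 1, 1],
                   [0, 2, 2, 1],
                   [0, 0, 0, 0]])
algo = np.array([[0, 0, 1, 0],
                 [0, 2, 0, 1],
                 [0, 0, 0, 0]])
scores = iou_per_class(manual, algo, n_classes=3)
mean_iou = np.nanmean(scores)
print(scores, mean_iou)
```

In the toy example the background dominates the image and obtains the highest score, which shows how a mean IoU can mask a poor result for a sparse class.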

To improve the detection of the linear anomalies, in a second experiment we included examples of very weak ditches in the training data (similar to the ones near nos. 1–2 in Figure 8a). This only marginally improved the IoU of the linear anomalies (9.4%; see Table 2 and Figure 8c). This limited improvement is partially due to the fact that some weak linear anomalies are data acquisition artefacts (e.g., nos. 3–4 in Figure 8a), which are not retained as relevant in the manual interpretation. Moreover, the IoU of the circular anomalies deteriorated significantly, as the selection of weak anomalies during training resulted in many circular false positives. In this case, the removal of the false positives would be more time-consuming than manually delineating a limited number of ditch structures not detected by the algorithm. However, some of these false positives may represent real, small-sized pit structures. The manual annotation in Figure 8d cannot be considered an objective ground truth, which makes the quantitative assessment of the RF classification difficult. Additional information (e.g., from excavations) could reduce this uncertainty.

4.2. Interamna Lirenas

The Roman town Interamna Lirenas (Lazio, Italy) was founded in the late fourth century BCE and was occupied until the sixth century CE. It extends over ~25 ha, on both sides of the NW-SE running Via Latina. A number of streets running perpendicularly to this main axis define the insulae of the town. Since 2010 the Faculty of Classics of the University of Cambridge has run a fieldwork project, including field walking, geophysical prospection, and excavations. After a fluxgate gradiometer survey of the complete urban area had been conducted in 2010–2012 by the British School at Rome, it was entirely mapped through a GPR survey in 2015–2017 by Ghent University. The surveys demonstrated a large proportion of small domestic units, indicating a high population density. The occupation reached its maximum by the first century BCE. From the third century CE, the urban area and the population decreased, but the town existed into the fifth century. This is explained by its location along the Liri river, between inland cities to the north and the coast to the south.

The GPR survey was carried out using a Sensors & Software Spidar system comprising fifteen 500 MHz antennae, with a sample density of 0.05 m × 0.05 m [62]. Basic processing included dewow, airwave alignment, time zero correction, amplitude scaling, frequency filtering, and background removal. The data were migrated, the wave velocity was estimated by means of a migration velocity analysis, and a time-to-depth conversion was performed. The geophysical prospection provided evidence for the original planned layout (forum, street network, plot types of different sizes), and public buildings, such as the theatre, the basilica (both constructed in the second half of the first century BCE), and three bathhouses. Furthermore, it revealed shops and houses of varying dimensions. The town’s role as a trade hub explains the many courtyard buildings, which may have fulfilled residential but also commercial functions (horrea or warehouses) [11].

An area of 2.78 ha in the central part of Interamna Lirenas, to the south of the Via Latina (Figure 10a), was analysed with the techniques presented in this paper. This area includes strong GPR reflections: a bath complex (Figure 10a, no. 1), a large house with a portico (no. 2), a possible macellum (no. 3), and a small temple (no. 4), as well as less clear anomalies further away from the main road interpreted as houses (no. 5). Because the oscillatory nature of the GPR traces with zero crossings between positive and negative amplitudes would complicate the segmentation of continuous anomalies in the vertical direction [63], the envelope was calculated, and a data volume of 200 m × 150 m × 1.2 m with a spatial resolution of 0.05 m × 0.05 m × 0.01 m and spanning a depth of 0.25–1.45 m was imported in ilastik and napari as a series of TIFF files.

The segmentation and classification were performed in 3D, according to the description in Section 3. To train the RF voxel classifier, 35 annotations were made in 7 of the 120 depth slices, representing ‘background’ and ‘archaeological structures’ classes. This took approximately 45 min for the complete 3D dataset (Table 1). Hysteresis thresholding was applied to the probability volume of the ‘archaeological structures’ class, with a low threshold of 50% and a high threshold of 80%. Segmented voxels, connected horizontally, vertically, or diagonally in 3D, were merged into individual objects (Figure 4b).
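The two steps described above (hysteresis thresholding of the probability volume, followed by the grouping of connected voxels into objects) can be sketched with `scipy.ndimage`. This is a simplified stand-in, not the notebook implementation: scikit-image provides `skimage.filters.apply_hysteresis_threshold` for the thresholding step, and using full (26-neighbour) connectivity in both steps is an assumption of this example. The function name and the toy volume are hypothetical.

```python
import numpy as np
from scipy import ndimage as ndi

def hysteresis_objects(prob, low=0.5, high=0.8):
    """Label regions above `low` that contain at least one voxel above `high`."""
    # Full connectivity: voxels touching by face, edge, or corner are connected.
    full = np.ones((3,) * prob.ndim, dtype=bool)
    candidates, n = ndi.label(prob > low, structure=full)
    if n == 0:
        return candidates, 0
    # Highest probability inside each candidate region.
    peak = np.asarray(ndi.maximum(prob, labels=candidates, index=np.arange(1, n + 1)))
    keep = np.zeros(n + 1, dtype=bool)
    keep[1:] = peak > high
    # Relabel the retained regions so that object ids run from 1 to n_objects.
    return ndi.label(keep[candidates], structure=full)

# Toy probability volume (depth, rows, columns).
prob = np.zeros((3, 5, 5))
prob[1, 1, 1] = 0.9   # confident voxel
prob[1, 1, 2] = 0.6   # weaker neighbour, kept because it touches the seed
prob[2, 2, 3] = 0.6   # diagonal neighbour in the next slice, also kept
prob[0, 4, 4] = 0.7   # isolated weak region, discarded
labels, n_objects = hysteresis_objects(prob)
print(n_objects, int((labels > 0).sum()))
```

With the low threshold alone, the isolated weak region would be selected as well; the high threshold removes it while keeping the weak voxels attached to a confident detection.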

Because these were often large and included structures belonging to different object classes (wall foundations, floors), the watershed segmentation was applied by means of a footprint of 20 × 20 × 20 voxels, resulting in more homogeneous objects (Figure 4c). The results show that the proposed method can select all clear archaeological structures. In Figure 4b, weak anomalies such as the ones near nos. 1–2 are detectable by a human interpreter but were not selected by the algorithm. Diffuse reflective areas 3–5, underlying clear floor structures in shallower time slices, are also difficult to delineate for a human interpreter, whereas anomalies 6–7 may be produced by irregularities in the surface of the prospection area.

We performed a more detailed object classification of the ‘archaeological structures’ class. A first analysis of the time slices showed that most of the objects were wall foundations and floors. We decided to let the RF classifier distinguish between those two object classes only and to manually reclassify objects to be interpreted as streets, columns, drains, etc., as tests demonstrated that letting the RF distinguish minority classes resulted in many wrong classifications.

The success of the object classification depends on the selection of object features which can discriminate between the different classes. The features included in scikit-image measure the characteristics of the objects themselves (for example their shape), but it may also be helpful to look at the relationships between objects, for instance, by examining where they are located in relation to each other. Given the fact that Interamna Lirenas is a planned town, with the Via Latina as a central axis and streets running perpendicularly to it, we designed features that use this directionality, to distinguish between objects belonging to linear structures oriented according to the orthogonal grid (mainly walls) and non-linear objects such as floors. These features were calculated as follows. First, the two dominant (perpendicular) orientations of the linear structures were found using the Hough transform [64]. Secondly, rectangular structuring elements (SEs) were defined with a size of 60 × 5 pixels (3 m × 0.25 m) and orientations equal to the ones obtained in the first step. Thirdly, two methods were tested to determine the extent to which each pixel belongs to a structure following the direction of the orthogonal grid. (1) The rank-max opening (RMO) was applied to each GPR time slice [65,66,67], using a rank equal to 225 pixels or 75% of the SE (see the Supplementary Data, Section A.4, for a more detailed description). (2) Because of the long computation time of the RMO (4 h for the dataset from Interamna Lirenas), we also calculated the 2D normalised cross-correlation [17,68] between each time slice and the two abovementioned SEs. This much faster approach (computation time: 8 min) produced classification results similar to those of the RMO. For both methods, from the resulting ‘directionality’ volume, two 3D object features were derived, calculating the mean and the maximum directionality of the voxels corresponding to each object (Figure 11). 
These were used to classify the objects, together with the ‘roundness’ feature, which we derived as the ratio of the ‘equivalent_diameter_area’ to the ‘axis_major_length’ object properties in scikit-image. The RF was trained by drawing ~150 annotations in seven slices (Figure 12a,b) in five iterations. This took approximately one hour. The results presented below (Figure 12 and Figure 13 and Table 3) are based on object features calculated using the RMO.
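How the two directionality features can be derived per object is sketched below, assuming that a ‘directionality’ volume (from either the RMO or the cross-correlation) and a label volume of the same shape are already available. The sketch uses `scipy.ndimage` and is not the notebook code; the function name and the toy arrays are hypothetical.

```python
import numpy as np
from scipy import ndimage as ndi

def directionality_features(directionality, labels):
    """Return object ids with the mean and maximum directionality of their voxels."""
    ids = np.arange(1, labels.max() + 1)
    mean_dir = np.asarray(ndi.mean(directionality, labels=labels, index=ids))
    max_dir = np.asarray(ndi.maximum(directionality, labels=labels, index=ids))
    return ids, mean_dir, max_dir

# Toy volume of shape (1, 2, 3): object 1 is weakly aligned with the grid,
# object 2 contains one strongly aligned voxel; 0 is background.
directionality = np.array([[[0.2, 0.4, 0.0],
                            [0.9, 0.1, 0.0]]])
labels = np.array([[[1, 1, 0],
                    [2, 2, 0]]])
ids, mean_dir, max_dir = directionality_features(directionality, labels)
print(ids, mean_dir, max_dir)
```

The resulting columns can be placed next to shape descriptors such as the roundness ratio in the feature table passed to the object classifier.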

To assess the accuracy of the proposed method, we compared the RF object classification shown in Figure 12b (0.25 ha) with an existing manual interpretation of the complete Roman town by an expert (A. Launaro). To enable the calculation of the IoU metric in scikit-learn, the manual interpretation shapefiles [69] were converted to raster labels. The mean IoU of the three classes (background, wall foundations, and floors) was 56.6%. As at Boviolles, this value hides large differences between classes (Table 3): the IoU of the ‘walls’ class is 45.1%, and for the ‘floors’ class it is 32.9%. In Figure 12b, the arrows indicate wrongly classified objects, e.g., walls interpreted by the algorithm as floors. Manually correcting segmentation errors (Figure 12c) slightly raised the mean IoU (60.8%) and the IoU of the individual classes (50.4% and 39.8% for the ‘walls’ and ‘floors’ classes, respectively; see Table 3). Remarkably, these scores are not much lower than the result of the comparison between two independent, fully manual interpretations (by A. Launaro and L. Verdonck), which yielded a mean IoU of 61.8%. For the manual mapping, a depth slice is shown in Figure 12d.

Figure 10b illustrates the result after applying the RF segmentation and manual object reclassification to the complete 2.78 ha area (manual pixel annotation of regions undetected by the RF pixel classifier was only conducted in the area indicated by the red square in Figure 10a). For its inclusion in a GIS, the 3D segmentation was projected onto a 2D map, which was vectorized and georeferenced (see Section 3.4). Each grid point in the map was assigned a class label other than ‘background’ if that class occurred at that position in at least one depth slice of the 3D volume. When different classes had been assigned to a grid point at different depths, the most frequently occurring class was selected.
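The projection rule described above (a grid point receives a non-background label if that class occurs in at least one depth slice, and the most frequent class wins when several occur) can be written compactly with NumPy. The sketch below is an illustration under stated assumptions, not the notebook code: the volume is taken to be an integer label array ordered as (depth, rows, columns), and ties are resolved in favour of the lowest class index, a choice the text does not specify.

```python
import numpy as np

def project_labels(volume, n_classes, background=0):
    """Project a 3D label volume onto a 2D map by majority vote over depth."""
    volume = np.asarray(volume)
    # counts[c, i, j] = number of depth slices in which class c occurs at (i, j).
    counts = np.stack([(volume == c).sum(axis=0) for c in range(n_classes)])
    counts[background] = 0                      # background does not take part in the vote
    projected = counts.argmax(axis=0)           # most frequent non-background class
    projected[counts.sum(axis=0) == 0] = background
    return projected

# Toy volume with 3 depth slices and a 1 x 3 map; 1 = wall, 2 = floor.
volume = np.array([[[0, 0, 1]],
                   [[0, 1, 2]],
                   [[0, 0, 2]]])
projected = project_labels(volume, n_classes=3)
print(projected)
```

In the last column, a wall occurs in one slice and a floor in two, so the grid point is labelled as floor; in the middle column, a single occurrence of a wall is sufficient to override the background.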

Three-dimensional representations are often more difficult to interpret than two-dimensional plan views. To make the visualisation more easily understandable, the 3D object shapes were simplified by applying morphological operations. To the ‘walls’ class, a 3D morphological closing followed by a 3D morphological opening was applied, and the surfaces of floors and streets were low-pass filtered to give them a more even appearance. The visualisation in the napari viewer is shown in Figure 13.
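The simplification of the ‘walls’ class can be illustrated with binary morphology in `scipy.ndimage`. This is a minimal sketch: the cubic 3 × 3 × 3 structuring element and the toy volume are assumptions, as the size and shape of the structuring element used for Figure 13 are not stated here.

```python
import numpy as np
from scipy import ndimage as ndi

def simplify_mask(mask, size=3):
    """Apply a 3D binary closing followed by a 3D binary opening."""
    se = np.ones((size,) * mask.ndim, dtype=bool)      # cubic structuring element (assumed)
    closed = ndi.binary_closing(mask, structure=se)    # fills small holes and gaps
    return ndi.binary_opening(closed, structure=se)    # removes small isolated specks

# Toy volume: a 5 x 5 x 5 block with a one-voxel hole, plus an isolated voxel.
walls = np.zeros((12, 12, 12), dtype=bool)
walls[3:8, 3:8, 3:8] = True
walls[5, 5, 5] = False        # hole inside the block
walls[10, 10, 10] = True      # speck away from the block
simplified = simplify_mask(walls)
print(int(walls.sum()), int(simplified.sum()))
```

Objects touching the border of the volume can be eroded by the closing, because `scipy.ndimage` treats voxels outside the array as empty by default; padding the volume beforehand avoids this.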

4.3. Vieil-Évreux

The Gallo-Roman site Gisacum (Vieil-Évreux, Normandy, France) was founded probably at the end of the first century BCE or the beginning of the first century CE. Buildings from the first century CE, centred around a temple, were demolished when the site became one of the largest sanctuaries of Gaul in the beginning of the second century CE. The residential areas formed an irregular polygon, which had a perimeter of 5.6 km and enclosed an area of 230 ha. In the centre of the polygon, the monumental sanctuary and other monumental buildings, such as a theatre, a bath complex, the forum, and two fana (temples), were located, as well as a ~2 km long aqueduct. In ~200 CE, the sanctuary was rebuilt so that it consisted of three temples and extended over at least 6 ha. From the mid-third century onwards, the city experienced a decline, and the sanctuary was demolished in the fourth century [70]. At the site, electrical resistance measurements, magnetometer prospection and GPR survey were conducted (e.g., Ref. [71]). Prospections with the Automatic Resistivity Profiling (ARP) system [72] were undertaken over the western fanum and in a field to the west of it. This field of 11 ha, surveyed by Terra NovA in 2005, is situated over part of the residential area forming one side of the polygon. Figure 14a shows the measurements conducted with channel 2 of the ARP system (where the electrodes form a square array with a side length of ~1 m). A reading was taken every ~0.2 m along lines ~1 m apart. The data were converted to apparent resistivity, filtered by calculating the median of five successive measurements along the lines to eliminate outliers, and interpolated onto a regular 0.25 m × 0.25 m grid.
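The outlier filter mentioned at the end of this paragraph (the median of five successive measurements along a line) corresponds to a running median. A minimal sketch with `scipy.ndimage.median_filter` is given below; the resistivity values are invented for the example, and the handling of the line ends (repeating the first and last reading) is an assumption.

```python
import numpy as np
from scipy import ndimage as ndi

# Hypothetical apparent resistivity readings (ohm m) along one line, with one outlier.
line = np.array([100.0, 101.0, 99.0, 500.0, 100.0, 102.0, 101.0])

# Running median over five successive readings; 'nearest' repeats the end values.
filtered = ndi.median_filter(line, size=5, mode="nearest")
print(filtered)
```

A single spike never reaches the middle of a sorted window of five readings, so it is removed, whereas an anomaly extending over three or more readings is largely preserved.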

The data from Vieil-Évreux show clear anomalies caused by buried remains of houses and streets but also strong variations in background resistivity, which complicated the pixel classification in ilastik. Marking the edges of archaeological structures with labels representing the ‘archaeology’ and ‘background’ classes (Figure 15a) resulted in the selection of numerous anomalies caused by soil variations, giving the image a noisy appearance (Figure 15b). Suppressing these regions by labelling a few as ‘background’ (Figure 15a, arrows) led to a loss of archaeological information (see the arrows in Figure 15c). The best result was obtained after applying hysteresis thresholding and a size threshold to the results in Figure 15b, as illustrated in Figure 14b. The lower spatial resolution of the geophysical data causes edges to be blurred, leading to irregular object shapes. It also results in objects with uncertain classifications or encompassing different classes (e.g., ‘wall’ and ‘floor’ or ‘street’ and ‘floor’). Since it was difficult to select object features effectively discriminating between classes (e.g., between ‘wall’ and ‘soil variation’), the object classification was carried out manually, by first categorising all objects in the ‘walls’ class and then selecting objects belonging to other classes. In total, training the RF classifier and performing the manual object classification took less than three hours.

5. Discussion

A few observations can be made with regard to the results presented in the previous section. The first one relates to the accuracy of the proposed method. When visually inspecting the maps produced by the machine learning classifier, important segmentation errors can be seen (Figure 3d, Figure 8b and Figure 12b). They mainly belong to two categories: (1) structures weakly contrasting with the background, which are visible to a human interpreter but cannot be detected by the RF pixel classifier because labelling them as relevant structures during training would produce many false positives elsewhere in the map, and (2) wrongly classified objects and objects extending over structures with different semantic classes, a problem which in 3D datasets can also occur in the vertical direction. To create more homogeneous objects, watershed segmentation can be performed as described in Section 3.2, but this operation is never perfect. Manual corrections can visually improve the segmentation, as is shown in Figure 5 and Figure 12c. This contrasts with the relatively low scores resulting from the quantitative evaluation, where the mean IoU remains around 60% even after the manual intervention (Table 3). Here, it is important to observe that in our experiments the IoU was barely higher when comparing two manual interpretations by different human experts (see Section 4.2). This is in line with the fact that, in addition to the abovementioned segmentation errors by the algorithm, some other differences between the RF and human interpretation cannot be considered erroneous but illustrate the uncertainty about the nature and extent of archaeological structures that is inherent to geophysical data interpretation. For example, at Interamna Lirenas, shallow furrows caused slight variations in the physical contact between the GPR antennae and the soil, which are visible as artefacts in the deeper slices. 
At some locations the orientation of the furrows is the same as the archaeological structures (Figure 4b, nos. 6–7), which makes them hardly distinguishable. Another example is the delineation of weakly reflecting floor surfaces in GPR data, which often fade into the background without clear borders. In these cases, machine learning algorithms propose one of several plausible segmentations [73]. Metrics such as the IoU are appropriate for evaluating the semantic segmentation of images for which relatively objective ground truth segmentation masks are available, for example natural or urban scenes [74,75]. However, they are less suitable as evaluation measures in complex archaeological contexts. The uncertainty can be reduced by combining different non-invasive data sources if they reveal the same buried structures or by evidence from excavations, although for large, densely occupied sites, usually no more than a tiny fraction can be investigated with invasive means, and also the interpretation of excavation results is subjective [76].

Table 1 shows the time that was required to process the datasets in Section 4 using the methods presented in this paper. The time needed for interactively training the pixel classifier, applying hysteresis thresholding, extracting objects, and training the object classifier was 75 min for Boviolles and 129 min for Interamna Lirenas, consisting of 120 GPR depth slices, each with approximately the same number of pixels as the Boviolles magnetometer plot. This illustrates how shallow machine learning can efficiently generate segmentations of dense volumes based on sparse training labels in only a few 2D slices. After the interactive training phase, when further iterations no longer result in an improved RF segmentation, manual user intervention can further improve algorithm results (Table 1). Whereas objects can easily be reclassified by drawing a small label on the wrongly classified object, manual pixel annotations to select regions undetected by the algorithm (Section 3.3) are time-consuming if a detailed delineation is aimed for. However, this should be contrasted with the time needed to run a fully manual analysis. Our experiments indicate that shallow machine learning can halve the time required for entirely manual interpretations; the 3D GPR dataset from Interamna Lirenas was analysed approximately three times faster. Because it is difficult to compare the quality of machine learning-based and entirely manual interpretation, as is described in the previous paragraph, these time savings are given as an indication, based on our experiments, and should not be seen as objective measurements.

Our method is raster-based. At the end of the workflow, the segmentation in the raster format can be converted into shapefiles for further analysis in a GIS. This is different from most existing approaches in archaeological geophysics, which first vectorize the image and analyse the resulting polygons. Nowadays, for most vector processing tools, raster-based equivalents exist. For example, operations such as buffering, merging, or removing holes have raster-based counterparts in mathematical morphology (e.g., dilation, erosion, and combinations of these). Calculating properties of polygons in GIS software finds its parallel in the measurement of region properties in raster images, in libraries such as scikit-image. Compared to vector processing, the computation speed of raster-based algorithms is usually lower and the random access memory (RAM) consumption is higher. Table 4 lists the computation times and RAM usage for the most important steps in our processing scheme, together with the specifications of the computer on which they were measured.

In ilastik, building the RF and predicting the probabilities for the complete dataset at once leads to long run times and high memory consumption, especially in 3D. Limiting the displayed portion of the data to 1500 × 1000 pixels strongly reduces the run time and RAM usage. Assuming that training the pixel and object classifiers takes five iterations, the total computation time when segmenting the Boviolles data (2D) was approximately eight minutes, and the RAM usage always remained below 5 GB. For the 3D dataset from Interamna Lirenas, the computation took 40 min (with the calculation of the directionality object feature based on template matching and not on the rank-max opening, whose implementation needs further optimisation; see Section 4.2). The memory usage did not exceed 40 GB, except when the watershed segmentation was performed.

Datasets larger than the memory can be manually cut up into smaller parts. The classifier can be trained interactively on a few parts and applied to the other parts with little further user input required (e.g., using the batch processing function in ilastik). In addition, Python libraries for parallel computing such as dask allow for the automatic handling of arrays larger than the memory [77]. In this way, it was possible to perform the watershed segmentation (taking up 90 GB of RAM when based on the entire array of 3000 × 4000 × 120 voxels) on a machine with 16 GB of RAM, by dividing the data into blocks of 500 × 500 × 20 voxels, without increasing the computation time. Figure 16 shows the result, which is similar to the one created with the entire datacube (Figure 4c), apart from some edge effects at the boundaries between the blocks (see the arrows in Figure 16a). However, complex processing steps are less straightforward when splitting the dataset by means of dask. For example, carrying out the RF object classification of the individual blocks may produce a different result from the classification of the entire cube if no representative training set is available for each block. Moreover, in ilastik, whose C++ implementation of the 3D feature calculation is many times faster than equivalent algorithms in Python, there is no functionality that allows us to automatically cut up large arrays. Therefore, the classification of a 3D dataset comparable in size to the one from Interamna Lirenas requires at least ~30 GB of RAM. Segmentation tools such as ilastik, TWS, and LABKIT provide a command line option as an alternative to the GUI, which opens the possibility to conduct operations on a high-performance computing cluster (see, e.g., [29,30,41]). To maintain maximum interactivity, training happens by means of the GUI, and the pretrained classifier is then used for further processing.
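The principle of block-wise processing can be shown with plain NumPy. The sketch below only illustrates what a library such as dask automates (including lazy loading and parallel execution); it is not the code used for Figure 16, and the helper names `block_slices` and `map_blocks` are hypothetical. Blocks are processed without overlap here, which is why operations that look at a neighbourhood, such as the watershed, show edge effects at block boundaries.

```python
import itertools
import math
import numpy as np

def block_slices(shape, block_shape):
    """Yield tuples of slices that tile an array of `shape` with blocks."""
    ranges = [range(0, n, b) for n, b in zip(shape, block_shape)]
    for starts in itertools.product(*ranges):
        yield tuple(slice(s, min(s + b, n))
                    for s, b, n in zip(starts, block_shape, shape))

def map_blocks(func, array, block_shape):
    """Apply `func` to each block separately and reassemble the result."""
    out = np.empty_like(array)
    for sl in block_slices(array.shape, block_shape):
        out[sl] = func(array[sl])
    return out

# Block count for the volume and block size mentioned in the text.
shape, block = (3000, 4000, 120), (500, 500, 20)
n_blocks = math.prod(math.ceil(n / b) for n, b in zip(shape, block))
n_voxels = math.prod(shape)
print(n_blocks, n_voxels)

# Small demonstration with an operation that acts on each voxel independently.
small = np.arange(24).reshape(2, 3, 4)
doubled = map_blocks(lambda b: b * 2, small, (1, 2, 3))
```

Only one block of 500 × 500 × 20 voxels (5 million voxels) needs to be held in memory at a time, compared with 1.44 billion voxels for the full volume.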

In our view, the limitations of the raster approach are outweighed by the possibility to harness the power of machine learning tools for semantic image segmentation. In addition, when analysing 3D data, the presented methods can be applied directly to the data volume, without analysing 2D images separately. In this way, the geophysical anomaly can be segmented in its full 3D geometry, classified on the basis of its 3D properties, and post-processed in 3D, if necessary.

The techniques proposed in this paper do not replace, but should be used in combination with, existing geophysical data analysis. In Section 3.3, it was described how the size of magnetometer anomalies detected by the RF can be corrected so that they more closely represent the edges of the buried objects. Another example is the calculation of attributes used in seismic and GPR data analysis, such as the coherence, similarity, symmetry, or texture-based attributes [78,79]. Some attributes share similarities with the standard features used for the pixel classification in ilastik, TWS, or LABKIT. This is the case, for example, for attributes based on the eigenvalues of the structure tensor [80,81]. If we adopt the definition of ‘attribute’ formulated by Chopra and Marfurt [82] (‘any measure of seismic data that helps us visually enhance or quantify features of interpretation interest’), the standard features in the software packages discussed in this paper can be considered as attributes. However, there is a large potential in using more complex attributes from the seismic and GPR literature as pixel or object features, to explore how these can improve the quality of the segmentation.

Below this is illustrated with an example from the Interamna Lirenas dataset. For the horizontal slice in Figure 17a, two attributes were calculated. (1) Similarity is a measure of coherency between neighbouring traces, within a predefined time window [83]. It is high when traces are similar and low when there is an edge. After calculating the instantaneous amplitude of the data [57], we determined the similarity, using a time window of 5 ns (Figure 17b). (2) Phase symmetry is an attribute based on the analysis of local phase information, obtained by convolving the data with wavelets based on Log-Gabor functions. In this way, areas of symmetry (such as wall structures) can be detected. For each pixel in a 2D image (e.g., a GPR depth slice), this occurs for different wavelet scales and different orientations, and the average of the results is taken. We employed the algorithm developed by Kovesi [84,85] using only the two perpendicular orientations of the street grid and most wall structures at Interamna Lirenas (Figure 17c). In Figure 18, the pixel classification using the standard features in ilastik is compared to the classification using the similarity, phase symmetry, and the cross-correlation of the GPR slice and rectangular templates in the same two perpendicular orientations (as described in Section 4.2). Whereas, in general, the two images appear similar, the continuity of the linear structures is improved by the use of GPR attributes, as is shown by the arrows in Figure 18b.

6. Conclusions

This paper stems from the observation that there is an increasingly recognised need for computer-aided tools when mapping large geophysical datasets. Often, fully manual interpretation has become an almost insurmountable hurdle, so that geophysical measurements are not effectively transformed into information contributing to the reconstruction of the past. The proposed workflow aims to facilitate this interpretation process. It is uniform for 2D and 3D data and is based on a close interaction between the user and the algorithm. The two phases in the analysis (the delineation of ‘archaeological traces’ against the ‘background’ and the subsequent classification of objects into more specific classes such as pits, ditches, walls, floors, geological variations, etc.) each rely on an RF classifier, which enables predictions nearly in real time. This RF classification is iteratively improved by adding annotations until no visible progress is made. The interactive nature of shallow machine learning differs from DCNNs, which—once trained—can automatically generate a segmentation.

Significant manual input is necessary to correct RF classifications, due to the complex nature of geophysical data (low signal-to-noise ratio, low resolution, and low predictability). Nevertheless, our tests show that compared to the fully manual interpretation, significant time savings are possible, except for low-resolution images where pixel classification is difficult or for datasets where few discriminating object features can be identified, as was the case at Vieil-Évreux. It is to be expected that powerful deep learning methods will be used more frequently in archaeology in the future. They could eventually make shallow machine learning approaches redundant, although several conditions must be met (training sets need to become comprehensive enough, and training data must be standardised when resulting from a collaborative effort between different teams). Deep learning methods will likely reduce the manual input required to correct the computer-generated classification. Nevertheless, for detailed interpretations human intervention will always remain necessary as it appears difficult to teach all possible cases to an algorithm.

Geophysical data are often ambiguous. Even careful data acquisition and processing cannot eliminate the uncertainty that accompanies interpretation. This has implications for the evaluation and comparison of computer-aided interpretation techniques. Measuring their performance by means of quantitative metrics designed for evaluating the segmentation of images with less ambiguous content, such as people, animals, or vehicles, is less meaningful since no single, objective ground truth exists. This observation is not necessarily problematic if we see machine learning as a helpful tool within the process of archaeological interpretation, which is inevitably to some extent subjective, rather than as a solution that should ultimately replace human interpretation and make it fully objective.

If quantitative metrics such as the IoU are less appropriate, the question arises what criteria can be defined to assess the quality of a semantic segmentation and how standardisation, transparency, and reproducibility can improve the quality of geophysical data interpretation, whether carried out manually or with the aid of machine learning algorithms. To obtain a high-quality segmentation, it is important to take the geophysical characteristics of the input data into account as fully as possible. Examples are the determination of the edge of magnetometer anomalies and the calculation of GPR attributes, which can be used as pixel features. The choice of efficient object features also affects the result. Here, it is important to strike the right balance between customisation for specific (categories of) archaeological contexts and wide applicability within machine learning workflows. These topics cannot be covered in full within the scope of this paper. Nevertheless, we hope that the methods proposed provide a useful tool for facilitating semantic segmentation and demonstrate that discussions about the merits of automated interpretation methods cannot be dissociated from the fundamental question of how we can distil archaeological knowledge from geophysical measurements in the best possible way.

Supplementary Materials

Supporting information can be downloaded at https://www.mdpi.com/article/10.3390/rs17173092/s1: this includes further details on concepts discussed in this paper (textual description, Figures S1–S4, Table S1) and a description of the software used for the semantic segmentation in this study, with reference to a GitHub repository.

Author Contributions

Conceptualization, L.V.; methodology, L.V.; software, L.V.; validation, L.V.; formal analysis, L.V.; investigation, L.V.; resources, L.V. and M.D.; data curation, L.V.; writing—original draft preparation, L.V.; writing—review and editing, M.D. and M.B.; visualisation, L.V.; supervision, M.D. and M.B.; project administration, M.D.; funding acquisition, L.V. and M.D. All authors have read and agreed to the published version of the manuscript.

Funding

The work of Lieven Verdonck was funded by a Marie Skłodowska-Curie individual fellowship as part of the EU-funded project ‘Paris Region Fellowship Programme’ (Horizon 2020-FP7 COFUND), grant agreement no. 945298.

Data Availability Statement

The three test datasets analysed in this paper (Boviolles, Interamna Lirenas, and Vieil-Évreux), as well as the interpretations presented and detailed instructions on how to execute the workflow, are publicly available in the Zenodo repository at https://zenodo.org/records/15422270, accessed on 28 August 2025. The software developed to analyse the data is available at https://github.com/lrverdon/Shallow-Machine-Learning-for-Archaeological-Geophysics, accessed on 28 August 2025.

Acknowledgments

We would like to thank Geocarta (Paris), who collected and processed the data from Boviolles and Vieil-Évreux, which were analysed in this paper. A. Launaro (University of Cambridge) is thanked for making the shapefiles containing the manual interpretation of the GPR depth slices from Interamna Lirenas publicly available. We are grateful to the anonymous reviewers, whose comments helped us to improve the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DCNN	Deep convolutional neural network
GPR	Ground-penetrating radar
IoU	Intersection over union
RF	Random forest

Figure 1. (a) Horizontal GPR slice at an estimated depth of 0.96–1.00 m, showing data from the Roman town Interamna Lirenas (Italy), presented in Section 4.2. (b) The same data, with a superimposed manual annotation of the archaeological structures. (c) Histograms plotting the amplitude of the GPR reflections caused by archaeological structures and background, based on the interpretation in (b). These demonstrate that a large amplitude range can belong to both classes.

Figure 2. Pixel segmentation in ilastik. (a) Part of the magnetometer data from Boviolles, France, presented in Section 4.1, with a few annotations representing the background (class 1) and the archaeological structures (class 2). (b) Map for each pixel showing the probability that it belongs to class 2. (c) Map showing the uncertainty of the prediction in (b). (d) Segmentation by assigning to each pixel the class label with the highest probability.
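The steps in this caption (sparse annotations, probability map, assignment of the most probable class) can be sketched in a few lines. This is a minimal illustration, not the ilastik implementation: the feature set (Gaussian smoothing and gradient magnitude at three scales), the function names, and the synthetic image are assumptions, and scikit-learn's random forest stands in for the one used inside ilastik.

```python
import numpy as np
from scipy import ndimage as ndi
from sklearn.ensemble import RandomForestClassifier


def pixel_features(image, sigmas=(1.0, 2.0, 4.0)):
    """Stack simple per-pixel features (illustrative; not ilastik's feature set)."""
    image = np.asarray(image, dtype=float)
    feats = [image]
    for sigma in sigmas:
        feats.append(ndi.gaussian_filter(image, sigma))
        feats.append(ndi.gaussian_gradient_magnitude(image, sigma))
    return np.stack(feats, axis=-1)


def predict_structure_probability(image, annotations, n_trees=100, seed=0):
    """annotations: 0 = unlabelled, 1 = background, 2 = archaeological structure."""
    features = pixel_features(image)
    labelled = annotations > 0
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(features[labelled], annotations[labelled])
    proba = forest.predict_proba(features.reshape(-1, features.shape[-1]))
    class2 = list(forest.classes_).index(2)  # column holding class 2
    return proba[:, class2].reshape(image.shape)


# Synthetic stand-in for a geophysical image: a bright band on a noisy background.
rng = np.random.default_rng(0)
image = 0.1 * rng.standard_normal((60, 60))
image[20:30, :] += 1.0

# A few annotated pixels per class, as in panel (a).
annotations = np.zeros(image.shape, dtype=int)
annotations[5, 5:15] = 1
annotations[50, 5:15] = 1
annotations[25, 5:15] = 2

probability = predict_structure_probability(image, annotations)  # panel (b)
segmentation = np.where(probability >= 0.5, 2, 1)                # panel (d), two classes
```

In an interactive session, the annotation array would be extended and the classifier refitted until the probability map no longer visibly improves.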

Figure 3. Creation and classification of objects from the data shown in Figure 2. (a) The two thresholds used for hysteresis thresholding of the ‘archaeological structures’ class (low = 50% probability, high = 75% probability). (b) Result of hysteresis thresholding. All individual objects are assigned a different colour. (c) The same as (b), with the objects in black and white and with the annotations used for the training of the object classifier in colour. (d) Result of the object classification by the RF.
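As an illustration of the hysteresis thresholding in panels (a,b), the sketch below uses scikit-image rather than ilastik. Pixels above the low threshold are kept only if they are connected to at least one pixel above the high threshold. The function name and the toy probability map are assumptions; the thresholds are the ones given in the caption.

```python
import numpy as np
from skimage.filters import apply_hysteresis_threshold
from skimage.measure import label


def objects_from_probability(probability, low=0.50, high=0.75):
    """Hysteresis-threshold a class probability map and label the resulting objects."""
    mask = apply_hysteresis_threshold(probability, low, high)
    return label(mask)  # 0 = background, 1..n = individual objects


# Toy probability map: the top row contains a pixel above the high threshold,
# the bottom-left pair only exceeds the low threshold and is therefore dropped.
probability = np.array([
    [0.1, 0.6, 0.8, 0.6, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1],
    [0.6, 0.6, 0.1, 0.1, 0.1],
])
objects = objects_from_probability(probability)
```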

Figure 4. (a) Horizontal slice at an estimated depth of 0.65–0.66 m, showing part of the 3D GPR dataset from Interamna Lirenas. (b) Objects created after interactive pixel segmentation and hysteresis thresholding. The numbers are explained in the text. (c) Result after watershed segmentation with a footprint of 20 × 20 × 20 voxels.
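A common way to split merged objects with a watershed is sketched below. It is an assumption that the workflow proceeds exactly like this: the sketch seeds the watershed at local maxima of the Euclidean distance transform, with the footprint controlling how far apart two seeds must be, and the function name is hypothetical. The code is dimension-agnostic, so a 20 × 20 × 20 footprint is used for a 3D array; the example is 2D to keep it small.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed


def split_touching_objects(binary, footprint_size=20):
    """Split merged objects by a marker-based watershed on the distance transform."""
    binary = np.asarray(binary, dtype=bool)
    distance = ndi.distance_transform_edt(binary)
    footprint = np.ones((footprint_size,) * binary.ndim, dtype=bool)
    peaks = peak_local_max(distance, footprint=footprint)  # one seed per local maximum
    markers = np.zeros(binary.shape, dtype=np.int32)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    return watershed(-distance, markers, mask=binary)


# Two overlapping discs that form a single connected object before the watershed.
yy, xx = np.mgrid[:80, :80]
binary = ((yy - 28) ** 2 + (xx - 28) ** 2 < 16 ** 2) | ((yy - 44) ** 2 + (xx - 52) ** 2 < 20 ** 2)
labels = split_touching_objects(binary)
```

A larger footprint suppresses seeds that lie close together and therefore splits less aggressively.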

Figure 5. Post-processing following object classification. (a) The same data as in Figure 3d, represented in grey scale, with a few annotations (in colour) to manually reclassify existing objects. (b) Result of the manual object reclassification. (c) Correction of the object boundaries, so that they better correspond to the real dimensions of the buried structures. On the ‘linear anomalies’ class, morphological closing was performed with a disc-shaped structuring element with a radius of seven pixels. (d) The same image as (c), with manual corrections at the pixel level, mainly in the ‘ditches’ class (see the arrows).
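The boundary correction in panel (c) can be illustrated with a binary morphological closing. The sketch below uses SciPy and a disc with a radius of seven pixels, as in the caption. The function names are hypothetical, and the choice to let the closing claim only background pixels (so that other classes are never overwritten) is an assumption, not a documented property of the workflow.

```python
import numpy as np
from scipy import ndimage as ndi


def disk(radius):
    """Boolean disc-shaped structuring element."""
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    return (yy ** 2 + xx ** 2) <= radius ** 2


def close_class(class_map, class_id, radius=7, background=0):
    """Close gaps within one class; only background pixels are reassigned."""
    mask = class_map == class_id
    closed = ndi.binary_closing(mask, structure=disk(radius))
    out = class_map.copy()
    out[closed & (class_map == background)] = class_id
    return out


# A linear anomaly (class 3, hypothetical label) interrupted by a six-pixel gap.
class_map = np.zeros((40, 60), dtype=int)
class_map[19:22, 10:27] = 3
class_map[19:22, 33:51] = 3
closed_map = close_class(class_map, class_id=3)
```

Because pixels are only added, annotated structures near the image border are not eroded away by the closing.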

Figure 6. The oppidum of Boviolles (with its defensive structures in dark red). (a) Results of the magnetometer survey by Geocarta, from −5 nT (black) to +5 nT (white). 1: Location of the murus gallicus. 2: Long ditch. 3: Smaller ditches. 4: Parcel boundaries. 5: Area shown in Figures 2, 3 and 5. 6: Interruption in long ditch. 7: Area shown in Figure 7a–c. 8: Area shown in Figure 7d–f. 9: Area shown in Figure 8. Background: Google Satellite. (b) Magnetometer data with the superimposed RF segmentation and classification (before manual corrections).

Figure 7. (a,d) Two details of the complete dataset from Boviolles (for their location, see Figure 6, nos. 7 and 8, respectively), used to compare our method for detecting buried archaeological structures (b,e) with the application of a simple threshold, at 1.75 nT (c) and at 1.5 nT (f). The arrows in (b) indicate pit structures that were not detected in (c). The arrows in (f) denote small noisy areas not included in (e). In (c,f), only anomalies of more than 10 pixels were kept.
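The baseline in panels (c,f) amounts to a global threshold followed by a size filter. A minimal sketch, assuming 4-connectivity for defining an anomaly (the caption does not state the connectivity) and a hypothetical function name:

```python
import numpy as np
from scipy import ndimage as ndi


def threshold_anomalies(data_nt, threshold_nt=1.75, min_pixels=10):
    """Keep connected regions above the threshold that have more than min_pixels pixels."""
    mask = np.asarray(data_nt) > threshold_nt
    labels, n_objects = ndi.label(mask)  # default: 4-connectivity in 2D
    sizes = np.bincount(labels.ravel(), minlength=n_objects + 1)
    keep = sizes > min_pixels
    keep[0] = False  # label 0 is the background
    return keep[labels]


# Toy magnetometer grid (values in nT) with three anomalies of 12, 10 and 4 pixels.
data_nt = np.zeros((20, 20))
data_nt[2:5, 2:6] = 3.0      # 12 pixels: kept
data_nt[10:12, 2:7] = 3.0    # 10 pixels: removed (not more than 10)
data_nt[15:17, 15:17] = 3.0  # 4 pixels: removed
kept = threshold_anomalies(data_nt)
```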

Figure 8. (a) Part of the magnetometer dataset from Boviolles (for its location, see Figure 6, no. 9). The numbers are explained in the text. (b) Semantic segmentation generated by the RF classifier. (c) RF semantic segmentation after weak anomalies like the ones in nos. 1 and 2 in (a) have been included in the training data. (d) Manual interpretation. Small bipolar anomalies were considered as background.

Figure 9. Confusion matrix resulting from the quantitative comparison between the RF classification (Figure 8b) and the manual classification (Figure 8d). The calculation of the IoU metric is illustrated taking as an example the ‘circular anomaly’ class: the intersection between the manual and automated classification is indicated in red, the union in red and green.
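The calculation illustrated here can be written compactly: for each class, the IoU is the number of pixels assigned to that class in both classifications, divided by the number assigned to it in at least one of them. The following sketch derives it from the confusion matrix; the function names and the toy label maps are illustrative only.

```python
import numpy as np


def confusion_matrix(reference, predicted, n_classes):
    """Rows: reference (manual) classes; columns: predicted (RF) classes."""
    index = np.asarray(reference).ravel() * n_classes + np.asarray(predicted).ravel()
    return np.bincount(index, minlength=n_classes ** 2).reshape(n_classes, n_classes)


def iou_per_class(cm):
    """Intersection over union for each class; NaN if a class occurs in neither map."""
    intersection = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    return np.where(union > 0, intersection / np.maximum(union, 1), np.nan)


# Toy example with three classes (0 = background).
manual = np.array([[0, 0, 1],
                   [0, 1, 1],
                   [2, 2, 2]])
rf = np.array([[0, 0, 0],
               [0, 1, 1],
               [2, 2, 1]])
cm = confusion_matrix(manual, rf, n_classes=3)
iou = iou_per_class(cm)    # per-class IoU
miou = np.nanmean(iou)     # mean IoU over the classes

# Averaging the three per-class scores of Table 2 (first column) in the same way:
table2_miou = round(float(np.mean([98.9, 62.5, 6.5])), 1)
```

The last line reproduces the mIoU of 56.0% reported in Table 2 from its three per-class scores.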

Figure 10. (a) GPR depth slice from Interamna Lirenas, from a depth between 0.40 and 0.70 m, created by taking the maximum amplitude in the depth interval for each pixel. The red square shows the location of the area in Figures 1, 4, 11, 12 and 16. The numbers are explained in the text. (b) Interpretation based on RF classification and manual correction, encompassing a depth of 0.25–1.45 m, after georeferencing and importing the shapefiles into QGIS. A number of anomalies classified as artefacts caused by terrain unevenness, e.g., nos. 6 and 7 in (a), are not shown in (b).
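A thick depth slice of the kind shown in (a) can be obtained with a single reduction along the depth axis. The sketch below assumes that the data cube is ordered as (depth, y, x), that a vector of depths per slice is available, and that "amplitude" refers to the absolute value of the samples; these are assumptions, as is the function name.

```python
import numpy as np


def max_amplitude_slice(volume, depths_m, z_min=0.40, z_max=0.70):
    """Maximum absolute amplitude per pixel within the depth interval [z_min, z_max]."""
    depths_m = np.asarray(depths_m, dtype=float)
    in_range = (depths_m >= z_min) & (depths_m <= z_max)
    if not in_range.any():
        raise ValueError("no slices fall inside the requested depth interval")
    return np.abs(np.asarray(volume)[in_range]).max(axis=0)


# Toy cube with six depth levels of 2 x 2 pixels each.
depths_m = [0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
volume = np.zeros((6, 2, 2))
volume[0, 0, 0] = 9.0   # above the interval: ignored
volume[2, 0, 0] = -3.0  # inside the interval
volume[4, 0, 0] = 2.0   # inside the interval
volume[3, 1, 1] = 1.0   # inside the interval
volume[5, 1, 1] = 7.0   # below the interval: ignored
thick_slice = max_amplitude_slice(volume, depths_m)
```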

Figure 11. Object features calculated from the Interamna Lirenas GPR data, at an estimated depth of 0.65–0.66 m, using rectangular structuring elements (SEs) with orientations of N43.5°W and N46.5°E. (a) By applying the rank-max opening. From the resulting directionality volume (indicating to what extent a pixel belongs to a structure following these two orientations; see the Supplementary Data, Figure S4), the object feature shown here calculates the maximum directionality for each object. (b) By calculating the cross-correlation between the GPR data and the two SEs. The ‘maximum directionality’ object feature is derived in the same way as in (a).

Figure 12. (a) Examples of annotations to train the RF object classifier. The same data as in Figure 4c (slice at an estimated depth of 0.65–0.66 m), with the objects in grey scale and the annotations in colour. (b) Resulting RF classification. The arrows indicate examples of objects wrongly classified by the RF. (c) After manual reclassification of objects wrongly classified by the RF. (d) Entirely manual interpretation by L. Verdonck. To perform the quantitative comparisons described in the text, only three classes were used. To this end, in (c,d) the ‘columns’ and ‘streets’ classes were considered as ‘walls’ and ‘floors’, respectively.
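The object classification in (a,b) follows the usual supervised pattern: compute a feature vector per object, train on the annotated objects, and predict a class for all of them. The sketch below is a 2D illustration with scikit-learn and scikit-image. The three features (area, eccentricity, and the maximum of a per-pixel directionality map within each object), the function names, and the synthetic objects are assumptions, not the feature set of the published workflow.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.measure import regionprops_table
from sklearn.ensemble import RandomForestClassifier


def object_features(labels, directionality):
    """One feature row per labelled object: area, eccentricity, max directionality."""
    props = regionprops_table(labels, properties=("label", "area", "eccentricity"))
    ids = np.asarray(props["label"])
    max_dir = ndi.maximum(directionality, labels=labels, index=ids)
    return ids, np.column_stack([props["area"], props["eccentricity"], max_dir])


def classify_objects(labels, directionality, annotated, n_trees=100, seed=0):
    """annotated: dict mapping object id to class name for the training objects."""
    ids, features = object_features(labels, directionality)
    train = np.isin(ids, list(annotated))
    targets = [annotated[int(i)] for i in ids[train]]
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(features[train], targets)
    return dict(zip(ids.tolist(), forest.predict(features).tolist()))


# Synthetic objects: three elongated bars and three compact squares.
labels = np.zeros((60, 60), dtype=int)
labels[5:7, 5:35] = 1
labels[15:17, 10:40] = 2
labels[25:27, 5:30] = 3
labels[35:45, 5:15] = 4
labels[35:45, 25:35] = 5
labels[48:58, 40:50] = 6
directionality = np.where(np.isin(labels, [1, 2, 3]), 0.9, 0.2)

annotated = {1: "wall", 2: "wall", 4: "floor", 5: "floor"}
predicted = classify_objects(labels, directionality, annotated)
```

Objects that the classifier assigns to the wrong class would then be reclassified manually, as in panel (c).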

Figure 13. Three-dimensional visualisation of the segmented GPR data from Interamna Lirenas in the napari viewer. Vertical exaggeration: 5. The dimensions of the volume are 200 m × 150 m × 1.2 m.

Figure 14. Vieil-Évreux. (a) Result of the ARP measurements with channel 2 (i.e., with a distance between the electrodes of ~1 m). (b) Interpretation after RF pixel classification, hysteresis thresholding of the ‘archaeological structures’ class probability with upper threshold 85% and lower threshold 50%, application of a size threshold of five pixels, and manual object classification.

Figure 15. (a) Part of the resistivity data from Vieil-Évreux, with a few annotations representing the background (class 1) and the archaeological structures (class 2). (b) Resulting classification, after applying a simple threshold of 50% to the probability map of class 2. (c) Classification after adding the labels indicated with arrows in (a), resulting in some archaeological structures no longer included in class 2, see the arrows in (c).

Figure 16. (a) Watershed segmentation of the objects in Figure 4b, after the datacube was divided into small blocks to reduce RAM usage. The footprint was 20 × 20 × 20 voxels. The result is similar to the one in Figure 4c, apart from the edge effects near the arrows. (b) Classification of the objects in (a), using the same annotations as in Figure 12a. The result is similar to the classification in Figure 12b, which uses the entire datacube at once.

Figure 17. (a) Horizontal slice at an estimated depth of 0.50–0.51 m, showing part of the 3D GPR dataset from Interamna Lirenas. (b) Similarity attribute of the slice in (a). (c) Phase symmetry attribute of (a), using only the two perpendicular orientations of the walls and streets for the calculation.

Figure 18. (a) RF pixel classification of the data in Figure 17a, using the standard features included in ilastik (for the list, see the Supplementary Data, Section A.1). (b) Pixel classification using the similarity and phase symmetry attributes as features (Figure 17b,c), as well as a feature based on the 2D cross-correlation of the GPR slice and two rectangular templates along the main perpendicular orientations (see Section 4.2). The arrows show improved continuity of the linear structures.

Table 1. Time needed to complete the workflow described in Section 3, for the three case studies in Section 4. The timing of the manual pixel annotation and the fully manual analysis is based on the analysis of an area of 1000 × 1000 pixels; this duration was extrapolated to the full datasets.

Processing Step	Time Required: Boviolles, 17.9 Megapixel (4816 × 3712 Pixels)	Time Required: Interamna Lirenas, 1.4 Gigapixel (4000 × 3000 × 120 Pixels)	Time Required: Vieil-Évreux, 1.9 Megapixel (904 × 2160 Pixels)
Training RF pixel classifier (ilastik)	27 min	45 min	23 min
Setting parameters for hysteresis thresholding (ilastik)	15 min	22 min	10 min
Training RF object classifier (scikit-learn)	33 min	1 h 2 min
Manual corrections (object reclassification)	1 h 34 min	6 h 8 min	2 h 7 min
Manual corrections (pixel annotation)	~3 h	~20 h
Total	~6 h	~28 h	2 h 40 min
Fully manual analysis	~12 h	~80 h

Table 2. Quantitative comparison, using the IoU metric, between the segmentation generated by the RF classifier and a manual segmentation, on part of the magnetometer dataset from Boviolles (see Figure 8). The ‘background’ class also includes small bipolar anomalies.

Class	IoU of Manual vs. RF Classification (in %)
	Training Set Excluded Weak Linear Anomalies	Training Set Included Weak Linear Anomalies
Average of IoU scores for each class (mIoU)	56.0	46.5
Background	98.9	96.2
Circular anomalies	62.5	33.8
Linear anomalies	6.5	9.4
Circular + linear anomalies	60.9	34.3
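
For readers unfamiliar with the metric, the per-class intersection over union (IoU) and its mean (mIoU) can be computed from two label images as in the sketch below. The function names `iou_per_class` and `mean_iou`, the integer class codes, and the tiny label arrays are hypothetical and serve only to illustrate the definition; this is not the authors' evaluation code.

```python
import numpy as np

def iou_per_class(reference, prediction, class_ids):
    """Intersection over union for each class, given two label images."""
    reference = np.asarray(reference)
    prediction = np.asarray(prediction)
    scores = {}
    for c in class_ids:
        ref_c = reference == c
        pred_c = prediction == c
        intersection = np.logical_and(ref_c, pred_c).sum()
        union = np.logical_or(ref_c, pred_c).sum()
        # A class absent from both images has no defined IoU.
        scores[c] = float(intersection / union) if union else float("nan")
    return scores

def mean_iou(scores):
    """Average of the per-class IoU scores, ignoring undefined classes."""
    return float(np.nanmean(list(scores.values())))

# Invented labels: 0 = background, 1 = circular anomaly, 2 = linear anomaly.
manual = np.array([[0, 0, 1],
                   [0, 2, 2]])
rf = np.array([[0, 1, 1],
               [0, 2, 0]])

scores = iou_per_class(manual, rf, class_ids=[0, 1, 2])
print(scores, mean_iou(scores))
```

The tabulated mIoU is consistent with an average over the three individual classes: for the first column, (98.9 + 62.5 + 6.5)/3 ≈ 56.0. The combined "circular + linear" row does not enter that average.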

Table 3. Comparison of the segmentation generated by the RF classifier and an existing manual segmentation, on part of the 3D GPR dataset from Interamna Lirenas (see Figure 12), using the IoU metric. The comparison is made before and after manual corrections were applied to the RF classification. The last column shows a comparison between two manual segmentations by independent human interpreters.

Class	IoU (in %)
	Manual vs. RF Classification	Manual Classification vs. RF Classification with Manual Corrections	Two Independent Manual Classifications
Average of IoU scores for each class (mIoU)	56.6	60.8	61.8
Background	91.6	92.3	92.2
Wall foundations	45.1	50.4	50.4
Floors	32.9	39.8	42.7
Wall foundations + floors	50.6	52.3	53.2

Table 4. Computation speed and RAM usage during segmentation of the datasets presented in Section 4.1 and Section 4.2. Experiments were conducted on a PC with 128 GB RAM, an Intel Core i9-9980XE CPU (3.0 GHz, 18 cores), and an Nvidia Quadro RTX 4000 GPU, running ilastik v1.4.1 (for details on the other algorithms used for our experiments, please see the Supplementary Data). The total run times assume that only a small portion of the total data is displayed during training in ilastik and that the directionality feature is calculated by template matching (see Section 4.2).

Processing Step	Boviolles (2D; 17.9 Megapixel)		Interamna Lirenas (3D; 1.4 Gigapixel)
	Run Time	RAM Usage (GB)	Run Time	RAM Usage (GB)
In ilastik:
Calculation of features (all available features and scales)	5 s	0.5	5 s	4.3
Building of RF and computation of probability maps ¹
With entire dataset displayed	225 s	4.6	~38 min	91
With image portion of 1500 × 1000 pixels displayed	15 s	1.2	4 min	33
Hysteresis thresholding	45 s	4.3	13 min	33
Exporting binary segmentation map	4 s	4.3	52 s	11
Other algorithms (run from Jupyter Notebook):
Importing binary segmentation map	1 s	0.6	25 s	35
Watershed segmentation	-	-	9 min	90
Measuring object properties
Standard properties (scikit-image)	10 s	1.1	5 s	21
Directionality based on rank-max opening	-	-	4 h	59
Directionality based on template matching	-	-	8 min	30
Building RF classifier and classifying objects ¹	5 min	1.4	75 s	31
Manual object reclassification	8 s	1.7	10 s	40
Creation of shapefiles	1 min 30 s	2.2	2 min 45 s	6.6
Total	~8 min		~40 min

¹ Based on the assumption of five iterations.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Verdonck, L.; Dabas, M.; Bui, M. Interactive, Shallow Machine Learning-Based Semantic Segmentation of 2D and 3D Geophysical Data from Archaeological Sites. Remote Sens. 2025, 17, 3092. https://doi.org/10.3390/rs17173092