Learning to Classify Structures in ALS-Derived Visualizations of Ancient Maya Settlements with CNN

Abstract: Archaeologists engaging with Airborne Laser Scanning (ALS) data rely heavily on manual inspection of various derived visualizations. However, manual inspection of ALS data is extremely time-consuming and as such presents a major bottleneck in the data analysis workflow. We have therefore set out to learn and test a deep neural network model for classifying previously manually annotated ancient Maya structures of the Chactún archaeological site in Campeche, Mexico. We considered several variations of the VGG-19 Convolutional Neural Network (CNN) to solve the task of classifying visualized example structures from previously manually annotated ALS images of man-made aguadas, buildings and platforms, as well as images of surrounding terrain (four classes and over 12,000 anthropogenic structures). We investigated how various parameters impact model performance, using: (a) six different visualization blends, (b) two different edge buffer sizes, (c) additional data augmentation and (d) architectures with different numbers of untrainable, frozen layers at the beginning of the network. Many of the models learned under the different scenarios exceeded the overall classification accuracy of 95%. Using overall accuracy, terrain precision and recall (detection rate) per class of anthropogenic structure as criteria, we selected visualization with slope, sky-view factor and positive openness in separate bands; image samples with a two-pixel edge buffer; Keras data augmentation; and five frozen layers as the optimal combination of building blocks for learning our CNN model.


Introduction
Archaeologists engaging with airborne laser scanning (ALS) data rely heavily on manual inspection. For this purpose, raw data is most often converted into some derivative of a digital elevation model that the interpreter regards useful for visual examination [1][2][3][4][5][6]. In our own study of ancient Maya settlements, manual analysis and annotation of visualized ALS data for an area of 130 km² took about 8 man-months (the equivalent of a single person working full-time for 8 months) to complete.
Practice shows that manual inspection is extremely time-consuming [7,8] and as such presents a major bottleneck in the data analysis workflow. Continuing along this path without a more efficient solution, it is not only going to be impossible to keep up with ever-increasing data volumes but also very difficult to remove the inherent bias of the human observer [9]. This gives rise to a pressing need for computational methods that would automate data annotation and analysis and thus replace

Study Area, Data and Data Processing
The research area covers 230 km² around Chactún, one of the largest Maya urban centers known so far in the central lowlands of the Yucatan peninsula, located in the northern sector of the depopulated Calakmul Biosphere Reserve in Campeche, Mexico (Figure 1). The area is characterized by low hills, "bulging" a few tens of meters (typically not more than 30 m) above the surrounding seasonal wetlands (bajos). It is completely covered by tropical semi-deciduous forest (Figure 2). The overall elevation range is only from 220 m to 295 m (or 300 m when buildings are included).
Figure 2. The study area is covered by a natural, unmanaged forest and bushes, whose heights rarely exceed 20 m (a). Annotated buildings and platforms are concentrated in the lower part of the study area, while aguadas have been annotated throughout it (b). Terrain samples are shown in (c). Samples of all four classes were separated into the training set and the test set, according to their geographical locations. Each individual sample was placed in the training set if it was located entirely within the boundaries of the training area, or placed in the test set if it was located entirely within the boundaries of the test area.
Chactún's urban core, composed of three concentrations of monumental architecture, was discovered in 2013 by prof. Šprajc and his team [49]. It has several plazas surrounded by temple-pyramids, massive palace-like buildings and two ball courts. A large rectangular water reservoir lies immediately to the west of the main groups of structures. Ceramics collected from the ground surface, the architectural characteristics and dated monuments indicate that the center started to thrive in the Preclassic period (c. 1000 BC-250 AD), reaching its climax during the Late Classic (c. 600-1000 AD) and had an important role in the regional political hierarchy [49,50]. To the south of Chactún are Lagunita and Tamchén, both prominent urban centers, as well as numerous smaller building clusters scattered on the hills of the research area [51,52].
Airborne laser scanning data around Chactún was collected at the end (peak) of the dry season in May 2016. Mission planning, data acquisition and data processing were carried out with clear archaeological purposes in mind. The National Centre for Airborne Laser Mapping (NCALM) was commissioned for data acquisition and initial data processing (conversion from full-waveform to point cloud data; ground classification) [53,54], while the final processing (additional ground classification; visualization) was performed by ZRC SAZU. The density of the final point cloud and the quality of the derived elevation model proved excellent for detection and interpretation of archaeological features (Table 1), with very few processing artifacts discovered. Ground points were classified with the Terrascan software, and algorithm settings were optimized to remove only the vegetation cover and leave remains of past human activities as intact as possible (Table 2). Ground points therefore also include remains of buildings, walls, terracing, mounds, chultuns (cisterns), sacbeob (raised paved roads), and drainage channels. The average density of ground returns from a combined dataset comprising information from all flights and all three channels is 14.7 pts/m², enough to provide a high-quality digital elevation model (DEM) with a 0.5 m spatial resolution. The rare areas with no ground returns include aguadas (artificial rainwater reservoirs) with water.
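As a rough check of why this density supports a 0.5 m grid, the average number of ground returns per DEM cell can be derived from the two figures reported above. The following is a minimal sketch; the function name is illustrative, and it assumes a uniform point distribution, which real ALS data under forest will not satisfy exactly.

```python
def mean_points_per_cell(density_pts_per_m2: float, cell_size_m: float) -> float:
    """Average number of ground returns falling into one square DEM cell.

    Assumes (for illustration only) that returns are spread uniformly.
    """
    return density_pts_per_m2 * cell_size_m ** 2


# Values reported in the text: 14.7 pts/m2 and a 0.5 m grid.
points = mean_points_per_cell(14.7, 0.5)
print(f"{points:.3f} ground returns per cell on average")
```

Under this assumption each 0.5 m cell receives several ground returns on average, so most cell elevations come from measurements rather than interpolation.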

Data Annotations
We used visualization for archaeological topography (VAT) [6] and a locally stretched elevation model to annotate buildings and platforms, as well as local dominance [55] to annotate aguadas. Local dominance shows the slightly raised embankments on the outer edges of aguadas, which do not exist around naturally occurring depressions.
The exact boundaries were drawn on the outer edges of walls or collapsed material. Due to the configuration and state of structures, it was not always possible to delineate each construction; therefore, a single polygon may encompass several buildings (Figure 3). We annotated aguadas in the whole study area and buildings and platforms in its southern part (130 km²).
Artificial terrain flattenings, elevated from the surrounding ground, were annotated as platforms (Figure 5). We only annotated platforms that were not used for agricultural purposes, i.e., we excluded agricultural terraces.
Aguadas are structures specific to and typical of the ancient Maya lowland landscape. They are artificial modifications (deepening) of the terrain, usually with pronounced (slightly raised) edges. They functioned as water reservoirs, that is, artificial ponds or lakes (Figure 6). For annotation of aguadas, we sometimes used additional visualization that accentuated embankments, for instance, local dominance or flat terrain VAT (Figure 7).

Experimental Setup
We developed a classification model, based on CNN, that distinguishes among four different classes: building, platform, aguada and terrain. To this end, we produced thousands of smaller image samples for each class, cut from the ALS visualizations over the study area. These generated samples, rather than the visualization of the whole area, were then used as inputs for learning and testing the CNN models.

Generating the Dataset for the Classification
We determined a square bounding box with an additional edge buffer (2 or 15 pixels, for the two dataset variants) for every anthropogenic structure. We call this square a patch, and exported that part of the visualization as a sample for either a building, a platform or an aguada (Figures 4-6, respectively). Along with the image samples of the three classes representing anthropogenic structures, we added terrain as a fourth class representing a "negative class" (the background). Terrain (Figure 8) was not annotated; therefore, square patches were randomly selected from the whole study area, making sure that no patch contained an annotated structure or any part of one. Terrain samples are all 128 by 128 pixels so that they are comparably sized to samples of structures and match the input size of our neural network. We generated enough terrain samples to match the combined count of building, platform and aguada samples.
The selected terrain patches also could not contain any no-data pixels. The same was true for structure samples; patches at the edge of the study area that included no-data pixels were omitted when generating datasets. For this reason, the testing set for aguadas with a 15-pixel buffer contains four fewer samples (one aguada fewer) than the testing set with a 2-pixel buffer. The same is true for platforms and buildings; there are a few fewer samples in the 15-pixel buffer dataset.
In total, we produced 8706 samples of buildings, 2093 samples of platforms and 95 samples of aguadas. Because the total count of the latter was very low, we produced three additional image samples for each aguada by rotating the visualizations by 90°, 180° and 270°, generating 380 aguada samples in total. For visualizations where hillshading was used, we also adjusted the sun azimuth so that shading direction is preserved and consistent across all final image samples (Figure 9).
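The random selection of terrain patches described above can be sketched as simple rejection sampling. This is an illustration, not the procedure actually used: structures are simplified to axis-aligned pixel boxes (the real annotations are polygons), the no-data check is left out, and all names and the example coordinates are hypothetical.

```python
import random
from typing import List, Tuple

# (col_min, row_min, col_max, row_max) in pixels, max edges exclusive.
Box = Tuple[int, int, int, int]


def boxes_intersect(a: Box, b: Box) -> bool:
    """True if two half-open axis-aligned boxes share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def sample_terrain_patches(width: int, height: int, structures: List[Box],
                           count: int, size: int = 128, seed: int = 0,
                           max_tries: int = 100_000) -> List[Box]:
    """Draw `count` random size-by-size patches that touch no structure box.

    Patches may overlap one another, as the text notes for terrain samples.
    """
    rng = random.Random(seed)
    patches: List[Box] = []
    tries = 0
    while len(patches) < count and tries < max_tries:
        tries += 1
        col = rng.randint(0, width - size)   # randint is inclusive at both ends
        row = rng.randint(0, height - size)
        patch = (col, row, col + size, row + size)
        if not any(boxes_intersect(patch, s) for s in structures):
            patches.append(patch)
    return patches


# Hypothetical 1000 x 800 pixel raster with two annotated structures.
structures = [(100, 100, 180, 160), (400, 300, 520, 450)]
patches = sample_terrain_patches(1000, 800, structures, count=10)
print(len(patches), "terrain patches sampled")
```

A production version would test against the annotation polygons themselves and reject any patch containing no-data pixels, as required in the text.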
We divided the sample images of all four classes into training and testing sets, containing roughly 80% and 20% of the image samples, respectively (Table 3). The dataset was not split randomly, but according to the geographic location (Figure 2). There were two reasons for this: (1) Buildings were often built on top of platforms and platforms often contained multiple buildings. Therefore, the image samples of buildings and platforms overlap. We had to ensure that images with overlapping objects are all in the same dataset, either in the training or the testing set, but not split between them. (2) Because terrain samples were generated randomly, some parts of the terrain could be contained in multiple image samples. Again, the best solution is to split the generated image samples according to location, so no two images of the same area could be in both the training and the testing set. Despite the random selection of terrain patches, the same locations were used for testing of all visualizations.
Table 3. Sample count for each class in the training set and the test set for datasets with a 2-pixel and 15-pixel edge buffer (around structure polygons). The rotated samples are included in the aguada sample count. The number of samples per class is lower for the dataset with a 15-pixel buffer, because samples whose larger buffer intersected the border of the study area, and thus included no-data pixels, were omitted.
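The location-based split can be illustrated with a containment test: a sample goes to a set only if its patch lies entirely inside that set's area. The sketch below assumes rectangular training and test areas and hypothetical coordinates; the real areas in Figure 2 need not be rectangles, and returning `None` for samples that straddle a boundary is an assumption of this sketch, not something the text specifies.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def contains(area: Box, patch: Box) -> bool:
    """True if `patch` lies entirely within `area`."""
    return (area[0] <= patch[0] and area[1] <= patch[1]
            and patch[2] <= area[2] and patch[3] <= area[3])


def assign_split(patch: Box, train_area: Box, test_area: Box) -> Optional[str]:
    """Return 'train', 'test', or None if the patch fits neither area fully."""
    if contains(train_area, patch):
        return "train"
    if contains(test_area, patch):
        return "test"
    return None  # assumption: a straddling sample belongs to neither set


# Hypothetical layout: the test area is a strip to the right of the training area.
train_area = (0, 0, 800, 1000)
test_area = (800, 0, 1000, 1000)

print(assign_split((100, 100, 228, 228), train_area, test_area))
print(assign_split((850, 400, 978, 528), train_area, test_area))
print(assign_split((750, 400, 878, 528), train_area, test_area))
```

Because overlapping building and platform patches occupy the same location, a rule of this kind keeps them in the same set.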

CNN Architecture
We used the VGG network [56], a deep CNN architecture, for transfer learning. There are two variations of the network, the VGG-16 (13 convolutional layers and three fully connected layers) and the VGG-19 (16 convolutional layers and three fully connected layers), and both use very small (3 by 3) convolutional filters. There have been previous uses of the VGG network, where solutions (CNN architecture) either fully or partially rely on the VGG network, in other remote sensing studies [57][58][59][60][61].
Our CNN model, based on VGG-19, was implemented in Python with the Keras and TensorFlow libraries. Image samples were scaled to the input size of 128 by 128 pixels before they were fed into the network. We decided to use pre-trained weights and fine-tune the network, rather than fully train it, as similar remote sensing studies pointed out that this tends to be the best performing strategy [62]. We initialized the network with weights pre-trained on ImageNet [14] and froze the first few layers of the network so that their weights were not updated during backpropagation. We can freeze the pre-set weights for neurons of the top (first) few layers that recognize lines, edges and simple geometric shapes. These visual features are not domain-specific, and the weights set for these neurons have been thoroughly trained on millions of images. However, the weights of neurons at the bottom (end) layers of the network still need to be fine-tuned [63] for our specific task of classification of anthropogenic objects on ALS visualization; therefore, these layers remain trainable.
Our training dataset consists of roughly 17,000 images (rather than millions that would be required ideally). This is a rather small training set for image recognition tasks. In such cases, measures to prevent a model from overfitting the training set should be implemented [64]. For this purpose, a dropout layer was inserted towards the end part of the network. In the dropout layer, neurons and their connections are randomly dropped out (in our case 50% of them, which is a common practice). This regularization method prevents the network from memorizing training samples and instead encourages the network to learn more general representations. Generalization capabilities are thus improved, leading to better model performance on unseen data and a more robust final model [65].
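To make the effect of the 50% dropout concrete, the following is a small plain-Python sketch of the operation at training time. It assumes the "inverted dropout" variant, in which kept activations are rescaled by 1/(1 - rate) so that their expected value is unchanged; the function is illustrative only and is not the layer implementation used in our network.

```python
import random
from typing import List, Optional


def dropout(activations: List[float], rate: float = 0.5,
            rng: Optional[random.Random] = None) -> List[float]:
    """Zero each activation with probability `rate`; rescale the survivors.

    Assumption: inverted dropout, so no rescaling is needed at test time.
    """
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(42)
out = dropout([1.0] * 1000, rate=0.5, rng=rng)
dropped = sum(1 for v in out if v == 0.0)
print(dropped, "of", len(out), "activations dropped")
print("mean activation after dropout:", sum(out) / len(out))
```

At test time the layer is simply bypassed, which is why the rescaling is applied during training.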

Experimental Design
We trained and tested the neural network in different scenarios, examining how different combinations of parameters and visualizations used affect the predictive capabilities of the resulting model.
To find the best visualization for the image classification task, we compared image inputs generated from the following visualizations:
• Visualization for archaeological topography (VAT); a visualization that blends analytical hillshading, slope, positive openness and sky-view factor into a single grayscale image [6,66].
• Flat VAT; the VAT variant for flat terrain, which also includes hillshading.
• VAT-HS; VAT without the hillshading, which blends slope, positive openness and sky-view factor into a single grayscale image.
• VAT-HS channels; slope, positive openness and sky-view factor in separate channels.
• Red relief image map (RRIM); often used for manual interpretation, because it is direction-independent and easy to interpret. It overlays a slope gradient image, colored in white to red tones, with the "ridge and valley index" computed from positive and negative openness in a grayscale colormap [67,68].
• Local dominance (LD); well suited for very subtle positive topographic features and depressions [2]. We included it to test its performance against Flat VAT.
To determine the optimal preparation of image samples, we tested image samples with two sizes for the edge buffer around bounding boxes:
• 15 pixels, to represent a loose edge that includes some immediate surroundings, and
• 2 pixels, for a tight edge. We kept 1 m of surrounding terrain around structures (2 pixels at the 0.5 m resolution) because of the positional uncertainty of hand-drawn polygons.
When we generated image samples, as described previously, we always replicated each aguada three more times with rotations. However, we also used either:
• no additional data augmentation, or
• Keras library data augmentation. Applied augmentations include zoom range, width shift range and height shift range. We did not use rotation and flip, because these would result in inconsistent relief shading and distorted orientation of buildings, which are often aligned to a certain direction.
Finally, we considered two variations of the VGG-19 architecture with different degrees of trainability. The neural networks were initialized with either:
• 3 frozen (untrainable) layers at the top, or
• 5 frozen layers at the top.
We consider the above dimensions as independent. We tested all possible combinations of visualizations, edge buffer, data augmentation and number of frozen layers, which resulted in 48 different scenarios.
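The count of 48 follows from the full factorial design: 6 visualizations × 2 edge buffers × 2 augmentation settings × 2 frozen-layer settings. A short sketch that enumerates the grid is given below; the variable names and the tuple encoding of a scenario are illustrative only.

```python
from itertools import product

visualizations = ["VAT", "Flat VAT", "VAT-HS", "VAT-HS channels", "RRIM", "LD"]
edge_buffers_px = [2, 15]
keras_augmentation = [False, True]
frozen_layers = [3, 5]

# Every combination of the four independent dimensions is one scenario.
scenarios = list(product(visualizations, edge_buffers_px,
                         keras_augmentation, frozen_layers))

print(len(scenarios))  # 48
print(scenarios[0])
```

Each tuple corresponds to one model that is trained and then evaluated on the test set.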
All of the layers of our network are represented in the architecture diagram (Figure 10). Layers at the top of the network are represented at the left side.
Figure 10. Architecture of our VGG-19 based CNN network, depicted with color-coded layers. The network accepts images of 128 by 128 pixels as inputs. The first part of the network, the part on the left, represents the VGG-19 architecture, which consists of multiple blocks of convolution layers (blue) and max pooling layers (gray). At the end of the VGG-19, we added five extra layers, including flatten layer (green), dense or fully connected layer (yellow), dropout layer (orange) and dense layer with softmax function (red). The output of the network is one of the four classification labels (aguada, building, platform or terrain).
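To see what the flatten layer receives for a 128 by 128 input, the spatial size can be traced through the convolutional part of the network. The sketch below uses the standard VGG-19 block configuration (taken from the published VGG definition, not from this text) and assumes that convolutions preserve spatial size while each block ends with a 2 by 2 max pooling. The size of the added dense layer is not stated here, so it is not modeled.

```python
from typing import List, Tuple

# (number of 3x3 convolution layers, number of filters) per block in the
# standard VGG-19 configuration; each block ends with 2x2 max pooling.
VGG19_BLOCKS: List[Tuple[int, int]] = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]


def conv_layer_count(blocks: List[Tuple[int, int]]) -> int:
    """Total number of convolution layers across all blocks."""
    return sum(n_conv for n_conv, _ in blocks)


def feature_map_shape(input_size: int, blocks: List[Tuple[int, int]]) -> Tuple[int, int, int]:
    """Shape (height, width, channels) after the last pooling layer."""
    size = input_size
    channels = 0
    for _, filters in blocks:
        size //= 2          # size-preserving convolutions, then pooling halves it
        channels = filters
    return size, size, channels


shape = feature_map_shape(128, VGG19_BLOCKS)
flattened = shape[0] * shape[1] * shape[2]
print(conv_layer_count(VGG19_BLOCKS), "convolution layers")
print("feature map before flatten:", shape, "->", flattened, "values")
```

Five pooling steps reduce 128 pixels to 4, so the dense layers operate on a 4 by 4 grid of 512-channel features.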

Results
For every scenario, the model performance was measured in terms of overall accuracy, precision for the terrain class and recall for the classes of anthropogenic structures (building, platform, aguada).
Accuracy (overall accuracy) gives us an initial evaluation of model performance in general, over all classes. While our classes have different numbers of training samples, no anthropogenic structure is more important than another (and neither is the terrain); we thus opted for non-weighted average accuracy.
Models with high overall accuracy (in our results many score above 0.95, as shown in Scheme 1) are good candidates for final model selection, even though the accuracy itself is not a sufficient measure to evaluate model performance.

Scheme 1. Test results for all 48 scenarios; overall accuracy (ACC) and terrain precision (TPrec). Note that terrain precision is often rounded to 1.0 when it is in reality between 0.995 and 1.0 (meaning there are still some false positive terrain samples, but very rare). Light green rows mark scenarios with test ACC and TPrec higher than 0.90 and dark green where both are higher than 0.95. For the scenarios marked, the confusion matrices are presented in Figure 11, e.g., Figure 11a for the VAT-HS visualization, 15-pixel edge buffer, Keras data augmentation and 5 frozen layers.
Ideally, our final model would also:
• minimize false-negative results of anthropogenic structures' classes since we do not want structures to go unrecognized. Minimizing these false negatives would result in high recall for building, platform and aguada classes. Recall (also known as detection rate or sensitivity) can be defined as the probability of detection; the proportion of the actual positive samples that have been correctly classified as such.
• minimize false-positive results for the terrain class, since we do not want any structure to be misclassified as terrain; it is, in fact, better for one type of structure to be misclassified as a different type of structure. Minimizing terrain false positives leads to high precision for the terrain class. Precision is the fraction of true positive samples among those that were classified as positive ones.
Accuracy and terrain precision results are presented in Scheme 1, while Figure 11 shows the confusion matrices for the four models with highest accuracies (b,d,e,f) and two with lower accuracies (a,c).
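The three criteria can all be read off a confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes; the matrix values are invented purely for illustration and are not results from this study.

```python
from typing import List

CLASSES = ["aguada", "building", "platform", "terrain"]


def overall_accuracy(cm: List[List[int]]) -> float:
    """Correctly classified samples divided by all samples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)


def recall(cm: List[List[int]], k: int) -> float:
    """Share of true class-k samples that were classified as class k."""
    return cm[k][k] / sum(cm[k])


def precision(cm: List[List[int]], k: int) -> float:
    """Share of samples classified as class k that truly are class k."""
    return cm[k][k] / sum(row[k] for row in cm)


# Hypothetical confusion matrix (rows: true class, columns: predicted class).
cm = [
    [70, 0, 2, 4],       # aguada
    [0, 1700, 40, 2],    # building
    [1, 60, 350, 3],     # platform
    [3, 10, 5, 2200],    # terrain
]

terrain = CLASSES.index("terrain")
print("ACC:", round(overall_accuracy(cm), 4))
print("TPrec:", round(precision(cm, terrain), 4))
for name in ("aguada", "building", "platform"):
    print(name, "recall:", round(recall(cm, CLASSES.index(name)), 4))
```

In this layout, the off-diagonal entries in the terrain column are exactly the structures that would be missed if samples classified as terrain were never inspected.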

Discussion
The results presented in Scheme 1 show that many of the tested models can successfully classify the three types of anthropogenic structures against the natural terrain, with accuracies of 95% and above. It is important to note that even archaeologists inspecting ALS visualizations are not always sure how to annotate a certain feature, as in many cases it is not obvious whether a feature is natural or anthropogenic. When using a model with terrain precision of 0.995 and above (rounded up to 1.0 in Scheme 1), fewer than 0.5% of the samples classified as terrain are in fact anthropogenic structures, so very few structures would be missed by not inspecting the samples classified as terrain. The other errors involve an anthropogenic structure misclassified as belonging to another class of anthropogenic structures. From an archaeological point of view, it is better to confuse an anthropogenic structure with another type of anthropogenic structure than to misinterpret it as terrain (e.g., for estimating the manpower needed for construction or for population estimates).

Analysis of Misclassifications
Here we analyze the misclassifications made by the model learned with the VAT-HS visualization, using image samples generated with a 2-pixel edge buffer around each structure's bounding box. This visualization is one of the two that achieved the highest performance and is the one that is visually most similar to VAT, which was used for the manual annotation in our study, as well as in many others. The CNN model was learned with the top five layers frozen and with no additional data augmentation. The resulting model achieves 98% overall accuracy (Scheme 1).
It is evident from the confusion matrix (Figure 11) that the most common type of classification error is misclassifying platforms as buildings, which is also true for most of the other tested scenarios. Visual inspection reveals that the misclassified platform samples usually contain a single building on a platform (Figure 12a,b), a string of buildings and multiple buildings on a larger platform (Figure 12c) or a smaller platform with particularly pronounced edges that in itself resembles a building (Figure 12d).
Some aguadas can be rather easily mistaken for natural formations because many have very indistinct edges (Figures 6 and 14a). The reverse is even more common; terrain is misclassified as an aguada in areas with a rougher surface, which we think is a highly porous limestone on the edge of a bajo (Figure 14b). Inspection of false positives for the aguada, building and platform classes additionally reveals a number of terrain samples that contain unannotated anthropogenic structures (e.g., walls, stone piles, agricultural terraces, tracks, quarries, chultuns), which results in the misclassification of terrain samples (Figure 14c,d).
Some other natural formations could resemble man-made structures as well, but these do not seem to appear frequently within our study area. They could be either smaller, steep or rugged formations, the size of a house or a palace, thus resembling ruins of a building; or natural flattening of a larger area that could be mistaken for a platform. This could present a potential issue for the algorithm, and sometimes even for the archaeologists doing manual annotation, if the ALS data itself are noisy or if the archaeological features are too heavily eroded to be easily distinguished. We believe the model works better when the terrain characteristics are quite different from the archaeological features.
In any case, all terrain samples should ideally be checked for accuracy of annotations, but it is this time-consuming process we try to avoid with the use of our CNN models. Thus, the best-performing models are those that achieve not only high accuracy but also terrain precision close to 100%, meaning there are almost no anthropogenic structures misclassified as terrain. This saves time because double checking samples classified as terrain is unnecessary, and potentially only visual checking of samples classified as aguadas, buildings, or platforms is required. Therefore, we want primarily to have high accuracies for these three structure classes; yet as a consequence, this would also result in better model performance on the terrain class.

Comparison of Visualizations
Comparison of the performance of models for different visualizations reveals that no single visualization is categorically better than others. However, we notice that VAT-HS and VAT-HS channels (with accuracies of the best models reaching 98% and 99%, respectively) seem to have some advantage over VAT and Flat VAT in most of the test scenarios. Drawing conclusions from these results, the addition of hillshading into the visualization blend seems to be hindering the model performance.
However, hillshading appears to remain a subjective preference for many of the researchers in archaeology and remote sensing that visually inspect ALS visualizations. This preference may only exist due to our human real-life experience and a more intuitive understanding of hillshading compared to other visualizations. Additionally, a person can better inspect a single visualization blend than three separate visualizations, while the same is not necessarily true for CNN classification models. The VAT-HS channels visualization that includes slope, positive openness and sky-view factor in separate channels in most cases performs better than VAT-HS, which blends the three into a single grayscale image. The preference for VAT-HS or VAT-HS channels does not present an issue in practice because either can achieve good results. What makes the difference is knowing, or finding, how to initialize the parameters of the neural network to optimize the performance of the final CNN classification model.
RRIM visualization performed well, achieving accuracy of up to 96%, while LD produced unsatisfactory results with accuracy ranging from 45% to 70%. Terrain precision with LD is often at 100%, meaning there are no false positives for terrain, but that is because almost all samples are classified as buildings. This could be connected to buildings and platforms appearing too bright (having a single value) to differentiate details in this particular visualization. Although LD was of great use for determining the exact boundaries of aguadas manually, this does not seem to translate to the CNN models, which perform worse with local dominance than with the other visualizations for all classes.

Effects of the Edge Buffer
When applying a smaller or larger edge buffer, there are considerable differences in the total area covered within an image sample. An additional 15-pixel edge buffer is hardly noticeable on image samples of platforms, which are usually quite large (Figure 15a,b); however, with small building samples, the added surroundings can make up a relatively large portion of the final image sample (Figure 15c,d). We should also keep in mind that all samples are resized to 128 by 128 pixels before they are fed into the neural network, regardless of the absolute size of the initial (generated) image sample. The size of the edge buffer only affects image samples of aguadas, buildings and platforms because we generated terrain samples of a uniform size. The models trained and tested on image samples with a 2-pixel edge buffer generally perform better than models with a 15-pixel edge buffer. The exceptions are Flat VAT and local dominance because their computation relies on a larger convolutional filter; they consider a larger local area in the calculation of the value for a certain pixel. Slope, for example, typically has a computational convolutional filter size of 3 by 3 pixels, while local dominance typically includes the local area in a radius from 10 to 20 pixels and excludes the nearby pixels altogether.
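The different impact of the buffer on small and large structures can be quantified as the share of the patch area that lies outside the structure's square bounding box. The sketch below is illustrative: the bounding-box sides of 20 and 200 pixels are hypothetical examples of a small building and a large platform, and the 0.5 m pixel size is the DEM resolution reported earlier.

```python
def buffer_share(bbox_side_px: int, buffer_px: int) -> float:
    """Fraction of a square patch that lies outside the square bounding box."""
    patch_side = bbox_side_px + 2 * buffer_px   # buffer is added on both sides
    return 1.0 - (bbox_side_px / patch_side) ** 2


def buffer_width_m(buffer_px: int, pixel_size_m: float = 0.5) -> float:
    """Ground width of the buffer on one side of the bounding box."""
    return buffer_px * pixel_size_m


# Hypothetical bounding-box sides: a small building and a large platform.
for label, side in (("building", 20), ("platform", 200)):
    for buffer_px in (2, 15):
        share = buffer_share(side, buffer_px)
        print(f"{label:8s} side={side:3d}px buffer={buffer_px:2d}px "
              f"({buffer_width_m(buffer_px)} m): {share:.1%} of patch is buffer")
```

For the small example, the 15-pixel buffer accounts for most of the patch, whereas for the large one it remains a minor fraction, which matches the visual impression described for Figure 15.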

Effect of the Number of Frozen Layers on Model Performance
Having tested models with architectures where either the top three or the top five CNN layers are frozen and untrainable, our results do not show an obvious advantage of one scenario over the other. Keeping in mind that we only have approximately 17,000 image samples for training, perhaps the differences in performance would be greater and comparison more reliable if the training dataset had been larger. We therefore currently do not have enough information to determine the optimal number of frozen layers for our CNN classification model.

Effects of Data Augmentation
When training the CNN models with datasets of just thousands or tens of thousands of images, some form of data augmentation is often used to improve model performance [69,70]. In our tests, using the data augmentation options from the Keras library did not significantly change the final results, so the data augmentation advantages observed in other studies were not evident in our experiments. The restrictive transformations performed on our data could have prevented the data augmentation from taking full effect.

Feasibility of the Model to Replace Manual Annotation
While our high-performing CNN models seem promising, we are, in the end, looking for a model that will eventually replace manual work, which has been the bottleneck in the data analysis workflow. If we wanted to achieve this with our current CNN models for image classification, the ALS visualization of the wider area would need to be cut into thousands of smaller image samples that are fed one by one into the model. For supervised learning within our study, the wider area was cut up according to the manual annotations in the form of polygons. When using the model on a new (unannotated) area, cutting the wider image into smaller ones would be done according to a mesh (for example, a mesh of 128 by 128 pixel squares). Additional difficulties would arise when multiple structures are present within the same image sample. Better options for replacing the manual work in terms of recognition of individual structures from ALS images would then be CNN models for either object detection [71] or semantic segmentation [72]. The current study is as such more of a feasibility study, whose results serve as a proof-of-concept that ALS visualizations, especially VAT and VAT-HS channels, are suitable for distinguishing anthropogenic objects from images with the use of CNNs.
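Cutting a new area according to a mesh amounts to enumerating tile origins over the raster. The sketch below is a hypothetical illustration of that step: the function name and the optional `stride` parameter (for overlapping tiles) are assumptions, and incomplete tiles at the right and bottom edges are simply dropped.

```python
from typing import List, Optional, Tuple


def tile_origins(width: int, height: int, tile: int = 128,
                 stride: Optional[int] = None) -> List[Tuple[int, int]]:
    """Top-left (col, row) of every full tile that fits inside the raster.

    With stride == tile the mesh is non-overlapping; a smaller stride
    produces overlapping tiles.
    """
    stride = stride or tile
    cols = range(0, width - tile + 1, stride)
    rows = range(0, height - tile + 1, stride)
    return [(c, r) for r in rows for c in cols]


# Hypothetical 512 x 256 pixel raster.
origins = tile_origins(512, 256)
print(len(origins), "tiles; first:", origins[0], "last:", origins[-1])
```

A structure that falls across a mesh line would be split between tiles, which is one of the difficulties that motivates object detection or segmentation models instead.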
The results presented in this study instill the confidence that more complex CNN-based methods used with ALS visualizations are worth pursuing. The CNN models that we believe are able to come closest to replacing the manual annotation work are semantic and instance segmentation networks [73][74][75] for detecting archaeological features. While such CNN models are more complex and time-consuming to train, they produce results in the form of an image overlay with patches, which define the exact boundaries of individual objects detected within the input image (the ALS visualization). This type of output is analogous to the polygons that are produced with manual annotation. The use of a semantic or instance segmentation model would still require annotated (labeled) data for training. However, annotated data used for our current research are suitable for this task and can be reused with little or no additional manual work, saving us from the most time-consuming part of the model development process.
Conversely, if we were not to use any kind of automation for completing annotations for all of the anthropogenic structures over the whole area, we would need to manually annotate the remaining 100 km² (out of 230 km² total study area) to complete full anthropological analyses. Considering that annotating the first 130 km² already took 8 man-months, we are looking at a potential 6 more man-months spent on manual work, only for the current study area. We believe we could save time further down the line by first implementing a sufficiently advanced CNN network, e.g., an instance segmentation model, and then using it on the remaining unannotated 100 km² of our study area, as well as on new (unannotated) study areas where similar terrain and archaeological features can be expected.
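The estimate of about 6 additional man-months is a linear extrapolation of the annotation rate achieved so far. A minimal sketch of the arithmetic is given below; it assumes the remaining area has a similar density of structures to the annotated part, which the text does not claim, and the function name is illustrative.

```python
def remaining_effort_months(done_area_km2: float, done_months: float,
                            remaining_area_km2: float) -> float:
    """Extrapolate annotation effort linearly from the area already done."""
    months_per_km2 = done_months / done_area_km2
    return months_per_km2 * remaining_area_km2


total_area_km2 = 230
annotated_km2 = 130
remaining_km2 = total_area_km2 - annotated_km2

estimate = remaining_effort_months(annotated_km2, 8, remaining_km2)
print(f"{remaining_km2} km2 remaining, about {estimate:.1f} man-months")
```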
The data we have available for training the model include different types of terrain-a lot of hilled and flat areas, but rather rare steep features. We expect the CNN model trained on such terrain to perform with comparable accuracy when used on ALS visualizations of other areas depicting similar terrain types and containing similar archaeological features. Since our data also lack any modern man-made structures, they are more suitable for use in remote areas where mostly (or only) remains of ancient structures are expected. There are thousands of square kilometers of similar areas in Mesoamerica [3,76,77] for which our CNN model could be later reused, as long as the ALS resolution and data quality of the recorded areas are comparable to ours. If criteria described in this paragraph were met, then ideally, the reuse of the model would require only fine-tuning, if at all. This way, thousands of square kilometers could be automatically examined and annotated in a matter of days or weeks instead of months. Manual inspection would then be employed for validation of annotations only and time spent for the whole analysis would still be considerably reduced.

Conclusions
We have shown that a CNN model can successfully classify multiple types of ancient Maya structures and differentiate them from their natural surroundings. We used ALS visualizations of individual anthropogenic structures and terrain as input images. The resulting image datasets included samples belonging to four classes: aguada, building, platform, and terrain.
We tested the performance of CNN models learned with six different ALS visualizations (visualization blends), two variants of edge buffer for the generation of image samples, training with and without the data augmentation, and deep neural network variants with three or five frozen layers. We discovered that CNN models using VAT visualization blends without the hillshading (VAT-HS, VAT-HS channels) generally perform better than models using visualizations that include hillshading (VAT or Flat VAT). Local dominance produced very poor results, and we consider it by itself unsuitable for such a classification task. From our results, we can also conclude that models using image samples with the 2-pixels edge buffer around the structure's bounding box usually perform better than those using the 15-pixels variant. For the other parameters (data augmentation, number of frozen layers), however, one parameter value was not necessarily better than the other, and the combination of different parameters seemed more important than any one parameter by itself.
Many of the considered scenarios result in models that achieve an overall accuracy of 95% and above. Based on the overall accuracy, precision for terrain and recall for aguada, building and platform classes as deciding criteria, we selected VAT-HS channels visualization and image samples with 2-pixels edge buffer, Keras data augmentation and five frozen layers as the optimal combination for our CNN model (Scheme 1).
The research presented in this paper is a proof-of-concept that ALS visualizations can be useful and effective for deep learning-based classification of Maya archaeology from ALS data. Despite the very high performance of our CNN models, image classification CNNs in their current form cannot replace the manual annotation process to the extent that we desire. As such, our current research presents a stepping-stone towards object recognition and/or semantic segmentation. Future work will, therefore, include using the whole ALS visualization to recognize and locate each anthropogenic structure and its exact boundaries, which holds the potential to eliminate the vast majority of manual analysis and annotation work.