Machine Learning Generalisation across Different 3D Architectural Heritage

Abstract: The use of machine learning techniques for point cloud classification has been investigated extensively in the last decade in the geospatial community, while in the cultural heritage field it has only recently started to be explored. The high complexity and heterogeneity of 3D heritage data, the diversity of the possible scenarios, and the different classification purposes that each case study might present make it difficult to realise a large training dataset for learning purposes. An important practical issue that has not yet been explored is the application of a single machine learning model across large and different architectural datasets. This paper tackles this issue by presenting a methodology able to successfully generalise a random forest model, trained on a specific dataset, to unseen scenarios. This is achieved by looking for the best features suitable to identify the classes of interest (e.g., wall, windows, roof and columns).


Introduction
The documentation, restoration and conservation of architectural heritage monuments have become fundamental for protecting and preserving them from armed conflicts, climate change effects, natural catastrophes and human-caused disasters. These risks are compounded by the fact that all monuments are inevitably in a constant state of chemical transformation.
The advent in the last decades of 3D optical instruments for the 3D digitisation of objects and sites has undoubtedly changed the concept of heritage conservation and preservation. Indeed, the cultural heritage (CH) field is taking great advantage of reality-based surveying techniques (e.g., photogrammetry, laser scanning) [1,2]. Currently, digital photogrammetry and laser scanning have become standard methods for data acquisition and digital recording for the 3D documentation of heritage assets. These technologies allow the generation of realistic 3D results in terms of geometric and radiometric accuracy, overcoming the so-called direct surveys, which involve measuring in direct contact with objects or excavation areas. Once data are acquired (images, scans, single points, etc.), post-processing operations allow the derivation of dense point clouds, polygonal models, orthoimages, sections, maps, drawings and further products. Because they provide precise representations of the objects at a given time, to be passed down to future generations, these kinds of data can be used as a base for any further studies [3]. In this context, the association of semantic information to the point clouds simplifies the reading of CH data, accelerating the phase of data management and interpretation. There are various applications where semantically annotated point clouds are requested, such as:
• Identification of architectural elements, supporting the scan-to-BIM process [4][5][6][7];
• Detection and quantification of different states of conservation or materials, deriving data for monitoring and restoration purposes [8][9][10][11];
• Quantification of surface areas or volumes of interest, useful both for architectural maintenance planning and for damage detection [12][13][14][15];
• Abstraction of structural elements, prior to simulations with finite element methods/analysis systems (FEM/FEA) [16][17][18].
ISPRS Int. J. Geo-Inf. 2020, 9, 379; doi:10.3390/ijgi9060379 www.mdpi.com/journal/ijgi
As most of these applications are based on time-consuming and subjective manual procedures of annotation, it becomes fundamental to realise a more objective and automated classification method.

State of the Art
To date, machine learning (ML) and its subset deep learning (DL) have become the state-of-the-art methods for point cloud classification, overcoming rule-based approaches such as the Hough transform, Random Sample Consensus (RANSAC), or region growing, presented by Grilli et al. in [19]. Among the ML approaches, the studies proposed in Vosselman [20], Weinmann et al. [21], and Niemeyer et al. [22] can be considered as pioneer works in the geospatial field. Equally, on the DL side, it is fundamental to mention PointNet and its later improvement PointNet++ [23,24], built to perform the classification/part segmentation of simple objects with replicated shapes (e.g., mug, plane, table, car). Both ML and DL are fields of artificial intelligence research related to the development of algorithms that allow computers to make predictions based on empirical training data. Associated with the training data are the so-called features: variables found in the given training set that can help, strongly or at least sufficiently, to build an accurate predictive model. While within standard machine learning approaches the choice of the features depends on the operators, deep-learning methods can learn the features by themselves, as part of the training process [25]. This ability to learn features is considered one of the main causes of the quick advance in 2D and 3D understanding benchmark results [26]. However, deep learning does so using neural networks with many hidden layers, powerful computational resources and a significant amount of annotated data [27]. In this regard, the availability or unavailability of data can favour or limit the application of deep-learning approaches in some fields more than in others. As Griffith and Boehm asserted [26], benchmarks are essential to provide the community with high-quality training data, also allowing a fair comparison between the various algorithms and approaches.
Although current public datasets provide several indoor [28][29][30] and outdoor scenes [31][32][33][34], there is still an evident lack of benchmarks designed for the heritage and architectural field. Despite this, in recent years the following solutions have been proposed. A random forest (RF) classifier has been used on texture and geometric data in [35]. Murtiyoso and Grussenmeyer [36] have proposed an algorithmic approach to perform point cloud segmentation through geometric rules and mathematical functions. Pierdicca et al. [37] presented a dynamic graph convolutional neural network (DGCNN) for point cloud segmentation trained with 11 labelled scenes of heritage architecture.

Aim and Contribution of the Paper
The design of a heritage data classification model is challenging due to the high variability of scenarios in this field. In addition, the class definition might change according to the classification aims (e.g., architectural element identification vs. material quantification) and the case study treated (e.g., classic temples differ from churches, churches can differ a lot from each other according to their architectural style, etc.).
In our previous work [38], a standard machine learning approach based on an accurate selection of geometric features was developed to facilitate and accelerate the classification of some heritage monuments. Whereas previously a specific model was trained for each case study (Figure 1), the main aim of this paper is to verify the capability of a pre-trained model to generalise over other unseen 3D scenarios featuring similar characteristics (Section 2). By 'generalisation' we refer to a machine learning model's ability to perform well on new, unseen data, rather than just on the data it was trained on. This term might be confused with the concept of 'transfer learning', used in the deep learning community to indicate the use of a model pre-trained for a particular task to solve a different problem (e.g., using a model trained to recognise apples for identifying pears) [39,40]. In order to test the generalisation concept, we worked with urban architectures (Section 1.3), looking for some recurrent classes such as floors, facades, windows, doors and columns (Figure 2).
An additional goal of this work is to test the generalisation when training and test datasets are acquired with different sensors (i.e., terrestrial photogrammetry and terrestrial laser scanners), featuring different resolutions, levels of noise, and attributes (Section 3.2).
In summary, the aims and contributions of the presented work are:
• Identifying a set of transversal architectural classes and a few (geometric and radiometric) features that can behave similarly among different datasets;
• Generalising a pre-trained random forest (RF) classifier over unseen 3D scenarios, featuring similar characteristics;
• Classifying 3D point clouds featuring different characteristics, in terms of acquisition technique, geometric resolution and size.
In the next paragraph, the heritage datasets used for the experiments are presented. In Section 2 the adopted approach is described, with particular regard to the identification of the classes and the feature selection. Section 3 presents different experiments and discusses the classification results, followed by the conclusions in Section 4.

Datasets
The different datasets used in our experiments consist of five photogrammetric point clouds provided with RGB colour information and one laser scanned point cloud without colour information (Table 1).
The first three datasets considered (Table 1, A-B-C) represent a portion of the 40 km of porticoes built between the 11th and 20th centuries in Bologna. As they became a distinctive building feature of the city, 25% of the porticoes were digitised using terrestrial photogrammetry under a project for the candidature of the porticoes as UNESCO "world heritage site" [41]. Such structures are interesting for our study, because they combine various types of columns and vaults, different materials, and many architectural details such as mouldings and ornaments. Among them, the Bologna-S. Stefano dataset (Table 1, A) was considered as a reference dataset where some portions were annotated and used as a training set. This dataset was chosen because it represents a heterogeneous starting point for the subsequent classification of the other scenarios (Figure 3).
Dataset D comes from a photogrammetric survey of the Buonconsiglio Castle in Trento (Italy). It is the renaissance-style lodge of the castle (15th century) that, despite being of modest size, includes all the architectural classes previously annotated in the Bologna dataset.
To test the generalisation properties of the model we also worked on the challenging dataset of the Dome square in Trento (E), composed of buildings of different styles and periods, including the medieval praetorian palace, the city tower and the Dome (12th-13th centuries).
Finally, the classification was extended to a large portion of the old town of Trento (F) (about 1 km of facades), surveyed with a hand-held laser scanning system. A critical problem with this dataset was the decreasing spatial resolution from ground to top, as well as the absence of texture information.
Table 1. Datasets considered in the presented work to validate the pre-trained model and the generalisation method (Av. D. = average distance between points, L = length of the facades).

Methodology
Even considering the advent of the deep learning approaches for point cloud classification [26], in this paper we chose to work with a random forest (RF) algorithm [42], for the following reasons:
• Recent literature shows that this can still be considered a competitive method for point cloud classification [43][44][45][46];
• We wanted to extend the method presented in our previous study [38], and verify its applicability to larger and different scenarios;
• There is a lack of annotated architectural training data necessary for training a neural network.
RF uses an ensemble of classification trees, gets a prediction from each tree and selects the best solution through voting. Each tree represents an individual classifier in the ensemble and is trained on a random subset of the training sample. During the training phase, both class labels and features were given as input to the model so that it could learn to classify points based on these features. In this context, to increase the reliability of the generalisation, we had to make sure that the training dataset was as representative as possible of the entire scenario. To achieve this, it was fundamental to identify (i) transversal classes (Section 2.1) and (ii) features that could behave similarly among different datasets (Section 2.2).
In our classification experiments, the Scikit-learn Python library (version 0.21.1) was used [47] to train the RF classifier and predict the classes over unseen areas.
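The train-then-predict workflow described above can be sketched with Scikit-learn as follows. This is only an illustrative sketch, not the configuration used in the experiments: the feature table is synthetic, and the number of features, the number of trees and the random seed are assumptions chosen to keep the example runnable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical per-point feature table: one row per 3D point, one column per
# feature (e.g. planarity, omnivariance, surface variation, verticality, ...).
n_features = 6  # assumption, not the paper's feature count
X_train = rng.random((500, n_features))

# Synthetic labels so that the sketch runs: class 1 where feature 0 is high.
y_train = (X_train[:, 0] > 0.5).astype(int)

# Hyperparameters are illustrative assumptions.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Features extracted in the same way from an unseen area or dataset.
X_unseen = rng.random((200, n_features))
y_pred = clf.predict(X_unseen)

# Impurity-based importance of each feature, normalised to sum to 1.
importances = clf.feature_importances_
```

Note that Scikit-learn combines the trees by averaging their predicted class probabilities rather than by counting one hard vote per tree; for the purpose of this sketch the two behave similarly. The same fitted `clf` object can be reused on any dataset whose feature columns are computed consistently with the training set.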

Class Selection
For the class selection, we followed the idea proposed in [48], where classes have been defined by studying several standards and dictionaries underlying the construction of 3D architectural models. In addition to their proposed floor, facade, column, arch, vault, window and door, we decided to add the classes moulding, drainpipe and other. This last category specifically includes all those objects that do not belong to the architectural classes (e.g., low vegetation, fences, garbage cans, bikes).
The classes were annotated using our in-house web annotation tool (Figure 4) built upon the Semantic-Segmentation-Editor web application [49].

Feature Selection
A critical part of the success of a classification model relies on the good selection of the training features. In order to characterise each point for classification, we combined the use of (i) radiometric and (ii) geometric features, extracted from the point clouds.

Radiometric Features
Radiometric features, when available, can be useful to recognise objects such as windows, commonly painted with specific colours, or drainpipes, covered with a reflective material resulting in a high-intensity value. Given that different colour spaces represent the colour information in different ways, some of them can facilitate certain calculations [35]. Hence, after various attempts, in this work we chose to use both a composite channel of the RGB values ((R+G+B)/3) and the colour component b* of the L*a*b* colour space [50]. In the L*a*b* colour space, L* indicates lightness, while a* and b* are chromaticity coordinates: a* is the red/green axis and b* is the yellow/blue axis. Ignoring the L* channel (lightness) makes the algorithm more robust to lighting differences. The colour component b* was chosen as it can facilitate the distinction between windows and walls (Figure 5).
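As a rough illustration of the two radiometric features, the sketch below computes the composite channel (R+G+B)/3 and the b* component for a single point colour. It assumes 8-bit sRGB input and a D65 reference white, which the text does not specify; the function names are hypothetical.

```python
def srgb_to_linear(channel):
    """Undo the sRGB gamma for one 8-bit channel value (0-255)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def lab_f(t):
    """Piecewise cube-root function used in the CIE L*a*b* definition."""
    delta = 6.0 / 29.0
    return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3.0 * delta ** 2) + 4.0 / 29.0


def radiometric_features(r, g, b):
    """Return (composite, b_star) for one 8-bit RGB colour.

    Assumptions: sRGB primaries and a D65 white point (Yn = 1, Zn = 1.08883).
    """
    composite = (r + g + b) / 3.0
    rl, gl, bl = (srgb_to_linear(v) for v in (r, g, b))
    # Only the Y and Z tristimulus values are needed for b*.
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl
    b_star = 200.0 * (lab_f(y / 1.0) - lab_f(z / 1.08883))
    return composite, b_star


composite_grey, b_grey = radiometric_features(128, 128, 128)
_, b_yellow = radiometric_features(255, 255, 0)
_, b_blue = radiometric_features(0, 0, 255)
```

Neutral colours give b* close to zero, yellowish colours give positive values and bluish colours give negative values, which is the kind of separation that helps to tell window surfaces from walls.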

Geometric Features-Covariance Features
To describe the geometric distribution of the points and highlight the discontinuities between the architectural elements, we used a few selected covariance features from [51]. The covariance features are widely used in segmentation and classification procedures because of their capability to provide in-depth knowledge on the geometrical structure of the reconstructed scene [21,52,53]. These features derive from the normalised eigenvalues λi (λ1 > λ2 > λ3) of the 3D structure tensor calculated from the 3D coordinates of all the points within a considered neighbourhood [54]. Different strategies can be applied to identify local neighbourhoods for points belonging to a 3D point cloud [55]. In a previous study [38], the authors investigated the behaviour of the covariance features calculated within spherical neighbourhoods at increasing radius sizes, in order to select a reduced number of features that could be beneficial for the classification of heritage case studies. Besides covariance features, the verticality V and the absolute height of the points in the cloud (Z coordinates) were considered. One of the main problems of using many features is the computational time, which grows with the density of the point clouds, the number of features to be extracted, and the size of the search radii [56]. Moreover, in [38] it was shown that the accuracy of the results was not related to the number of features used, but rather to their quality. Therefore, to make the generalisation effective, it was essential to identify a small set of features able to perform similarly across different architectural datasets. For the analysis of the best features, we first considered the selection suggested by the RF algorithm, based on impurity reduction [42], starting from a multi-scale analysis done over the training set (Figure 6).
Then, iteratively considering the most important features, only planarity P, omnivariance O, surface variation C, and verticality V, at specific radii (Table 2, Figure 7) were used. In addition, the absolute height was employed.
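For illustration, the selected features can be computed for one spherical neighbourhood as sketched below. The formulas are the definitions commonly used in the literature for eigenvalue-based features (planarity (λ2 - λ3)/λ1, omnivariance (λ1·λ2·λ3)^(1/3), surface variation λ3/(λ1 + λ2 + λ3), verticality 1 - |nz|); it is an assumption that they match the exact definitions of [51]. The function names, the brute-force neighbour search and the synthetic floor and wall patches are hypothetical.

```python
import numpy as np


def spherical_neighbourhood(points, centre, radius):
    """Brute-force selection of the points within `radius` of `centre`."""
    distances = np.linalg.norm(points - centre, axis=1)
    return points[distances <= radius]


def covariance_features(neighbours):
    """Eigenvalue-based features of one neighbourhood (N x 3 array, N >= 3).

    Assumes the points are not all coincident, otherwise the eigenvalue
    sum is zero and the normalisation is undefined.
    """
    pts = np.asarray(neighbours, dtype=float)
    cov = np.cov(pts.T)                     # 3 x 3 structure tensor
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    eigvals = np.clip(eigvals, 0.0, None)   # guard against tiny negatives
    l3, l2, l1 = eigvals / eigvals.sum()    # normalised, l1 >= l2 >= l3
    normal = eigvecs[:, 0]                  # direction of least variance
    return {
        "planarity": (l2 - l3) / l1,
        "omnivariance": (l1 * l2 * l3) ** (1.0 / 3.0),
        "surface_variation": l3,            # l3 / (l1 + l2 + l3), sum is 1
        "verticality": 1.0 - abs(normal[2]),
    }


# Two synthetic planar patches: a horizontal floor and a vertical wall.
grid = np.linspace(0.0, 1.0, 11)
u, v = np.meshgrid(grid, grid)
floor = np.column_stack([u.ravel(), v.ravel(), np.zeros(u.size)])
wall = np.column_stack([np.zeros(u.size), u.ravel(), v.ravel()])

floor_feats = covariance_features(
    spherical_neighbourhood(floor, np.array([0.5, 0.5, 0.0]), 0.35))
wall_feats = covariance_features(
    spherical_neighbourhood(wall, np.array([0.0, 0.5, 0.5]), 0.35))
```

Both patches are perfectly planar, so planarity is close to 1 and surface variation close to 0 in each case, while verticality separates them: close to 0 for the floor and close to 1 for the wall. In practice a spatial index (for example a k-d tree) would replace the brute-force search.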
Once all the mentioned features had been extracted from all the datasets, we noticed that omnivariance O, and surface variation C, were presenting different ranges depending on the point cloud densities. Hence, we normalised them in the range 0-1 adopting the modified logistic function defined in [57], to facilitate the generalisation between the pre-trained model and the unseen scenarios.
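The effect of this normalisation can be illustrated with a plain logistic function. The sketch below is not the modified logistic function of [57], whose exact form is not reproduced here; the choice of the per-dataset median as midpoint and of the inverse standard deviation as steepness is a hypothetical parameterisation used only to show how features with density-dependent ranges can be mapped to comparable values in the range 0-1.

```python
import numpy as np


def logistic_normalise(values, midpoint, steepness):
    """Map values to (0, 1) with a standard logistic curve."""
    values = np.asarray(values, dtype=float)
    return 1.0 / (1.0 + np.exp(-steepness * (values - midpoint)))


def normalise_per_dataset(values):
    """Hypothetical per-dataset normalisation (median and spread are assumptions)."""
    values = np.asarray(values, dtype=float)
    spread = values.std()
    if spread == 0.0:
        # A constant feature carries no information; map it to the midpoint.
        return np.full(values.shape, 0.5)
    return logistic_normalise(values, np.median(values), 1.0 / spread)


# The same feature measured on two clouds whose value ranges differ by a
# factor of ten (illustrative numbers, not taken from the datasets).
sparse_cloud = np.array([0.01, 0.02, 0.03, 0.04, 0.05])
dense_cloud = 10.0 * sparse_cloud

norm_sparse = normalise_per_dataset(sparse_cloud)
norm_dense = normalise_per_dataset(dense_cloud)
```

With this parameterisation the two normalised arrays coincide, so a classifier trained on one of them sees the same feature distribution on the other.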

[Figure 7: the selected features on the Table 1 datasets: (a) planarity, which can facilitate the identification of arches and columns; (b) omnivariance and (c) surface variation, which highlight the discontinuities between the walls, mouldings and the windows; (d) verticality, which is essential to distinguish floors from facades.]

Evaluation Method
Traditionally, the evaluation of a classification model is performed by splitting the labelled data into two sets, one used for training and the other one for testing. However, in this way, the evaluation procedure does not assess how the method generalises to a different framework.
In this paper, we first pre-trained a model over a limited portion (about 5M points) of a reference dataset (dataset A: Bologna-S. Stefano, Table 1), then we extended the classification to all the different datasets described in Table 1. In this way, we could evaluate the performance of the classifier at four different levels of generalisation:
1. Within the same dataset: the model trained over a portion of dataset A (model 1) is used to classify the rest of the same dataset (Table 3, Figure 8);
2. Within the same city: model 1 is applied to datasets B and C (Table 4, Figure 9);
3. Changing city: model 1 is applied to two different photogrammetric datasets surveyed in a different city (dataset D (Table 5, Figure 10) and dataset E (Table 6, Figure 12));
4. Changing city and acquisition technique: a modified version (model 2) of the pre-trained model 1 is tested on the laser-scanned dataset F (Table 7, Figure 11). Since the handheld scanning dataset was not provided with RGB values, a re-training round was necessary, including exclusively height and geometry-based features.
Finally, for an exhaustive evaluation of each level, some portions of each classified dataset were compared with the same point clouds annotated manually. The numbers of correct and incorrect predictions were summarised as count values, broken down by class, inside confusion matrices, which allow the visualisation of the performance of the algorithm (Tables 3-7). Each row of the matrix represents the instances in an actual class (ground truth), while each column represents the instances in a predicted class. From each confusion matrix we could then derive the following accuracy metrics:
• Precision: the ratio of correct detections to the total number of detections made by the classifier. It gives information about the model performance with respect to false positives (how many of our detections were right):
$$\mathrm{Precision} = \frac{T_p}{T_p + F_p}$$
• Recall: the ratio of correct detections to the total number of test samples that actually belong to the class. It gives information about the classifier's performance with respect to false negatives (how many did we miss):
$$\mathrm{Recall} = \frac{T_p}{T_p + F_n}$$
• F1 score: it is used to compare the performance of the predictive model, combining the precision and recall values in a single measure:
$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where $T_p$ = true positives (the value on the main diagonal for the class considered), $F_p$ = false positives (the sum of the values in the class column, excluding the diagonal one), and $F_n$ = false negatives (the sum of the values in the class row, excluding the diagonal one). Precision, recall and the F1 score were first computed for each class using the above formulas, then the arithmetic and weighted averages over all the classes were considered.
In addition, a visual examination over the entire datasets was carried out to complete the quality analysis (Figures 8-12).

Results
From the observation of the accuracy metrics (Tables 3-7) and the results (Figures 8-12), we can reasonably infer that both model 1 and model 2 were able to generalise over unseen datasets.
Considering Table 3, even though the training samples are a portion of the tested dataset A, the results are still surprising (0.93 F1-score). In fact, these accuracy metrics far exceed the results of our previous study (0.80 F1-score), which were achieved over a smaller portion of the same Bologna dataset.

Concerning the second experiment (Table 4), the arithmetic average of the metrics is around 0.80. However, a closer analysis shows that low values were achieved for the class other, which represents a small sample of the entire dataset. Hence, if we consider the weighted average, the accuracy reaches 0.89. This problem may be due to the lack of representative annotations for this class within the training set. In particular, in datasets B and C some garbage cans (not present in dataset A) were wrongly classified as facade or column (Figure 9c).
On the other hand, this problem was not present in experiments 3 and 4, where the accuracy values instead decreased because the classes window and moulding were often confused with each other (Tables 5 and 7). This is especially evident where the RGB values were not available in the point cloud (Table 7). A possible solution, in a future study, might be to include both glass and moulding in the class window.
The most problematic generalisation experiment was the one on dataset E, where the F1-score reached only about 0.70 (Table 6). The peculiar type of windows and decorations of the medieval facades (Figure 12) led to several classification problems. To solve these kinds of errors in future work, it might be useful to integrate the training set with samples from this dataset.
The presented methodology and all the results are summarised in this video: https://www.youtube.com/watch?v=_68PdseUh3o.
Moreover, the Random Forest code we used and the pre-trained classifier models are available at: https://github.com/3DOM-FBK/RF4PCC

Conclusions
This paper proved the capability of a pre-trained random forest (RF) model to generalise across different and unseen 3D heritage scenarios. Although only a small number of datasets were evaluated in this study, it is essential to consider that, except for the Trento Lodge case study, each dataset (streets or square) already contains a wide variety of buildings.
The absence of a generalisation study using a standard machine learning approach in this field precludes a practical comparison between similar works. Nevertheless, if we compare the average of our accuracy metrics with other results (e.g., [36] and [37]), we can say that at the moment our approach reaches better accuracy metrics, with less training data and a faster prediction time.
The strengths of the presented approach can be summarised as follows:
• It is possible to classify a large dataset starting from a reduced number of annotated samples, saving time in both collecting and preparing data for training the algorithm; this is the first time that this has been demonstrated within the complex heritage field;
• The generalisation works even when training and test sets have different densities and the distribution of the points in the cloud is not uniform (Experiment 4, Figure 11);
• The quality of the results gives a general idea of the distribution of the architectural classes and could support restoration works by providing approximate surface areas or volumes;
• The output can facilitate the scan-to-BIM process, semantically separating elements in point clouds for the modelling procedure in a BIM environment;
• Automated classification methods can be used to accelerate the time-consuming annotation of a significant number of datasets, in order to benchmark 3D heritage data;
• The RF model used is easy to implement and requires neither high computational effort nor long learning or processing times.
On the other hand, we saw that when the test set does not follow the distribution of the training data, the model does not perform as expected. Starting from this observation and considering previous research experiences [35,38], the authors believe that in the heritage field, particular case studies should be treated individually. However, it is possible, and might be worthwhile, to generate different pre-trained classifier models for macro-categories of architectures (e.g., classical architecture, Greek temples, gothic churches). In this view, we will consider in the future the possibility of generating simulated point clouds from BIM to further accelerate the annotation phase.
To conclude, the heritage domain is a sophisticated testfield for both machine and deep learning classification methods. For this reason, a new benchmark dataset [58] is going to be released in order to boost research activities in this field and become a central resource for the development of new, efficient and accurate methods for classifying heritage 3D data.
Funding: This research received external funding from the project "Artificial Intelligence for Cultural Heritage" (AI4CH), a joint Italy-Israel lab which was funded by the Italian Ministry of Foreign Affairs and International Cooperation (MAECI).

Conflicts of Interest:
The authors declare no conflict of interest.