1. Introduction
The accurate digital representation of historic buildings in HBIM depends on the robust classification of building components. Classification systems provide structured taxonomies to organize building elements and are foundational for consistent data labelling and retrieval. Several classification frameworks are used in the built environment. The Royal Institution of Chartered Surveyors (RICS) New Rules of Measurement (NRM) focus on cost estimation and project planning [1]. In the UK, the Uniclass system (Unified Classification for the Construction Industry) offers a comprehensive hierarchical structure for classifying everything from construction elements to systems and products [2]. In North America, OmniClass (Construction Specifications Institute, 2021) and UniFormat [3] are commonly employed for organizing project information. Other countries have developed their own standards, such as Sweden's CoClass [4] and Denmark's CCS [5]. These systems, while broadly applicable, were primarily designed for contemporary construction and may not directly address the idiosyncrasies of historic buildings.
Heritage buildings often feature unique elements, irregular geometries, and archaic materials that challenge standard classification schemas [6]. Furthermore, many heritage structures, especially from classical periods, are designed around strong principles of symmetry, resulting in repetitive or mirrored components such as columns, windows, and ornamental features [7]. The ability of machine learning algorithms to correctly classify these elements depends on their capacity to recognize the underlying symmetric geometric signatures from point cloud data [8].
Rigid adherence to a formal taxonomy can therefore leave a significant portion of heritage point clouds either unclassified or misclassified, particularly where classes are ambiguously defined or where overlapping features occur [9]. This issue is amplified in HBIM workflows, where semantic classification is not only a documentation exercise but also a highly technical challenge for automation [10]. Recent advances in machine learning demonstrate considerable potential for automating the classification of 3D point cloud data, yet their success remains closely tied to the quality and structure of the classification schema employed [11]. A poorly aligned taxonomy risks confusing the algorithm with indistinct boundaries between classes, whereas a well-designed taxonomy can significantly improve learning clarity, robustness, and accuracy.
This study proposes a bespoke Uniclass-derived data taxonomy to improve machine learning precision in classifying HBIM elements from point clouds. By tailoring the Uniclass classification taxonomy to a heritage case study and integrating it with a hierarchical Random Forest classifier, the research presented here aims to enhance classification accuracy for complex historic fabric.
Section 2 reviews advances in HBIM classification and machine learning, highlighting challenges of symmetry, irregular geometries, and taxonomy design. Section 3 examines results from an initial implementation of the unmodified Uniclass scheme, which revealed systematic misclassifications. Section 4 presents the development of a bespoke Uniclass-derived taxonomy within the research methodology, and Section 5 describes the Random Forest experiment and reports results comparing both taxonomies. Section 6 discusses the implications for classification performance, HBIM automation, and heritage documentation, and Section 7 concludes.
2. Related Work and State of the Art
Recent years have seen increasing use of machine learning to automate point cloud classification for built heritage documentation. Classical machine learning techniques such as decision trees and Random Forests have proven effective for semantic segmentation tasks on 3D data, especially when training data are limited. Decision trees partition data based on feature values in a hierarchical manner, and ensemble variants including Random Forest (RF) improve generalization by aggregating many such trees trained on random data subsets [12,13]. Early applications in the architectural domain used tree-based classifiers for tasks such as façade segmentation [14] and distinguishing structural components in as-built models [15]. In the cultural heritage field, Random Forest was applied to detect surface defects on historic walls from 2D orthophotos [16] and to classify 3D scans of heritage sites with hand-crafted geometric features [17]. These studies demonstrated the practicality of Random Forest for semantic labelling but also highlighted challenges in feature selection and class imbalance, typical of heritage data.
To address the complexity of heritage architecture, multi-level classification approaches emerged. A hierarchical multi-level, multi-resolution RF framework was introduced for the dense point clouds of Milan Cathedral and Pomposa Abbey [18]. The approach achieved high overall accuracy (F1 ~95.8% for detailed façade components) by iteratively classifying at multiple scales. However, it required extensive expert input to define rules and classes, and performance suffered in areas with occlusion or very similar geometry. This research was then extended by incorporating Markov Random Fields to enforce label consistency in class neighbourhoods, reaching an overall accuracy of 95.9% on Chinese heritage architecture [19]. Meanwhile, RF-generated predictions have been combined with rule-based modelling in an HBIM environment to automate geometry creation from classified point clusters, though with some loss of accuracy (average F1 ~84%) [10]. Beyond Random Forest, deep learning techniques are increasingly applied to point clouds [20,21], including recent transformer-based networks for heritage structures. These can learn multi-scale features directly but typically demand large, annotated datasets and significant computation, which are often impractical in historic projects. Thus, classical methods such as RF remain relevant for HBIM when coupled with careful feature engineering and domain knowledge.
A parallel thread of research by the authors examined the suitability of existing building classification systems for organizing heritage information. Uniclass is identified as a promising candidate for HBIM due to its detailed, hierarchical breakdown of construction entities [22]. It consists of multiple tables (e.g., Entities, Elements/Functions, Systems, Products), each providing hierarchical codes and definitions. In comparison, the Industry Foundation Classes (IFC) schema offers an extensive ontology for BIM but is geared toward modern construction and uses a deep object-oriented hierarchy [23]. IFC often groups heritage-specific elements under broad classes, potentially limiting its descriptive precision for historical details. The Finnish CCI (Construction Classification International) and Dutch NL-SfB systems provide alternative structures but similarly emphasize contemporary building components [24]. Studies show that Uniclass tends to provide more granular categories for architectural heritage needs than these counterparts. For instance, in a comparative scenario of replacing a broken windowpane in a historic building, Uniclass could precisely classify the windowpane, whereas IFC and CCI fell back on more generic terms [10,18]. Nonetheless, even Uniclass has gaps: a review undertaken during the Palace of Westminster refurbishment noted missing entries for certain heritage objects, requiring custom extensions [25].
The state of the art suggests that combining a structured taxonomy with machine learning is a viable strategy for HBIM automation. However, the taxonomy must be fit for purpose. If classes are too broad or not aligned with the sensor data's precision, the ML classifier may struggle, leading to misclassifications or the need for excessive training data. Work on ontology-driven segmentation indicates that enriching pure geometric classification with semantic rules or hierarchical context can improve instance recognition in 3D scans [26]. Our study builds on these insights by explicitly tailoring a standard taxonomy (Uniclass) to the heritage case study and demonstrating how this improves a Random Forest's performance in classifying point cloud data. Prior to introducing the bespoke taxonomy, it is first necessary to summarize lessons learned from using the off-the-shelf Uniclass schema in a baseline experiment.
3. Key Lessons from the Previous Experiments
Before developing the bespoke classification, an initial Random Forest classification experiment was conducted using the standard NBS Uniclass structure as a benchmark. The earlier experiment, published by the authors [27], applied class labels derived from the Uniclass Elements/Functions (EF) table to the Queen's House point cloud, as shown in Table 1. Each level of the classifier corresponded to a tier in the EF hierarchy (top-level categories at Level 1, more specific sub-elements at Level 2, and fine-grained components at Level 3). A three-level hierarchical RF classifier was applied to the training dataset points obtained after sub-sampling, shown in Table 2.
The performance varied significantly across these levels, revealing the strengths and shortcomings of the unmodified taxonomy when used for machine learning.
Level 1 (Coarse Level, 50 mm): At the top level, broad classes such as walls, roofs, and floors were classified on a down-sampled 50 mm point cloud. Large structural elements such as walls and roof surfaces retained enough distinct geometric features (planarity, orientation) to be recognized fairly well. However, many Uniclass Level 1 categories were too coarse or semantically abstract for reliable detection. Some classes encompassed disparate physical objects (e.g., a single Uniclass category for all “fittings and fixtures” covering lights, signs, and railings), leading RF to confuse these with larger elements. Small or infrequent classes lacked sufficient training examples, causing misclassifications. The overall precision at Level 1 was low (macro-average precision ~21%) while recall was higher (~44%), indicating that the model often over-predicted certain classes (many false positives). In other words, the classifier could detect most instances of major classes (high recall for walls and roofs) but with poor specificity (low precision) because the class definitions were too inclusive.
Level 2 (Intermediate Level, 20 mm): With a finer 20 mm resolution and narrower class scope under each Level 1 parent, performance generally improved. Classes that were indistinct at Level 1 (for example, Uniclass EF_25 “Walls and barriers”) were split into more coherent sub-classes such as wall, door/window opening, and barrier at Level 2. Focused on these subsets, RF achieved higher accuracy on dominant classes. For example, walls within the EF_25 group were classified with over 75% accuracy and >90% precision in that subset. This confirmed that restricting the classification context, both by geometry and by taxonomy, helps the model. Nonetheless, some Level 2 categories remained problematic. If a parent class from Level 1 had only one meaningful child category, the Level 2 classification became trivial or redundant. Conversely, when parent classes were still somewhat heterogeneous, RF continued to confuse similar sub-elements. For instance, under EF_30 (“Roofs, floors, paving”), distinguishing pavement from floor elements was difficult, as was separating rare classes such as signage or furnishings that shared geometric traits with more prevalent classes. These issues highlighted that the original taxonomy’s grouping did not fully align with the point cloud’s information content.
Level 3 (Fine Level, 5 mm): At the highest resolution (5 mm), RF could theoretically differentiate very fine features (e.g., mouldings, edge details) and, within certain subsets, the Level 3 classifier succeeded. For example, it could separate window components such as glass panes versus mullions or identify small roof details including chimneys versus roofing tiles, provided those were defined as distinct classes. Precision and recall for these distinguishable features improved compared to Level 2. However, moving to this fine scale introduced practical constraints. The number of points increased dramatically, straining memory and computation. In the earlier experiment, attempting to train on the full 5 mm cloud for a large class (all wall points) exceeded the system's memory capacity (~100 million points), causing training to fail. This underscored scalability issues: a straightforward application of the Uniclass taxonomy in full detail is computationally expensive and can be infeasible without aggressive subsampling or segmentation. Moreover, some fine classes still lacked distinctive local geometry; e.g., a wall ornament and the wall itself at 5 mm may only differ in texture or subtle curvature that the RF features did not capture.
These findings demonstrated two main points. First, the standard Uniclass hierarchy, while comprehensive, is not optimally structured for the point cloud classification of a complex heritage building. It groups certain geometrically dissimilar elements, causing misclassification at low levels, and splits others in ways that are irrelevant to scan data. Second, classification accuracy improves when the task is broken into appropriate subtasks, for example, when the taxonomy aligns with the scale and discriminative features of the data. This motivated a manual redesign of the taxonomy to better suit the machine learning context: combining or reassigning classes to reduce confusion at coarse scales and deferring difficult distinctions to finer scales.
The next sections describe the research methodology used in this study to implement this approach, followed by details of the experimental design and the Random Forest classification results using the bespoke taxonomy.
4. Research Methodology
This section explains the research methodology, incorporating the case study context, dataset characteristics, and experimental design used to evaluate the integration of Random Forest classification with a hierarchical bespoke taxonomy for HBIM point cloud segmentation.
4.1. Case Study Dataset
The research is set in the Queen's House, Greenwich, an iconic heritage building managed by Royal Museums Greenwich (RMG). Queen's House was designed by Inigo Jones in the early 17th century and is renowned as the first classical building in England [28]. It now functions as a museum space, showcasing historical artefacts, and it is part of the Maritime Greenwich UNESCO World Heritage Site. This site was selected because it offered a rich variety of architectural components such as columns, decorative ceilings, and grand staircases to test the classification approach. It also provided a real-world scenario where an accurate HBIM model would be valuable for conservation management and public presentation. Access to detailed survey data of the building was made possible through the collaboration with RMG and their digital survey partners. In Figure 1, the original point cloud dataset, indexed in RGB colours, is shown with the 16% labelled subset, which is shown again in RGB in Figure 2 as the manually labelled training dataset.
Creating a training dataset for the Random Forest required manual annotation of the point cloud with class labels. A 16% subset of the full point cloud was isolated for this purpose. We employed spatial stratified sampling: distinct cuboidal regions of the cloud were selected such that all target element categories were represented. This involved clipping out segments (e.g., a corner of the building capturing wall, windows, and a bit of roof; a section of the ground and steps; an interior slice with floors, ceilings, and columns). Within these volumes, points were labelled according to their corresponding building element, as shown in Figure 2.
CloudCompare [29] software (V2.12.4) was used for labelling, taking advantage of its segmentation and region-growing tools to expedite the process. Class balance was deliberately controlled: roughly equal point counts were labelled for each top-level category to prevent the classifier from biasing toward overwhelmingly large classes such as walls or floors. In cases where the standard Uniclass did not have a suitable class for a set of points (e.g., miscellaneous shapes or scanning artefacts), a provisional label was assigned (later formalized as an "other/noise" category in the bespoke taxonomy). This careful ground-truth annotation yielded a representative training set of ~10 million labelled points spread across all defined classes.
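The class balance described above was enforced during manual labelling rather than in code; purely for illustration, the following minimal NumPy sketch shows one programmatic equivalent, capping every class at the size of the rarest class by random subsampling. The function and array names are hypothetical, not part of the project's pipeline.

import numpy as np

def balance_classes(labels: np.ndarray, seed: int = 0) -> np.ndarray:
    """Return indices of a subset with (roughly) equal point counts per class."""
    rng = np.random.default_rng(seed)
    counts = np.bincount(labels)
    target = counts[counts > 0].min()  # cap every class at the rarest class size
    keep = []
    for cls in np.nonzero(counts)[0]:
        idx = np.nonzero(labels == cls)[0]
        keep.append(rng.choice(idx, size=target, replace=False))
    return np.concatenate(keep)

A training subset would then be taken as xyz[balance_classes(labels)] and labels[balance_classes(labels)] for hypothetical coordinate and label arrays.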
Given the large size of the dataset, a multi-resolution approach was adopted both for efficiency and to align with the classification hierarchy. The master point cloud was down-sampled to three levels of spatial resolution using CloudCompare’s octree-based sampling:
50 mm voxel size for Level 1 (capturing overall forms, eliminating fine details);
20 mm for Level 2 (compromise between detail and size, capturing windows, small columns);
5 mm for Level 3 (near full detail for fine components such as ornaments or thin railings).
Each point in a lower-resolution cloud represents the centroid of all points within that voxel in the original cloud. This approach preserves the general shape and distribution of points while drastically reducing quantity (and thus computation) at coarser levels. By aligning the resolutions with classification levels, it was ensured that each RF classifier only considers the level of detail appropriate for the classes it must distinguish. The segmented training data were similarly sub-sampled to these resolutions, with labels transferred accordingly. A five-fold cross-validation scheme was prepared on the training set for robust performance evaluation. In cross-validation, the training points were split into five folds; in each run, four folds trained the model, and the remaining one validated it, rotating so all points had a chance to be in validation. This mitigated overfitting and provided average performance metrics.
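The down-sampling itself was performed with CloudCompare's octree tools; as a minimal sketch of the equivalent operation, the NumPy listing below replaces each occupied voxel by its centroid and transfers labels by majority vote. Names are illustrative, and the Python loop is not optimized for billion-point clouds.

import numpy as np

def voxel_downsample(xyz: np.ndarray, labels: np.ndarray, voxel: float = 0.05):
    """Centroid-per-voxel down-sampling with majority-vote label transfer."""
    keys = np.floor(xyz / voxel).astype(np.int64)
    # Group points by voxel via a lexicographic sort of the integer keys.
    order = np.lexsort(keys.T)
    keys, xyz, labels = keys[order], xyz[order], labels[order]
    change = np.any(np.diff(keys, axis=0) != 0, axis=1)
    starts = np.concatenate(([0], np.nonzero(change)[0] + 1, [len(xyz)]))
    centroids, voted = [], []
    for s, e in zip(starts[:-1], starts[1:]):
        centroids.append(xyz[s:e].mean(axis=0))       # voxel centroid
        voted.append(np.bincount(labels[s:e]).argmax())  # majority label
    return np.asarray(centroids), np.asarray(voted)

# Three resolutions matching the classification hierarchy:
# l1 = voxel_downsample(xyz, labels, 0.050)
# l2 = voxel_downsample(xyz, labels, 0.020)
# l3 = voxel_downsample(xyz, labels, 0.005)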
4.2. The Experiment Research Design
The experiment followed an iterative, multi-phase design to develop and evaluate the bespoke taxonomy. Although Phases 1 and 2 outline the preliminary research experimental steps, the primary focus of this study lies in Phases 3, 4, and 5.
Phase 1: Experiment 1—Classification Systems Review: Initially, four candidate classification schemas (Uniclass, IFC, CCI, ETIM) were reviewed for their expressiveness and suitability to the Queen's House context. This involved a scenario-based analysis inspired by real-world tasks. For example, one scenario examined how each system would classify a "damaged glass windowpane" replacement: Uniclass could pinpoint a code for a window glass pane, whereas IFC only offered a generic IfcWindow object without isolating the pane component. Another scenario looked at procuring a replacement baluster for the Tulip Stairs: Uniclass has distinct entries for balusters, while other systems lump these into broader categories (e.g., a general railing assembly). A brief gap analysis was also conducted by observing all visible elements on the building's facades and checking whether each had a corresponding class in the systems. Uniclass emerged as the most detailed and adaptable scheme, covering ~87.5% of observed elements in our survey, whereas others missed many heritage-specific details. These qualitative findings, aligning with the work carried out by Pupeikis [22], led us to select Uniclass as the base classification system for integration with machine learning.
Phase 2: Experiment 2—Initial RF Classification with Uniclass: In this phase, a hierarchical Random Forest classifier was implemented using the original (unmodified) Uniclass EF (Element/Function) table for labelling. The objective was to identify problem areas in the standard taxonomy when applied to point cloud ML classification. The RF models were trained on the manually labelled data and evaluated at each level. This evaluation and the research results from this phase (Phase 2) are published by the authors [27].

The results quantified the issues anticipated from the taxonomy review: certain classes were consistently confused, and error patterns suggested that some classes should be merged or restructured. For instance, a notable confusion occurred between columns and wall segments, and many column points were misclassified as wall at Level 1, indicating that these two should possibly not be separate at the top level. Another issue was that door and window openings, which Uniclass lists under the wall-related category EF_25, were often misidentified at coarse resolution because their defining features were small (doorframes, glass) and were lost in the 50 mm cloud. These observations directly informed the redesign of the taxonomy in the next phase.
Phase 3: Experiment 3—Data Classification of the Bespoke Taxonomy Development: Using the evidence from the previous research experiments, namely the confusion matrices, feature importance plots, and misclassification visualizations, the Uniclass-based classification schema was manually refined, as shown in Figure 3.
The guiding principle was to group geometrically similar elements together at higher levels and delay differentiation of similar types until a scale at which their differences become machine detectable. This principle is deeply connected to symmetry; at coarse scales, elements with similar high-level symmetries (e.g., the planar symmetry of walls and the cylindrical symmetry of columns) can be confused. The bespoke taxonomy was designed to separate the classes based on how their unique symmetrical or asymmetrical features become apparent at different resolutions. Concretely, the following adjustments were made:
Top-level class reduction: The original EF table has numerous top-level classes (EF_10 to EF_40, etc.), which were reduced to a smaller set of Level 1 classes by merging some categories. For example, instead of separate top classes for walls, windows, and doors, we created a single top-level class encompassing the wall and any embedded openings (since, at 50 mm resolution, a wall with openings looks like a continuous wall surface). Likewise, standalone columns were initially merged with walls at Level 1 in our design, hypothesizing that, at a coarse scale, a column's shape might be mistaken for a wall fragment and should be classed together; this is explained further in the results section. On the other hand, some broad Uniclass categories that contained very different shapes were split at Level 1. For instance, Uniclass EF_30 includes roofs, floors, and paving under one parent. Roofs were elevated to their own top-level class in the bespoke scheme, separate from Floors, because their orientation and context differ enough to separate even coarsely.
Intermediate level tweaks: At Level 2, new sub-classes or modified groupings were introduced to reflect the details visible at 20 mm. For example, the broad wall group (EF_30 in the bespoke scheme) was set to split into wall, door, and window at Level 2, allowing RF to distinguish openings from solid walls once finer geometry and colour cues (glass reflections) become available at 20 mm. Columns, which were grouped with walls at Level 1, were slated to separate into their own class at Level 2 if the initial grouping proved beneficial. Roof at Level 1 was divided into chimneys, roof covering, and rainwater goods at Level 2, reflecting typical roof details. In essence, Level 2 was designed to handle the intra-class variability of each Level 1 category by introducing logical divisions.
An illustration of the final bespoke taxonomy is shown in Figure 3. It maintains the Uniclass coding style (EF numbers) for consistency but reorganizes the hierarchy. The highest level now contains: External Ground (EF_10), Columns (EF_20), Walls (EF_30), Floors (EF_40), Ceilings (EF_50), Roofs (EF_60), Staircases (EF_70), and Other/Noise (EF_80). Each of these expands into one or more Level 2 sub-classes (and Level 3 in some cases). Notably, the Walls top-level class (EF_30) includes what Uniclass would consider separate elements (actual walls, doors, windows, ornamental wall features), which are only distinguished at lower levels. This restructuring was expected to reduce misallocations at Level 1 and improve the clarity of the classification problem presented to RF at each stage.
Phase 4: Experiment 3—Random Forest Integration and Training: With the new taxonomy in place, the training point cloud was relabelled accordingly, merging classes or reassigning labels from the old scheme to the new one. Three Random Forest classifiers (one per level) were then trained on the training data using the bespoke labels. The same feature set and hyperparameters as in Phase 2 were used to enable a fair comparison. Each point's features included 3D geometric descriptors (e.g., curvature, surface normal, roughness, computed from its local neighbourhood of points) and colour information (RGB intensity from photogrammetry, where available). The feature vector had 17 dimensions (Table 3), capturing shape properties including planarity and linearity, which are discriminative for structural vs. ornamental elements. Key RF hyperparameters were 200 trees, a maximum tree depth of 8, a minimum of 6 samples per leaf, and an entropy-based split criterion. These settings were tuned on smaller validation sets and kept constant to focus on taxonomy effects. During training, the RF at each level only sees points and labels relevant to that level. For example, the Level 1 RF is trained to classify points into EF_10 vs EF_20 vs … EF_80; the Level 2 RF for walls sees only points that are within the wall group, classifying them into wall vs door vs window. Model training and inference were implemented in Python (V2.7.4) using scikit-learn and the cuML library (for GPU acceleration).
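A minimal sketch of this per-level setup in scikit-learn is given below, using the hyperparameters stated in the text; the placeholder arrays stand in for the real 17-dimensional descriptors, and the class ids and routing mask are illustrative only.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_level_rf() -> RandomForestClassifier:
    # Hyperparameters held constant across levels, as stated in the text.
    return RandomForestClassifier(
        n_estimators=200, max_depth=8, min_samples_leaf=6,
        min_samples_split=12, criterion="entropy", n_jobs=-1)

# Placeholder data standing in for the real descriptors at each resolution.
rng = np.random.default_rng(0)
X_l1 = rng.normal(size=(1000, 17))
y_l1 = rng.integers(0, 8, 1000)     # Level 1 ids standing in for EF_10..EF_80
X_l2 = rng.normal(size=(1000, 17))
y_l2 = rng.integers(0, 3, 1000)     # walls group: wall / door / window
parent = rng.integers(0, 8, 1000)   # true Level 1 label of each point

rf_l1 = make_level_rf().fit(X_l1, y_l1)
# A Level 2 model is trained only on points inside its parent group
# (here, class id 2 stands in for the EF_30 walls group).
walls = parent == 2
rf_l2_walls = make_level_rf().fit(X_l2[walls], y_l2[walls])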
Phase 5: Experiment 3—RF Implementation and Evaluation of Semantic Classification Prediction: The performance of the hierarchical classification was evaluated on a reserved test portion of the Queen's House point cloud (points not used in training). To isolate the impact of the taxonomy and classifier at each level, a ground-truth routing evaluation was employed. This means that Level 1 was first evaluated by comparing its predictions to ground truth; then, Level 2 was not fed the predicted Level 1 classes, which could carry errors, but rather the true Level 1 labels from ground truth to route points to the correct Level 2 classifier. This way, each level's accuracy is measured independently of upstream mistakes. In a fully automated pipeline, predictions are propagated hierarchically, and the qualitative results of that end-to-end process were also examined; the reported quantitative metrics focus on each classifier's intrinsic performance under ideal routing. The primary metrics were precision, recall, and F1-score for each class, along with overall accuracy. These were computed per level and averaged over cross-validation folds to ensure statistical reliability.
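Continuing the previous illustrative sketch, the ground-truth routing evaluation can be expressed as follows: each Level 2 model is scored only on points routed by their true parent label, so the metrics exclude upstream errors. Names remain hypothetical.

from sklearn.metrics import accuracy_score, classification_report

def evaluate_with_true_routing(model, X, y, parent_true, parent_class):
    """Score a lower-level model on points routed by their *true* parent label."""
    routed = parent_true == parent_class   # ground-truth routing, not predictions
    pred = model.predict(X[routed])
    print(classification_report(y[routed], pred, zero_division=0))
    return accuracy_score(y[routed], pred)

# Score the walls-group Level 2 model in isolation from Level 1 errors.
acc_walls = evaluate_with_true_routing(rf_l2_walls, X_l2, y_l2, parent, parent_class=2)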
Through this experimental design, a thorough assessment was achieved: first confirming Uniclass as the most appropriate classification system, then identifying its shortcomings, and finally testing how a customized classification hierarchy can overcome those issues to improve machine learning outcomes for HBIM. The following section describes the RF implementation on the bespoke classification taxonomy presented in Figure 3.
5. Random Forest Experiment
The core dataset from the initial experiment (Phase 2) was reused, comprising approximately 1.2 billion LiDAR points captured at millimetre-scale resolution for the heritage building. Computational constraints were managed through hierarchical subsampling at three levels (50 mm, 20 mm, and 5 mm). Although the fundamental sampling algorithms and the size of each subsampled cloud remained unchanged, the revised taxonomy affected how points were grouped and labelled at each level.
Hardware and software configurations also remained unchanged. Classifiers were trained on a desktop PC equipped with a Ryzen 5 3600 CPU, 32 GB of DDR4 RAM, and an Nvidia RTX 4090 GPU. Python libraries including scikit-learn, cuML RAPIDS, Polars, Open3D, and PyntCloud were employed to implement out-of-core preprocessing, feature extraction, resampling, and model training workflows. A five-fold cross-validation scheme was maintained to ensure robust estimates of classifier performance.
The Random Forest (RF) experiment utilized a 17-dimensional feature set, shown in Table 3, comprising geometric descriptors (such as curvature, planarity, sphericity, and anisotropy) together with raw RGB values.
Feature extraction was based on fixed-scale neighbourhoods corresponding to each level of spatial resolution: 50 mm for Level 1 (L1), 20 mm for Level 2 (L2), and 5 mm for Level 3 (L3). This ensured consistency in feature interpretation across subsampling levels. The RF hyperparameters remained unchanged, employing an entropy-based criterion, a maximum depth of eight, a minimum of six samples per leaf, a minimum of twelve samples per split, and a total of 200 decision trees. Given the computational expense of training at full scale, exhaustive hyperparameter tuning was considered infeasible. Instead, these parameters were selected based on prior tuning conducted on smaller datasets with similar feature structures.
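The covariance-eigenvalue descriptors named above can be illustrated with the sketch below, which computes linearity, planarity, sphericity, anisotropy, and a surface-variation curvature proxy over a fixed-radius neighbourhood matching the level resolution. This is a generic formulation using SciPy's KD-tree, not a reproduction of the exact 17-feature recipe in Table 3.

import numpy as np
from scipy.spatial import cKDTree

def eigen_features(xyz: np.ndarray, radius: float) -> np.ndarray:
    """Per-point eigenvalue descriptors from a fixed-radius neighbourhood."""
    tree = cKDTree(xyz)
    feats = np.zeros((len(xyz), 5))
    for i, nbrs in enumerate(tree.query_ball_point(xyz, r=radius)):
        if len(nbrs) < 3:
            continue  # too few neighbours for a stable covariance
        # Eigenvalues of the local covariance, sorted descending (l1 >= l2 >= l3).
        l = np.linalg.eigvalsh(np.cov(xyz[nbrs].T))[::-1]
        l = np.clip(l, 1e-12, None)
        feats[i] = [(l[0] - l[1]) / l[0],   # linearity
                    (l[1] - l[2]) / l[0],   # planarity
                    l[2] / l[0],            # sphericity
                    (l[0] - l[2]) / l[0],   # anisotropy
                    l[2] / l.sum()]         # surface variation (curvature proxy)
    return feats

# Neighbourhood radius matching the level resolution: 0.050, 0.020, or 0.005 m.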
5.1. Random Forest Implementation Workflow
The classification pipeline comprised three independent Random Forest models, each corresponding to a classification level (L1, L2, L3), as illustrated in Figure 4. Each model was trained to segment points according to the class labels relevant at its respective level. Where a parent class had multiple child labels, errors at the higher level had the potential to propagate downstream in hierarchical classification workflows.
Apart from the revised class definitions (Figure 3), the methodology replicates the pipeline established in the initial experiment from Phase 2. Each classification level (L1, L2, and L3) is managed by an independent Random Forest model, trained to partition points into the categories pertinent to its respective level. In cases where a parent category encompasses multiple child labels, misclassifications at the parent stage propagate downstream, as observed previously.
For consistency, the L1 classifier is tasked with distinguishing among the top-level categories at a 50 mm voxel resolution. Points assigned at L1 are subsequently routed to the appropriate L2 model, which operates on a 20 mm subsampling. Finally, any classes requiring further refinement (e.g., chimney versus roof covering) are addressed at L3 using a 5 mm resolution. The same ground truth routing protocol was applied during performance evaluation; at each level, points were assigned to their true labels from the annotated dataset, thus ensuring that the metrics reflect the intrinsic difficulty of the classification task rather than the cumulative effect of error propagation.
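For the fully automated pipeline (as opposed to the ground-truth routing used for metrics), the routing logic can be sketched as follows. The sketch assumes, for simplicity, that feature arrays are available at both resolutions for the same points and that child label ids do not collide with parent ids; both are simplifications, and the model dictionary is illustrative.

def hierarchical_predict(rf_l1, l2_models, X_l1_feats, X_l2_feats):
    """Route Level 1 predictions to per-group Level 2 models.

    l2_models maps a Level 1 class id to its Level 2 classifier, or None
    where a group has no further subdivision.
    """
    top = rf_l1.predict(X_l1_feats)   # coarse pass at 50 mm features
    fine = top.copy()
    for parent_class, model in l2_models.items():
        routed = top == parent_class
        if model is not None and routed.any():
            # Refine only the points routed into this branch, at 20 mm features.
            fine[routed] = model.predict(X_l2_feats[routed])
    return top, fine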
A key feature of the updated taxonomy is the reorganization of categories at L1 and L2. Certain classes that were previously conflated (for example, walls merged with windows and doors) have now been separated to acknowledge their distinct geometric characteristics. Conversely, elements such as columns and walls, which exhibited substantial overlap at the 50 mm scale in the initial experiment, were temporarily combined to minimise systematic misclassification. At finer resolutions (L2 or L3), these merged categories are further subdivided, allowing for more granular distinctions once higher-resolution data reveal their geometric specifics.
This realignment is intended to reduce the label misallocations observed in the original experiment, such as the frequent confusion between columns and walls. Furthermore, categories requiring detailed local geometry, such as windows, doors, or small fixtures, are prevented from competing at L1, where a 50 mm resolution may obscure subtle differences.
5.2. Findings from the Random Forest Implementation in Experiment 3
With the bespoke taxonomy applied, the Random Forest classifiers achieved notably better results than in the initial Uniclass-based experiment (Phase 2-Experiment 2). Results are presented level by level, highlighting how the taxonomy changes influenced the model’s precision and recall for different element types.
5.2.1. Level 1 Results with the Bespoke Taxonomy
At the top level (50 mm data, broad classes), shown in Figure 5, the RF model benefitted from the reduced number of categories and more meaningful grouping. Table 4 summarizes the per-class precision, recall, and F1-score at Level 1 under the new taxonomy, along with overall accuracy and macro-averaged scores.
Overall accuracy at Level 1 reached 47.5%, a marked improvement over the ~32.0% achieved with the original Uniclass classification taxonomy from Phase 2, an absolute increase of ~15 percentage points. This indicates that the classifier was able to assign nearly half the points to the correct top-level category despite using minimal geometric detail; previously, it could only correctly classify about one-third of points at this level. The macro-averaged precision (~48%) and recall (~64%) also improved, showing a better balance between over- and under-prediction.
Examining individual classes: Roofs (EF_60) had high recall (91%) and reasonably high precision (67%). RF could identify almost all roof points correctly, likely because roofs have distinct planar inclination and the new taxonomy isolated roofs into their own class, whereas before, roof points had to compete with floor and paving points in one category. Walls (EF_30) show the inverse pattern: very high precision (95%) but low recall (35%). This means that, when the model predicted something as a wall, it was almost always correct, but it missed many wall points, labelling them as other classes. This is likely because many actual wall points were temporarily grouped under other classes at Level 1 (such as column or other) due to shared geometry, or perhaps because some wall surfaces were mis-routed to "Other" if they were heavily occluded or fragmented. Column (EF_20) points have the opposite problem: recall is high (82%) but precision extremely low (4%), implying that the model over-predicted columns; many points labelled as columns were actually walls. This indicates that our decision to keep columns separate at Level 1 might still be suboptimal, since the classifier is picking up on vertical features and often labelling them as column even when they belong to walls.
This suggests that columns and walls are still too similar at this scale to distinguish reliably, a point elaborated further in the Discussion section. Classes including Floors (EF_40) and Ceilings (EF_50) attained balanced precision (~58–68%) and recall (~57–63%), which is reasonable as the horizontal planar surfaces were relatively easy to isolate in the cloud. Staircases (EF_70) remained a difficult class (precision 20%, recall 42%), reflecting the limited representation of staircase points in training and their complex geometry, which, at 5 cm sampling, partly resembles walls or floors. The External Ground (EF_10) class had very high recall (85%) but modest precision (35%): the model correctly caught most ground points, including flat surfaces at low elevation, but also mistakenly tagged other points as ground; parts of low walls or steps, for example, were likely misclassified as ground. The catch-all Other (EF_80) class had middling precision (~39%) and recall (~53%), which is expected given that it scoops up many misfits and leftover points. A moderate portion of points fall into this class correctly as miscellaneous objects, but the model also erroneously dumps some hard-to-classify regular points there.
Qualitatively, the Level 1 classification with the bespoke scheme showed fewer glaring errors than with the original Uniclass labels. Visually, large contiguous sections of roof, ground, and ceiling were correctly classified with uniform labels in the prediction, whereas previously the predicted labels were fragmented. There was still visible confusion between columns and walls; for example, some columns remained classified with the wall label in predictions and vice versa, but other categories such as roof versus wall or ground versus wall were more cleanly separated than before. By simplifying the top taxonomy, Random Forest could make a first cut that was broadly more reliable, setting the stage for improved detail classification at subsequent levels.
5.2.2. Level 2 Results with the Bespoke Taxonomy
At Level 2 (20 mm resolution), the classifier dealt with subclasses within each top-level category. The improvements from the taxonomy refinement continued to be evident here. Importantly, because Level 1 was more accurate, the subclasses that the Level 2 models had to distinguish were defined within a cleaner context. For example, consider the Walls (EF_30) top-level group in the bespoke taxonomy: by the time we reach Level 2, this group includes actual wall surfaces as well as door and window regions, since those were grouped under walls at Level 1. The Level 2 RF for this group was tasked with separating wall versus door versus window points. Figure 6 shows plots of the true labels and predicted labels at Level 2 for walls (EF_30_31), doors (EF_30_35), and windows (EF_30_38) in relation to the RF features.
The model achieved moderate success. It could identify many window points correctly, leveraging the higher density and the presence of transparent glass, which affects point colour and density. Window openings often had slightly lower point density and distinct colour variance. Doors were still harder to pinpoint, partly due to there being fewer training examples and their similarity to wall surfaces when closed. The door and surrounding wall are both vertical planes with similar material in many cases. Nonetheless, performance on door/window separation improved compared to the initial experiment, where doors and windows were underrepresented and often misclassified as walls until the finest level. Now, they were detected to a useful extent at Level 2 itself.
For the Roof (EF_60) category, the Level 2 model separated chimneys (EF_60_65) from roof surfaces (EF_60_61) and rainwater goods (gutters/downpipes, EF_30_31_33 in our coding). The classifier performed well here: chimneys were distinguished with high precision, since they appear as small vertical protrusions on roof surfaces, and gutter pipes, though thin, had unique linear shapes along roof edges. Misclassifications between chimney and main roof were infrequent, showing that the taxonomy successfully isolated a clear sub-class at the right level.
Another positive outcome was seen in the Column (EF_20) category. At Level 2, columns were intended to split, if not already merged with walls, into column shaft versus column head for classical columns. However, given the confusion at Level 1, this became less of a focus; instead, the primary benefit was that any column mislabelled as a wall at Level 1 would be correctly handled under the wall classifier, or vice versa. In practice, at Level 2, the model dealing with the combined wall/column group was able to recognize many column points and could theoretically assign them a separate label if the taxonomy allowed. Since our final taxonomy kept columns as a separate top class, we did not split them further at Level 2; a lack of sufficient column samples made training a sub-classifier impractical. This points to an area for potential future refinement, such as merging columns and walls fully at Level 1 and splitting them at Level 2.
Subclasses of Floors (EF_40), such as internal floor versus external paving, were correctly identified to a large extent. Visual analysis of misclassifications showed that, when Level 1 correctly identified a region as "floor system", Level 2 rarely confused the indoor versus outdoor floor types, because colour and proximity to walls provided context. For example, external paving stones often had different coloration and were adjacent to ground terrain, whereas interior floors were wood or stone with a ceiling above.
For Ceilings (EF_50), which split into ceiling finishes versus fixtures (such as attached decorative elements or lighting fixtures), the model had mixed success. Major ceiling surfaces were labelled well. However, small fixtures such as ventilation grilles or chandeliers were sometimes missed or mislabelled, often ending up under the "Other" category even at Level 2. This is partly due to class imbalance, with only a few instances of those in training, and such fine items might need Level 3 detail to isolate.
Overall, Level 2 classification accuracy increased for most categories relative to the initial Uniclass experiment. The refined Level 1 classes meant that each Level 2 classifier was dealing with a homogeneous set of points. Error rates dropped particularly in cases where the old taxonomy forced dissimilar objects into one class. Some confusion still persisted in classes that had inherently high variability. For example, within Staircases (EF_70), we had sub-classes for steps, balustrades, handrails, etc. The classifier could distinguish steps, which look like small horizontal floors, reasonably well, but railings versus balusters proved difficult. At 20 mm, the spacing and form of stair balusters versus larger railing segments were not fully captured, so the model often confused them or grouped both under a general "railing system" label. This indicated that an even finer resolution (Level 3) or advanced features might be required for such delicate structures.
5.2.3. Level 3 Results with the Bespoke Taxonomy
At the finest level (5 mm), the bespoke taxonomy allowed for the classification of minute details that were previously unattainable. By Level 3, each model was focusing on a very specific subset of points (e.g., distinguishing types of wall ornament or differentiating stair balusters from rails). The increased resolution and narrowed scope yielded some clear wins and also revealed the limits of the approach.
One success was in the External Ground (EF_10) category, where sub-classes for different ground surface types were defined, such as soft landscaping versus hard paving. The RF at Level 3 successfully separated gravel or pebble walkway points from solid pavement blocks, likely due to differences in point roughness and return intensity. We recorded area-under-curve (AUC) values around 0.98 for classifying pavement versus pebble walkways, indicating almost perfect discrimination for those ground types, which is a testament to how distinctive their point patterns are when fully resolved.
For Roofs, Level 3 was used to refine details such as distinguishing individual roof tiles or intricate ornamentation on the roofline if present. In the case study dataset, aside from chimney structures, which were already separated at Level 2, not many additional roof sub-classes were needed at Level 3, so this level mainly confirmed the precise boundaries of roof versus chimney versus gutter.
The classifier could identify many rainwater goods (pipes, gutters) attached to walls, since, at 5 mm, these appear as cylindrical clusters against a flat wall background, a shape that features such as curvature can pick out. Decorative wall trims or mouldings, however, were sometimes only partially detected. They present subtle shape changes on the wall surface that the RF features, which are mostly local, sometimes failed to capture if the neighbourhood size was not perfectly chosen. Some ornamental features were misclassified as plain wall or as "other" noise if they were too small.
Within the Wall (EF_30) hierarchy, a Level 3 classifier was tasked, for instance, with distinguishing wall ornaments from plain wall surfaces, or identifying rainwater downpipes affixed to walls. Figure 7 shows the plot of the true labels and predicted labels at Level 3 for wall ornaments (EF_30_31_32), rainwater goods (EF_30_31_33), and walls (EF_30_31_34) in relation to the RF features.
In the Staircase category, Level 3 attempted to finally resolve balusters versus handrails. The results were modest: some improvement, but not complete separation. The handrail is a continuous curved piece, while balusters are repetitive vertical posts. At 5 mm resolution, the circular cross-section of balusters and the flat profile of the rail can be seen in the point geometry, and the RF did identify segments of each correctly. However, it struggled with parts of the data where context was limited: a fragment of rail might be seen as a short cylindrical piece and confused with a baluster, and vice versa. This indicates the inherent limitation of using strictly local geometric features. The global arrangement of balusters, lining up along a stair, is a strong clue, but the RF model, looking at one point's neighbourhood at a time, cannot directly capture that pattern. A deep learning approach might infer such global structures, or one could incorporate a post-classification rule (e.g., impose regular spacing constraints), but that was outside the current scope.
In summary, the Level 3 classifications showed that the bespoke taxonomy did not introduce any unforeseen difficulties at the fine scale. On the contrary, it enabled the classifier to address fine distinctions that were previously smeared by taxonomy choices. However, they also highlighted that some problems transcend taxonomy: issues such as class imbalance and feature limitations (small objects requiring more context) remained.
It is important to note that the overall end-to-end performance, considering the full pipeline where Level 1 predictions feed into Level 2 and so on, also improved with the new taxonomy. Fewer errors at Level 1 meant that fewer points were routed down the wrong branch of the hierarchy. In the initial Uniclass run, a significant number of points would be misclassified at Level 1 and never have a chance to be corrected. With the bespoke taxonomy, the cleaner separation at the top reduced this error propagation. For instance, previously, with the original Uniclass classification taxonomy, many window points were classified as something else at Level 1 and thus could not be labelled as windows at Level 3; now, more window points stayed in the correct branch through the levels and ended up correctly labelled by the end. Qualitatively, the final coloured point cloud using the bespoke classification scheme was much closer to the ground truth labelling than the one produced using the standard taxonomy.
6. Discussion
The experimental results demonstrate that taxonomy design plays a pivotal role in the success of the machine learning classification of heritage building elements. By reshaping the Uniclass hierarchy into a bespoke structure aligned with geometric discernibility, a hierarchy aligned effectively with the scale at which different symmetrical properties become distinguishable, we achieved a significantly higher classification accuracy at all levels. In particular, the roughly 50% relative increase in Level 1 accuracy (from 32% to 47.5%) is a strong validation of the approach. This improvement cascaded down the classification levels: with more correct top-level assignments, the Level 2 and Level 3 classifiers operated within more appropriate contexts and thus performed better. The bespoke taxonomy effectively reduced the complexity of the task that RF had to solve at each stage.
One key strategy was avoiding direct competition between dissimilar features at coarse scales. The original taxonomy forced the classifier to decide, for example, between a wall, a window, and a door at Level 1, even though a 5 cm point cloud cannot reliably distinguish a window from a wall; the window is mostly an empty opening in the wall with perhaps a few points on the frame. By deferring the wall-versus-window decision to Level 2 (20 mm), we ensured that, at Level 1, RF focused on broader separations that it could handle (vertical surface versus horizontal surface versus roof). Conversely, we merged classes that were visually similar at the coarse level, such as grouping columns with walls, to prevent the model from drawing spurious distinctions. This paid off in reduced confusion for some categories: the column class did not absorb as many wall points once the two were grouped, even though the evaluation still showed some confusion, suggesting that even more merging might be necessary or that the classifier still found a way to separate them internally.
However, the process also revealed trade-offs and remaining challenges. Grouping complex features under one label can introduce high variability within that class. For instance, our Walls (EF_30) class at Level 1 included plain walls, windows, and doors as a diverse set. The result was that, while we avoided misclassifying those as separate classes at Level 1, the walls class itself became internally noisy, leading to its low recall. It only captured the most wall-like of wall points confidently and left many ambiguous ones to be sorted out later. This indicates a limit to merging: if pushed too far, a class can become so broad that the classifier effectively defers classifying those points at all, as indicated by many wall points ending up predicted as something else or thrown into "Other". It suggests that the scheme could be iteratively refined. Some distinctions might need to be introduced earlier if their absence causes more harm than good. For example, windows could perhaps be a separate top-level class if some robust way were defined to detect window openings even at low resolutions, maybe via colour, since glass might produce distinctive colour signatures.
Another challenge is class imbalance. Even though the training data are balanced at the top level, at deeper levels, certain classes are an inherent minority. There are only a handful of staircases or chimneys in the whole building. This affected recall and precision for those classes. The proposed taxonomy does not solve that problem, but it can mitigate it by not over-partitioning data. As a result, small classes are not isolated too early, but ultimately, if an element is rare, the model’s performance on it will be limited by a lack of examples. Future work could integrate techniques such as data augmentation or synthetic point generation for under-represented classes.
Feature and scale limitations also persist. The Random Forest used single-scale geometric features at each level. Some misclassifications, such as railings versus balusters, occurred because these features could not capture the global pattern or context. This is a classic challenge of local feature methods failing to recognise global patterns of translational symmetry, such as the regular repetition of balusters along a staircase. Multi-scale feature descriptors could be introduced: for example, features that consider both a 5 mm neighbourhood and a 50 mm neighbourhood around a point give local detail plus broader shape context, as sketched below. Additionally, deep learning models, such as point-based neural networks, could potentially learn more complex shape signatures and spatial relationships, improving on cases where RF struggled. The results suggest that such advanced methods would likely perform even better if paired with a thoughtful taxonomy. A hybrid system could be designed where a top-level RF quickly segregates broad classes and a neural network then finely classifies within each, or vice versa.
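As a concrete illustration of the multi-scale idea, the sketch below concatenates the illustrative eigen_features() descriptors from the earlier sketch at a fine and a coarse radius; the radii echo the level resolutions used in this study, but the pairing itself is a hypothetical extension, not part of the implemented pipeline.

import numpy as np

def multiscale_features(xyz: np.ndarray, radii=(0.005, 0.05)) -> np.ndarray:
    # Local detail (5 mm) plus broader shape context (50 mm) per point,
    # reusing the eigen_features() sketch defined earlier.
    return np.hstack([eigen_features(xyz, r) for r in radii])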
Nevertheless, Random Forest has inherent limitations in this domain. Its reliance on hand-crafted geometric features makes it less robust to noise, outliers, and incomplete scans compared to deep learning models that can learn multi-scale representations directly. RF classifiers also struggle with very fine-grained distinctions when local features are insufficient, and their performance can degrade if the feature set is not carefully engineered. These challenges were observed in our study in cases such as the misclassification of small ornamental features and confusion between columns and walls at coarse resolution. While our bespoke taxonomy mitigated some of these issues by restructuring the classification task, future work should explore hybrid approaches where RF provides a computationally efficient baseline while deep learning architectures are deployed for particularly noise-sensitive or fine-scale tasks.
From a practical standpoint, there is a tension between formal classification standards and AI classification needs. The proposed bespoke taxonomy had to diverge from the official Uniclass definitions to achieve better performance. This means the outputs are not directly standard-compliant. In a production HBIM scenario, one might need a mapping back to standard Uniclass codes for interoperability. The encouraging news is that Uniclass is flexible in allowing user-defined extensions, and our results reinforce arguments [25] that extensions or modifications are sometimes necessary for heritage projects. The ideal future approach could involve an ontology layer such as CIDOC CRM, i.e., ISO 21127:2014 [30], a cultural heritage ontology that links the machine-oriented classes to standard heritage terminologies, ensuring both machine accuracy and human interpretability.
Finally, the improvements observed come at the cost of additional manual effort in taxonomy development. We effectively performed an expert-in-the-loop optimisation, adjusting the schema based on model feedback. This is feasible in research or for one-off projects, but to generalise the approach, there is a need for more automated ways to tune a classification hierarchy. One possibility is using iterative training that starts with a standard taxonomy, trains the model, analyses error clusters, and algorithmically suggests class merges or splits. This would move toward a self-optimising classification system.
The proposed bespoke Uniclass taxonomy significantly enhanced the Random Forest classification of heritage building point cloud data. The experiment underscores that careful curation of classification labels, informed by both domain expertise and data-driven insight, can dramatically improve machine learning outcomes. The remaining challenges point to avenues for further innovation: combining taxonomy refinement with advanced algorithms and addressing practical deployment constraints.
7. Conclusions
This study demonstrated that adapting a building classification system to the specifics of a heritage dataset can significantly improve automatic classification performance in HBIM. It started with the standard Uniclass system and identified a misalignment between its generic structure and the geometric information contained in a high-resolution point cloud of Queen’s House. By developing a bespoke taxonomy that reconciles formal definitions with on-the-ground reality, higher precision and accuracy in Random-Forest-based building component classification from point cloud data were achieved. The hierarchical classifier using the new taxonomy showed fewer misclassifications at the top level and better differentiation of architectural elements at subsequent levels, compared to the baseline using standard Uniclass labels.
The key insight is that a "one-size-fits-all" classification scheme is often suboptimal for machine learning on historic buildings. Heritage contexts benefit from customised taxonomic approaches that consider which features are detectable at what scales. The improved performance at Level 1 was especially critical: getting the broad classes right set up the downstream classifiers for success. The proposed approach effectively reduced the error propagation that plagued the initial model. Furthermore, the methodology used in this study, an iterative refinement cycle of evaluate ML -> adjust taxonomy -> re-evaluate, provides a template for others aiming to optimise classification in similar projects.
There are, however, limitations to acknowledge. The bespoke taxonomy was crafted for Queen’s House and may require further adaptation for other buildings with different styles or components. Some improvements were achieved at the expense of strict conformity to standard Uniclass, which may pose interoperability questions. The Random Forest approach, while interpretable and data-efficient, ultimately faces scalability issues for full-scale dense point clouds and struggles with very fine or globally contextual features. Despite the efforts made in the research to balance classes, some rare elements remained hard to classify due to insufficient examples.
The findings suggest that further refinement of the taxonomy could lead to additional performance gains at Level 1. In particular, separating walls from other opening features at the coarsest resolution and postponing certain subdivisions (e.g., columns versus walls) until Level 2 or Level 3 may further reduce confusion. Post-processing heuristics, such as spatial smoothing via k-nearest neighbours, could also help rectify sporadic misallocations. Beyond these incremental adjustments, adopting multi-scale feature extraction methods or more sophisticated deep learning architectures, which can capture both local and global patterns, represents a promising avenue for comprehensive improvements. Moreover, exploring distributed or out-of-core training methods remains crucial for managing extremely large point clouds.
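The k-nearest-neighbour smoothing suggested above admits a very simple form, sketched below under the assumption of NumPy/SciPy arrays of coordinates and predicted integer labels; the neighbourhood size k is an illustrative choice, not a tuned value from this study.

import numpy as np
from scipy.spatial import cKDTree

def knn_smooth(xyz: np.ndarray, labels: np.ndarray, k: int = 15) -> np.ndarray:
    """Replace each point's label by the majority vote of its k nearest neighbours."""
    _, idx = cKDTree(xyz).query(xyz, k=k)   # idx has shape (n_points, k)
    # Majority vote over each neighbourhood (the point itself is included).
    return np.array([np.bincount(labels[row]).argmax() for row in idx])

Applied as a post-processing pass, such a filter would suppress isolated mislabelled points inside otherwise uniform regions while leaving class boundaries largely intact.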
Future work should explore integrating multi-scale feature extraction and deep learning techniques to overcome these challenges. For instance, a point cloud neural network could be trained on the same taxonomy, potentially handling subtle distinctions, such as ornamentation or repetitive patterns, more gracefully. Another avenue is to implement out-of-core or distributed learning to handle billions of points without aggressive subsampling, thereby preserving detail for classification. Additionally, the concept of an adaptive taxonomy could be further developed by using algorithms to suggest taxonomy modifications for new datasets, reducing reliance on manual expertise.
In summary, the research contributes a novel perspective on HBIM element classification. Rather than treating the classification system as a static input to machine learning, we treat it as a tuneable parameter that can be optimised alongside the algorithm. By bridging the gap between a heritage-informed taxonomy and a data-driven learning process, we move closer to efficient, accurate, and scalable automation for historic building modelling. This synergy is essential for the next generation of HBIM tools that will support heritage conservation, analysis, and storytelling through rich and reliable digital representations.