Automatic Removal of Non-Architectural Elements in 3D Models of Historic Buildings with Language Embedded Radiance Fields

Neural radiance fields have emerged as a dominant paradigm for creating complex 3D environments incorporating synthetic novel views. However, 3D object removal applications utilizing neural radiance fields have lagged behind in effectiveness, particularly when open set queries are necessary for determining the relevant objects. One such application area is in architectural heritage preservation, where the automatic removal of non-architectural objects from 3D environments is necessary for many downstream tasks. Furthermore, when modeling occupied buildings, it is crucial for modeling techniques to be privacy preserving by default; this also motivates the removal of non-architectural elements. In this paper, we propose a pipeline for the automatic creation of cleaned, architectural structure only point clouds utilizing a language embedded radiance field (LERF) with a specific application toward generating suitable point clouds for the structural integrity assessment of occupied buildings. We then validated the efficacy of our approach on the rooms of the historic Sion hospital, a national historic monument in Valais, Switzerland. By using our automatic removal pipeline on the point clouds of rooms filled with furniture, we decreased the average earth mover’s distance (EMD) to the ground truth point clouds of the physically emptied rooms by 31 percent. The success of our research points the way toward new paradigms in architectural modeling and cultural preservation.


Introduction
Techniques for the automatic creation of 3D models directly from photos or laser scans have shown success in constructing the antecedents for architectural structural analysis [1][2][3]. In the preservation of heritage buildings, these techniques are even more relevant due to the often complex design of target structures and the historical importance of decorative or facade elements; these elements are not captured when creating 3D models from floorplans without substantial manual work [4]. Automatically creating 3D models of the exteriors of buildings has been well researched in recent years, but progress on interior modeling has been slower [5][6][7]. Capturing highly detailed 3D models directly from the physical ground truth presents a substantial problem for producing structurally relevant models of interior spaces due to the presence of non-architectural elements like furniture [8,9]. Not only are these interior elements not relevant for structural integrity testing or truly apprehending the naked building, but in occupied buildings they can also present privacy concerns for the current residents [10]. One option for dealing with these interior elements is simply to physically remove them by hand before modeling the building; however, this is inconvenient, expensive, and time-consuming. It is also possible to manually clean the 3D model after capture, but this is likewise time-consuming, thus preventing these techniques from being used at scale; it also requires human annotators to comb through the interior spaces with a high level of focus, potentially exposing private details. The need for the automatic, privacy-preserving creation of 3D models of interior spaces motivates the techniques explored in this paper; we propose a pipeline from lightweight image capture to 3D modeling with the 3D native removal of non-structural elements.
This article focuses specifically on neural radiance field (NeRF) based approaches for 3D modeling. NeRFs are a 3D modeling paradigm wherein images of a scene and their corresponding camera poses are used to train a neural network to predict the 3D model of the scene without a 3D ground truth. Though rival photogrammetric approaches are able to show strong reconstruction performance in many scenarios, they do have multiple drawbacks in their large storage size, lack of novel view synthesis ability, and lack of native methods for manipulation or understanding of the 3D content. To ameliorate these concerns and also to explore the potentialities of more novel methods of 3D reconstruction, we utilized strictly neural radiance field-based approaches following Pepe et al. [11], Llull et al. [12], and Croce et al. [13][14][15], who demonstrate the feasibility of utilizing NeRFs specifically within the cultural heritage domain. In particular, we employed language embedded radiance fields (LERFs) to introduce querying ability to our models and make the identification of extraneous objects possible. Furthermore, if 3D modeling follows the general trend of 2D imaging in recent history, it is increasingly likely that the capture and manipulation of 3D models for cultural heritage preservation will rely on deep learning-based approaches, making the creation and study of these techniques crucial in the present day.
The subject of our study was the old hospital of Sion (L'Ancien Hôpital de Sion) in Valais, Switzerland. This building has substantial historical significance in Switzerland, as the first hospital in Sion was constructed by the Monks of St. John by at least 1164. The various hospitals of the town went through many iterations before centralizing in this particular building, which is shown in Figure 1, in 1781 [16]. The chapel of the hospital is listed as a national historic monument, and the rest of the building (much of which was built later) is regionally listed [17]. The primary method of construction is natural stone masonry, about which much research has been conducted regarding its vulnerability to seismic activity [18]. This is particularly relevant considering that the hospital is in the most seismically vulnerable region of Switzerland, and in fact an earlier seismic analysis using a manually created 3D model was performed on the hospital in 2015, which is displayed later in the paper in Section 2.2 on structural assessment [17]. In 2020, the government of Sion announced a twenty-five million dollar renovation on the hospital designed to stabilize the building and retrofit it to become the new site of the city administration [19]. The building was mostly empty when we began our data capturing process, but around one-fourth of the rooms were being utilized by a music school. Though this building had many interesting architectural features, the most beneficial aspect for our research was that the building was being emptied in preparation for the future renovations. This allowed us to establish a structural ground truth for the building without any furniture or objects, in addition to scans of the same rooms while they were in use and filled with a myriad of complex objects.
Throughout this article, we explore the efficacy of our modeling techniques in this realistic context by showing the ability of our LERF-based approach to apprehend the actual underlying 3D structure and to effectively remove various unwanted items from the scene in preparation for the creation of the structural model for integrity testing.

Materials and Methods
To achieve our desired objective of testing the usefulness of NeRF-derived approaches for privacy-preserving automatic point cloud creation, we utilized numerous techniques which we explain in this section. In particular, we cover basic neural radiance fields and the offshoots Nerfacto and LERFs. We also detail the downstream LOD modeling methods which will inherit the point clouds, as they are relevant to many of the specifics of how we chose to develop our pipeline and modeling techniques.

Neural Radiance Fields
Neural radiance fields (NeRFs) are designed to approximate 3D information about a scene given a set of input images of that particular scene and the camera pose relative to each input image [21]. The camera poses are computed from the images using structure-from-motion techniques such as COLMAP [22] or by relying on information from the acquisition device. To be more precise, NeRFs utilize neural networks as a backbone to learn an approximate function which accepts a 3D coordinate vector as well as a viewing direction and outputs a predicted color vector and a density at that coordinate. The most general form of this function is

$$F : (x, y, z, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma)$$

where $(x, y, z)$ is the coordinate vector, $\mathbf{d}$ is the directional vector, $\mathbf{c}$ is the color vector, and $\sigma$ is the density.
In the training of a NeRF, rays are cast through a target pixel in an input image from the viewing direction of the camera into a synthetic 3D scene. Three-dimensional points are then sampled along that ray to be passed to the neural network described above along with the relative directional vector from the input image. The accumulated color and densities which are predicted at the location of each sampled point by the network are volumetrically rendered into the predicted pixel color of the target pixel. The predicted pixel is then compared with the target pixel to optimize a reconstruction loss. Through this training process, the neural network learns the latent 3D information which is described by the image dataset for one particular scene, and thus can be used to generate novel views of that scene effectively.
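The rendering step described above can be sketched in a few lines of NumPy. This is a minimal illustration of standard volumetric compositing along a single ray, not the implementation used in this work; the function names `render_ray` and `reconstruction_loss`, the per-sample spacing `deltas`, and the choice of a mean squared error loss are assumptions made for the example.

```python
import numpy as np

def render_ray(colors, densities, deltas):
    """Composite per-sample colors and densities along one ray into a pixel color.

    colors:    (N, 3) predicted RGB at each sampled point
    densities: (N,)   predicted density (sigma) at each sampled point
    deltas:    (N,)   distance between consecutive samples (assumed given)
    """
    colors = np.asarray(colors, dtype=float)
    densities = np.asarray(densities, dtype=float)
    deltas = np.asarray(deltas, dtype=float)

    # Opacity contributed by each sample.
    alpha = 1.0 - np.exp(-densities * deltas)
    # Transmittance: fraction of light that reaches sample i unobstructed.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alpha[:-1])])
    weights = transmittance * alpha
    return weights @ colors, weights

def reconstruction_loss(predicted_pixel, target_pixel):
    """Mean squared error between the rendered and the observed pixel color."""
    diff = np.asarray(predicted_pixel, float) - np.asarray(target_pixel, float)
    return float(np.mean(diff ** 2))
```

In this sketch, a sample with very high density near the camera absorbs almost all of the weight, so points behind it contribute nothing to the pixel; this is the same mechanism that leaves occluded walls unobserved in the later sections.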

Nerfacto
Since the publication of the original neural radiance field paper, there has been an explosion of techniques utilizing some variant of the standard NeRF architecture [23]. Though the utility and interest of many of these offshoots are not in doubt, we based our research on the Nerfacto architecture [24].
Nerfacto was developed as a synthesis of multiple other NeRF approaches during the creation of the Nerfstudio toolset. Nerfstudio is an open source NeRF development software which provides viewers and standard components of various NeRF architectures. In reimplementing multiple architectures (MipNeRF-360 [25], NeRF-- [26], Instant-NGP [27], NeRF-W [28], and Ref-NeRF [29]) in a standardized, modular fashion, the authors were able to then use various components to create a model, Nerfacto, that is both fast and relatively accurate, close to or surpassing the state of the art for NeRFs in most quality metrics while training substantially faster [30,31]. There are multiple important differences between a base NeRF and the Nerfacto approach in the optimization of camera positions, ray sampling techniques, scene registration, hashgrid encoding, and the generation of normals. We will leave an in-depth explanation of the Nerfacto innovations to the original paper, but we include here a diagram of the source of various components in Figure 2.

Language Embedded Radiance Fields
Generating realistic 3D scenes is an important standalone task, but in order to understand the content of the scene it is necessary to encode semantic features into the 3D representation or a proxy for the 3D scene. Three-dimensional semantic information underlies many tasks in computer science, and is of course necessary for our architectural object removal task. One approach to injecting semantics into 3D spaces, in this case linguistic information, is to utilize a language embedded radiance field [32]. The general concept of a LERF is to instantiate a separate network which mimics the NeRF framework by accepting a 3D coordinate and camera pose but is trained to output the appropriate language embedding vector at that position.
In general, 3D semantic features are predicated on assigning some descriptive information about a scene to particular locations within the scene. There are multiple forms of assignment schemes including using object labels with bounding boxes that contain that object, per voxel assignments of particular objects, or continuous features like those used by LERFs where there are no discrete boundaries between objects. While 3D features can encode various sorts of data such as temperature or other sensor measurements, linguistic features specifically are useful because they allow for interaction with various other language-based tools and make querying or interacting with the scene natural for humans. Linguistic features can take multiple forms, such as explicit dictionaries of descriptive captions, but representing them as embedding vectors offers a broader range of querying possibilities. Since language embedding vectors allow fast, effective comparisons of any linguistic information (e.g., a description of a particular object), having linguistic embeddings assigned to the 3D points throughout a scene makes it possible to understand and interact with the scene in a robust way.
The underlying system for creating language embeddings which is used in LERFs is CLIP (Contrastive Language Image Pretraining) [33] or alternately its open source relative OpenCLIP [34]. These embeddings form the basis of many systems that rely on a joint understanding of language and images, such as Stable Diffusion, because they are effective with regard to open set or long tail queries which do not fit particularly well into typical classes. This makes CLIP particularly useful for LERF, which utilizes the input images for the neural radiance field to generate its language embeddings. The largest difficulty with this approach is that CLIP embeddings, while image size agnostic, can only be produced on a full image. In other words, it does not give pixel wise embeddings or segmented embeddings, but instead a single embedding per image. However, in order to find the loss and thus train the network that predicts the CLIP embedding for 3D positions in a similar structure to a regular NeRF, each pixel in the input image must have a corresponding CLIP embedding. The way the LERF authors found a CLIP embedding value for each pixel was by creating a pyramidal stack of image subsets: they divided the original image into smaller sections before subdividing each of the smaller sections into even smaller subsets. After creating the stack of image subsections by repeating this operation up to a particular depth of subdivisions, they used CLIP to find the embeddings of each subset of the original image. When a neural network is used to predict the CLIP embedding for a particular pixel, the loss is then calculated for the prediction relative to the embedding for each of the images that contain the target pixel. Through training with this pyramid of image embeddings, the network learns to predict the best embedding at each 3D position due to the overlapping views of each 3D point in all the image subsets, both within a single stack, across stacks, and across images.
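The pyramidal subdivision described above can be illustrated with a small sketch. It only builds the stack of image subsections and looks up which of them contain a given pixel; the CLIP call itself is omitted. The 2 x 2 split at each level and the names `pyramid_tiles` and `tiles_containing` are assumptions for illustration and are not taken from the LERF codebase, which may construct its crops differently.

```python
def pyramid_tiles(height, width, depth):
    """Return (level, top, left, bottom, right) boxes for a subdivision pyramid.

    Level 0 is the full image; each further level splits every tile of the
    previous level into 2 x 2 smaller tiles (an assumed subdivision scheme).
    """
    tiles = []
    for level in range(depth + 1):
        n = 2 ** level  # number of tiles per side at this level
        for i in range(n):
            for j in range(n):
                top, bottom = i * height // n, (i + 1) * height // n
                left, right = j * width // n, (j + 1) * width // n
                tiles.append((level, top, left, bottom, right))
    return tiles

def tiles_containing(tiles, row, col):
    """Select every tile whose box contains the pixel at (row, col)."""
    return [t for t in tiles if t[1] <= row < t[3] and t[2] <= col < t[4]]
```

Each pixel falls in exactly one tile per level, so the embedding predicted for that pixel would be supervised by one image-subset embedding per level of the stack.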
The trained LERF network is then easy to query; the system converts the text query into a CLIP embedding $\phi_{query}$ and weighs it against the embeddings $\phi_{lang}$ at each 3D coordinate. The primary caveat to this query is that a distance measure between two embedding vectors of a large enough size is arbitrary without a reference point or comparison because the embedding space can be rather sparse. The original LERF authors solved this by implementing a default dictionary of vague, unspecific labels dubbed canonical phrases: "object", "things", "stuff", and "texture". The query relevance score is then determined by weighing how close the embedding at any position is to the query relative to its distance to the canonical phrases by cosine similarity. Expressed mathematically, this score is

$$\min_i \frac{\exp(\phi_{lang} \cdot \phi_{query})}{\exp(\phi_{lang} \cdot \phi^{i}_{canon}) + \exp(\phi_{lang} \cdot \phi_{query})}$$

where $\phi^{i}_{canon}$ is the embedding of the $i$-th canonical phrase.
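As a hedged illustration of this scoring, the sketch below computes a relevancy value per 3D point using the pairwise softmax between the query and each canonical phrase, taking the minimum over canonical phrases as in the LERF paper. The function name `relevancy`, the array shapes, and the explicit normalization step are assumptions for the example; real embeddings would come from CLIP rather than the toy vectors used in the usage note.

```python
import numpy as np

def _unit(v):
    """Scale vectors to unit length so dot products are cosine similarities."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def relevancy(phi_lang, phi_query, phi_canon):
    """Relevancy of each point embedding to a query, relative to canonical phrases.

    phi_lang:  (N, D) language embeddings predicted at N 3D points
    phi_query: (D,)   embedding of the text query
    phi_canon: (K, D) embeddings of the K canonical phrases
    Returns an (N,) array of scores in (0, 1).
    """
    phi_lang, phi_query, phi_canon = _unit(phi_lang), _unit(phi_query), _unit(phi_canon)
    query_term = np.exp(phi_lang @ phi_query)      # (N,)
    canon_term = np.exp(phi_lang @ phi_canon.T)    # (N, K)
    pairwise = query_term[:, None] / (canon_term + query_term[:, None])
    return pairwise.min(axis=1)                    # worst case over canonical phrases
```

With two-dimensional toy embeddings, a point aligned with the query and orthogonal to the only canonical phrase scores e / (1 + e), roughly 0.73, while a point aligned with the canonical phrase scores roughly 0.27. A score of 0.5 means the point is no closer to the query than to a canonical phrase.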

Structural Integrity Assessment
Structural integrity analysis and assessment is a crucial task in civil engineering for ensuring the stability of buildings both before they are built and throughout their lifespan. The backbone of many contemporary computational structural integrity testing techniques are 3D models of the particular building being tested, such as the one in Figure 3. Often, 3D models are inferred from the floor plans; however, in many situations this is either not possible or sub-optimal. For many buildings, the floor plans are either not representative of the physical building (often due to renovations), have a non-standard illustration style which is difficult to automatically parse, or simply do not exist in the first place. Furthermore, some of the buildings in need of structural integrity assessment have been damaged by natural disasters or the passage of time, and thus the details of this damage are not present in the floor plans created during the initial construction. These various difficulties combine to make it essential to be able to create 3D scans directly in reference to the real building.
Figure 3. A manually created structural model of our test case building before and after an earthquake simulation. Note that the structure is missing floors, many interior walls, and much of the facade detailing [17].

Modeling Occupied Buildings
Many relevant computational integrity testing techniques specifically derive their models from a scan of the exterior walls of a building. However, since many important structural walls, beams, or other elements are inside of the buildings, it is also crucial to be able to model the interior as well as the exterior as a unified model. Though this poses some problems in an unoccupied building, it is a particularly difficult and pernicious problem when the building is currently occupied, as most of the target buildings are.
In an occupied building, especially the large ones that are of most interest, it is usually not an option to empty the entire building of furniture. It would be extremely time-intensive, expensive, and logistically challenging to do this with one building, let alone with multiple buildings. Due to this, the only true option for interior scanning of large buildings or sets of buildings is to do so while they are occupied with non-structural objects such as furniture and artwork. It is therefore crucial to remove these non-structural objects for two primary reasons: functionality of the model for integrity testing and privacy preservation for the current occupants of the building.
Since any non-structural elements present in the model would only degrade the efficacy of the integrity tests, it is self-evident that the extraneous objects which do not affect the actual building must be removed. It is possible to perform this sort of segmentation by hand; however, this is very labor-intensive and thus does not scale well. Another motivation for removing these objects is privacy preservation, especially in the case of modeling residential buildings. Many people will not consent to having the layout and contents of their homes publicly disclosed in a 3D model which will be saved for posterity and future testing. With these two considerations in mind, it is necessary to build modeling pipelines which can detect and remove non-structural elements automatically.

Inpainting versus Removal
A key consideration for any automatic removal pipeline is the choice between inpainting and removal. In some applications, it is imperative to inpaint the removed information, i.e., infer what should actually be in the missing space of the data and attempt to generate some realistic replacement. However, in our case, it is actually not necessary to inpaint because the building representation which is ultimately used is based on inferred wall placement rather than the native point cloud or mesh representation. In this case, walls can be inferred from just the points that model the non-occluded walls or floors which were behind the objects that were removed. Since walls can be inferred without inpainting and because inpainting introduces the danger of generating artifacts which could distort wall placement, we choose to only remove the non-relevant points rather than to infer the sections of the scene which were previously occluded by the removed objects.

Automatic Generation of Simplified Building Geometry
The practice of assessing structural integrity in buildings leads to the automatic generation of building geometries in terms of Level of Detail (LOD) models. These models represent simplified versions of the buildings and denote a reduced complexity 3D representation [35]. Current research focuses on generating such models for the building exterior. For instance, Pantoja-Rosero et al. [1] presented an automated method for generating LOD3 building models, which capture exterior geometry, including openings, using structure-from-motion and semantic segmentation techniques. In this approach, the point cloud is clustered into planar primitives, and the intersection of these primitives yields potential faces for constructing a polygonal surface model of the building's exterior. Correct faces are then selected by solving a linear programming problem, considering the goodness of fit relative to the point cloud, the coverage of the point cloud on the faces, and the complexity of the LOD2 model. Subsequently, the LOD2 model is upgraded to LOD3 using a convolutional neural network designed to classify openings in images, such as windows or doors, and projected to 3D space using camera poses retrieved from the structure-from-motion. A simplified representation of the LOD2 model construction process is provided in Figure 4. A similar approach, involving clustering interior point clouds into planar primitives, can be applied to generate simplified geometry, which can then be combined with the exterior for generating LOD4 models. This approach is currently being developed in parallel with the present work by the authors, and some results are displayed in Figure 5.

LERF-Based Open Set Automatic Removal
In order to solve the challenges of the structural integrity modeling outlined above, we introduce a LERF-based automatic removal technique built on top of Nerfstudio functionality which is detailed in Figure 6. Our approach is effective on a large range of exotic objects, and in small dataset sizes where there are few captured views of particular objects. It can be run with no further human input after the initial capture of data and thus allows for the privacy-preserving capture of accurate interior structural models of buildings when combined with the techniques detailed above.

Open Set Queries and the Default Dictionaries
A fundamental component of the general effectiveness of CLIP classification, and thus LERF classification, is the strong open set capability (in other words, its ability to classify an extremely diverse set of objects with no limitation on what those objects are). Since we are dealing with removing a wide range of unknown objects from a myriad of views, in our context it is particularly useful to utilize CLIP rather than a more traditional object classifier, even one specialized for interior objects and furniture.
In the traditional LERF workflow, queries are entered through the GUI at rendering and then compared against a dictionary of canonical phrases to find a relevancy score toward the present query. Since we are seeking automatic detection without supervision, and we have a consistent, coherent yet broad range of objects we want to remove, we need to present a persistent set of negative queries. Furthermore, since we also know what we want to preserve (walls, floors, structural beams, etc.), we can present a persistent set of positive queries which we want to keep. By comparing the highest relevancy within each of these two sets of possible classifications, we can determine simply whether a particular point should be removed or preserved. This set of phrases was subject to substantial prompt engineering to find the optimal dictionaries and is presented in Table 1.
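The decision rule described above can be sketched as follows, assuming relevancy scores have already been computed for every point against each negative phrase and each positive phrase. The function name `removal_mask`, the array layout, and the choice to keep a point when the two maxima tie are assumptions for the example; the actual phrase dictionaries are those given in Table 1 and are not reproduced here.

```python
import numpy as np

def removal_mask(negative_scores, positive_scores):
    """Flag points whose best negative query beats their best positive query.

    negative_scores: (N, K_neg) relevancy of each point to each phrase to remove
    positive_scores: (N, K_pos) relevancy of each point to each phrase to keep
    Returns a boolean (N,) array, True where the point should be removed.
    Ties are resolved in favor of keeping the point (an assumed convention).
    """
    best_negative = np.asarray(negative_scores, dtype=float).max(axis=1)
    best_positive = np.asarray(positive_scores, dtype=float).max(axis=1)
    return best_negative > best_positive
```

For instance, a point that scores 0.9 against a hypothetical negative phrase such as "chair" but only 0.4 against a positive phrase such as "wall" would be flagged for removal, while a point scoring 0.3 and 0.8 respectively would be preserved.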

Three-Dimensional Native Segmentation and Point Removal
Many NeRF-based 3D segmentation approaches, such as SPIN-NeRF [36], actually rely on 2D native segmentation, inpainting, and removal as the backbone of their approach. Though 2D inpainting or segmentation has very strong results on individual images of a scene, this approach introduces a substantial number of artifacts because the segmentation maps are inconsistent on an image-by-image basis within the same scene [37]. This results in a particular piece of furniture being mapped and removed in one image but not another, leading the NeRF model to try to learn to produce a piece of furniture from one view but not from another and thus creating ghostly floating artifacts where the object ultimately should have been removed. Another paradigm in object removal follows [38], where objects are learned distinctly from the scene, and there is a discrete representation for each individual object. This works well in simple scenes with clearly delineated objects, but cannot scale to very complex scenes with large numbers of objects. One of the key advantages to using a LERF-based approach is that each point is being evaluated and segmented in the actual 3D space, and thus it avoids many of the issues associated with 2D removal and inpainting in addition to forgoing the need for discrete object representations. When removing points, we calculate a relevancy score for each point in 3D space after training and then simply blacklist every point which is classified as a negative sample when rendering the point cloud from the NeRF. We also experimented with using the blacklist during training, so that no loss was calculated for the RGB NeRF on points in the hashgrid which were eventually going to be blacklisted anyway, but we determined that the marginal improvement in quality of the non-blacklisted points was not worth the substantial slowdown in speed.
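A minimal sketch of the blacklisting step at export time is given below. It assumes the point cloud is available as arrays of positions and colors together with a per-point boolean flag marking negative samples; the function name `drop_blacklisted` and this array layout are hypothetical and do not correspond to a Nerfstudio API.

```python
import numpy as np

def drop_blacklisted(points, colors, blacklist):
    """Return only the points (and their colors) that are not blacklisted.

    points:    (N, 3) point positions sampled from the trained field
    colors:    (N, 3) RGB value of each point
    blacklist: (N,)   True where the point was classified as a negative sample
    """
    points = np.asarray(points, dtype=float)
    colors = np.asarray(colors, dtype=float)
    keep = ~np.asarray(blacklist, dtype=bool)
    # No inpainting is attempted: removed points simply leave a gap.
    return points[keep], colors[keep]
```

Because the filter operates on individual 3D points rather than on per-image masks, the same point is either present or absent in every view, which is what avoids the view-inconsistent artifacts discussed above.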

Data, Preprocessing, and Postprocessing
Our dataset consists entirely of architectural scans of an existing building that reproduces scenarios similar to our actual use case but where there were no true privacy preservation issues. In terms of image preprocessing, we only performed the steps outlined in the collection section and removed images with a high level of blur from the LERF input dataset. In terms of point cloud postprocessing (after the generation and cleaning), we normalized all the point clouds and afterwards, where appropriate, trimmed the point clouds to be the same size. This trimming was performed primarily on the point clouds of the filled rooms, as they always had more points than the automatically cleaned rooms due to point removal and more than the empty rooms due to them holding more objects/information. We chose the points to remove in post-processing by finding the points farthest from the origin point of the normalized point clouds, and then trimming those by a mean of 12,696 points, leaving an average point cloud size of 1,004,997 points. Since neural radiance fields are known to generate outlier points more than photogrammetric or laser scanning-based approaches, it was actually a benefit to the filled rooms in terms of distance metrics to have their outliers removed. Since our interest is in making the automatically cleaned rooms as close as possible to the physically empty room, and this actually made our task more difficult, we considered this an acceptable trade-off to have more accurate distance measures not affected by disparities in the number of points.
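The postprocessing described above might look like the following sketch. The text does not specify the normalization, so centering on the centroid and scaling to a unit maximum radius is an assumption, as are the function names `normalize_cloud` and `trim_to_size`.

```python
import numpy as np

def normalize_cloud(points):
    """Center a point cloud on its centroid and scale it to unit maximum radius.

    This particular normalization is an assumed choice for illustration.
    """
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    return centered / np.linalg.norm(centered, axis=1).max()

def trim_to_size(points, target_size):
    """Drop the points farthest from the origin until target_size points remain."""
    points = np.asarray(points, dtype=float)
    if len(points) <= target_size:
        return points
    radii = np.linalg.norm(points, axis=1)
    keep = np.argsort(radii)[:target_size]  # indices of the closest points
    return points[np.sort(keep)]            # preserve the original ordering
```

Trimming the farthest points first means that the discarded points are predominantly outliers, which is why the step tends to help the filled rooms in the distance metrics.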

Collection
For dataset capture, we utilized the free but closed source Polycam 3D capture application with a LiDAR-enabled iPhone 12 Pro due to the higher quality camera poses produced with this approach in comparison to non-LiDAR-enabled cameras. Though more recent models of iPhone have substantially higher quality cameras and LiDAR scanners, we still considered the photo and pose quality of the 12 Pro sufficient for evaluating our method. The images were 944 by 738 pixels, and we captured between two hundred and eight hundred images per room depending on the size of the room. We captured our test building in per-room segments due to the limited capacity of the LERF models at their normal scale and because it allowed us to have multiple samples for evaluation of the approach. It would be possible to run this same pipeline with larger-scale captures, as there is no theoretical limit to modeling outside of the capacity of the capture device and the size of the neural network. However, in practice, smaller scenes tend to be more stable in training and yield better point clouds. To capture the scenes, we walked through each room with LiDAR and video running, and allowed Polycam to record the camera positions directly from the iPhone. After optimizing the camera positions relative to the LiDAR and images, the raw data format was a series of images, each paired with the relevant camera pose at that moment. We then took the raw data from these captures and utilized the Nerfstudio data processing pipeline to place them in the format accepted by the original LERF implementation on which we built our approach (this did not involve any change in the images or poses themselves but involved removing blurred images as mentioned in the pre-processing section).

Results
Our experimental results were mixed. We were able to correctly identify and remove many complex objects in the 3D space without any human intervention. However, the underlying point clouds generated by the NeRFs (even of clean scenes) were not as precise as those created by traditional photogrammetry.

Quantitative Comparison
We use two metrics for the comparison of point clouds. The first is the Chamfer distance, a relatively common metric for point clouds which is often used as a loss function in point cloud generation tasks. The Chamfer distance between two sets of 3D points ($S_1$ and $S_2$) can be formalized as follows:

$$d_{CD}(S_1, S_2) = \frac{1}{|S_1|}\sum_{x \in S_1}\min_{y \in S_2}\lVert x - y\rVert_2 + \frac{1}{|S_2|}\sum_{y \in S_2}\min_{x \in S_1}\lVert x - y\rVert_2$$

The Chamfer metric between two sets is found by bidirectionally calculating the Euclidean distance between each point in one set and its nearest neighbor in the other set, and then taking the mean of these distances. Though Chamfer distance is a strong metric for local structures in a point cloud, it can be overly sensitive to outliers. In response to this problem, we also considered earth mover's distance (also known as the Wasserstein metric), another common point cloud similarity metric:

$$d_{EMD}(S_1, S_2) = \min_{\phi : S_1 \to S_2}\sum_{x \in S_1}\lVert x - \phi(x)\rVert_2$$

where $\phi$ is a bijection between the two sets. As demonstrated in Table 2, the Chamfer distance from the empty room point clouds to either the auto-cleaned or filled point clouds is mostly small, as is to be expected. However, the relative distance between the filled and auto-cleaned point clouds is not particularly dramatic: by Chamfer distance, the distinction between the two sets of point clouds averages out to be near zero (0.019014). This is likely due to the process of removal rather than inpainting. We are removing many points from the center of the point clouds (i.e., the furniture) which are by distance rather close to the walls/floor of the empty room point cloud. Since the points in the furniture are the nearest neighbors for many of the points in the walls or floors, they produce a low Chamfer distance from the 3D model of the physically empty room. In contrast, the auto-cleaned point clouds had a higher percentage of points which were outliers or outside of the room (see pre-processing), and also in some cases parts of the floor or wall were unmodeled where removed objects were flush to the floor or wall. Though these missing areas have negligible effect on our LOD conversion approach, they do increase the Chamfer distance between the empty room point cloud and the auto-cleaned point cloud. Since the Chamfer distance between the scenes can actually be increased by correctly removing the relevant furniture, we did not consider this a strong metric for this particular task. This result motivated us to look at a more holistic similarity metric: earth mover's distance. Comparatively, the distances calculated by EMD substantially favor the auto-cleaned rooms (see Table 3). By using a metric that is robust to outliers and also does not bias toward nearest neighbors but instead looks at the overall distribution of points, we can see that our auto-cleaned rooms are in fact far more similar to the empty rooms than the rooms filled with furniture.
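Both metrics can be sketched for small point clouds as follows. This is an illustrative implementation rather than the evaluation code used for Tables 2 and 3: it builds the full pairwise distance matrix, which is only practical for a few thousand points, uses unsquared Euclidean distances, and computes the earth mover's distance as an optimal one-to-one assignment with SciPy's `linear_sum_assignment`, which requires clouds of equal size. The function names are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def _pairwise_distances(a, b):
    """Euclidean distance between every point in a (N, 3) and b (M, 3)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def chamfer_distance(a, b):
    """Mean nearest-neighbor distance from a to b plus the same from b to a."""
    dist = _pairwise_distances(a, b)
    return float(dist.min(axis=1).mean() + dist.min(axis=0).mean())

def earth_movers_distance(a, b):
    """Total cost of the cheapest one-to-one matching between equal-size clouds."""
    dist = _pairwise_distances(a, b)
    if dist.shape[0] != dist.shape[1]:
        raise ValueError("this sketch requires point clouds of equal size")
    rows, cols = linear_sum_assignment(dist)  # optimal assignment
    return float(dist[rows, cols].sum())
```

The one-to-one matching is what distinguishes the two metrics in this setting: under Chamfer distance many wall points may share the same nearby furniture point as their nearest neighbor, whereas under the assignment-based distance each point can be matched only once, so a cluster of furniture points in the middle of the room must be matched to distant wall or floor points at a cost.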

Qualitative Comparison
Comparing our point clouds strictly by distance measures does not fully illuminate the efficacy of our approach-in this case, a qualitative analysis can be useful.Since this is only a simulated case of privacy preservation (i.e., we have permission to publish images of 3D models of the building interiors with their contents), we have included example images of the generated point clouds below-specifically focusing on one of the rooms for simplicity (room 7 on floor 0).
Examining first the point cloud modeled on the empty room (Figure 7), it is immediately evident that there are substantial disparities between this generated 3D model and the ground truth. These artifacts come in three forms: outlier points created to satisfy the images which contain the windows and the doors (such as in Figure 8), surface distortions where sections of the wall or floor are sunken or deformed, and the infamous "floaters" (small patches of points predicted in midair where there should be no objects). We will elaborate on various mitigation strategies for these in the Discussion section, but we find this to be a major drawback for all NeRF-based approaches in their current state. Because NeRFs are trained to optimize their similarity to images of the scene rather than on any underlying structural assumptions, they can be very effective at generating scene representations that look convincing when rendered as images or video. However, in practice a point can be incorrect in its 3D placement and, because of a simultaneous distortion of color, still render back to a convincing image of the scene (this is especially true when there are limited input image views of a particular area of the scene). Following from this flaw, we find that NeRFs are ineffective as a backbone when the precision of the underlying 3D model is the target rather than convincing images, at least for the time being. Though our post-generation LOD modeling techniques can still capture wall structures in many instances regardless of this drawback, it is still severely limiting in ensuring the accuracy of a fully automatic pipeline while NeRF techniques remain in their infancy.

Moving on from the relative quality of the empty room models, the LERF approach does appear to be successful in many instances at identifying and removing complex 3D objects, like those present in Figures 9 and 10. Many 3D native classification algorithms have a very limited number of possible classes, and thus dealing with such a diverse and long list of potential classes as listed in Table 1 is intractable for them; in this case, we find the ability to identify these complex objects to be relatively impressive. Despite this, there are instances of unremoved objects or places where an object was only partially removed, such as in Figures 11 and 12 (an artifact of the fact that classification occurs at a per-point level rather than through bounding boxes or discrete segmentations). There is also the issue of occlusion-based destruction of unrelated structures: as mentioned before, when a piece of furniture is either flush to a wall or the floor or occludes the wall or floor in a substantial way (such that there are few available views behind or below the object), the points behind the object are either never rendered or are removed incorrectly. This leaves holes in the point cloud which may be unsuitable for many downstream use cases. Despite these drawbacks, the ability of the LERF model to learn a relatively convincing 3D model and the associated linguistic features for the points from just images and poses is exciting and groundbreaking in many ways. As these techniques continue to improve, their efficacy in the automatic modeling of buildings and usefulness in heritage conservation will increase commensurately.
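The partial removals described above follow directly from deciding each point independently. The sketch below illustrates this under loud assumptions: it is not the relevancy scoring used by LERF or by our pipeline, but a simplified stand-in in which every point carries a hypothetical language embedding and is dropped when its cosine similarity to any query embedding reaches a threshold. The function names, the toy embeddings, and the threshold are all illustrative.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Point = Tuple[float, float, float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def remove_matching_points(
    points: Sequence[Point],
    point_embeddings: Sequence[Vector],
    query_embeddings: Sequence[Vector],
    threshold: float,
) -> List[Point]:
    """Keep only the points whose best similarity to any query is below threshold.

    Each point is judged on its own, with no notion of object boundaries.
    """
    kept = []
    for point, embedding in zip(points, point_embeddings):
        best = max(cosine_similarity(embedding, q) for q in query_embeddings)
        if best < threshold:
            kept.append(point)
    return kept
```

Because no bounding box or segment ties the points of an object together, two points on the same piece of furniture whose embeddings fall on either side of the threshold are treated differently, which leaves the object only partially removed.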

Discussion
Though these techniques are not yet fully mature, the automatic modeling and embedded understanding of historical building models by deep learning-enabled techniques is already proving to be a useful tool. They are currently limited by their scale, but we expect that, with further improvement of the underlying algorithms and the proliferation of high-memory GPU compute, these techniques will become able to model large buildings in their totality. In terms of expanding on this particular work, multiple avenues present themselves. One is to rely on other forms of capture (such as photogrammetry or laser scanning) for producing the underlying 3D models and then to embed the linguistic features after the fact. Similar techniques designed for other purposes, such as ConceptFusion [39] and CLIP-FO3D [40], would likely exhibit open set classification ability similar to that of LERFs. Though our ambition was to test the efficacy of novel generative methods in addition to utilizing the embedded classification ability, these projection-based techniques could be very useful when the accuracy of the underlying 3D model is paramount or where the object to be modeled is extremely large. Recently, language features have also been embedded into Gaussian splats (another technique for learning 3D representations from a set of poses and images), which also seems rather promising in terms of the quality of object segmentations [41]. A further solution would be to build a multimodal embedding model with a training regime similar to that of CLIP but which natively accepts 3D data and thus avoids the need to project linguistic information from images into 3D space. This would involve assembling large datasets, marshalling a huge amount of compute, and developing innovative deep learning models: a substantial effort for even the most dedicated of researchers.
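As a rough illustration of what embedding linguistic features "after the fact" could involve, the sketch below projects points of an existing 3D model into one image with a pinhole camera model and copies over the per-pixel feature found there. This is an assumption-laden simplification and not how ConceptFusion or CLIP-FO3D are implemented: the function names are hypothetical, the points are assumed to be already expressed in camera coordinates, the feature map is a plain nested list, and occlusion and multi-view fusion are ignored.

```python
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]
Point = Tuple[float, float, float]


def project_point(
    point_cam: Point, fx: float, fy: float, cx: float, cy: float
) -> Optional[Tuple[int, int]]:
    """Pinhole projection of a camera-frame point to integer pixel (u, v).

    Returns None for points at or behind the camera plane.
    """
    x, y, z = point_cam
    if z <= 0.0:
        return None
    u = int(round(fx * x / z + cx))
    v = int(round(fy * y / z + cy))
    return u, v


def lift_features(
    points_cam: Sequence[Point],
    feature_map: Sequence[Sequence[Vector]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Optional[Vector]]:
    """Assign each point the feature at the pixel it projects to.

    feature_map[v][u] holds the feature at row v, column u. Points that fall
    outside the image or behind the camera get None. Occlusion is not checked.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    lifted: List[Optional[Vector]] = []
    for point in points_cam:
        pixel = project_point(point, fx, fy, cx, cy)
        if pixel is None:
            lifted.append(None)
            continue
        u, v = pixel
        if 0 <= u < width and 0 <= v < height:
            lifted.append(feature_map[v][u])
        else:
            lifted.append(None)
    return lifted
```

In a full system, the features gathered from many views would need to be fused per point and visibility would need to be tested, for example against a depth map, before any open set query could be run on the result.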
There is also the issue of point cloud inpainting, an active area of research in heritage preservation [42], automatic building models [43], and computer science more broadly [44]. Through the novel view synthesis ability of the underlying NeRFs in our approach, we did generatively fill particular areas of our point cloud where there were not enough reference images to construct a strong representation using an approach like photogrammetry, yet there were still substantial holes in our 3D models. Filling these missing components of our point clouds with other neural network or propagation-based approaches to inpainting would likely have increased the realism of our automatically cleaned point clouds; this is an area for future work. As the integration of neural rendering and point cloud understanding techniques advances, it may be possible to inpaint more holistically and directly within the generation of the point cloud itself rather than as a post-processing step.

Conclusions
In this paper, we explored the capabilities of language-embedded radiance fields for the automatic generation of point clouds from images and for the removal of non-structural elements from those point clouds. By contextualizing our approach within the ongoing conservation of the Ancien Hôpital de Sion, we showed the current use and tremendous potential of point cloud understanding in the cultural heritage space. The sort of linguistically aware modeling techniques presented here have a strong ability to classify a wide range of objects with a minimal level of manual intervention, making privacy-preserving modeling of the interiors of at-risk buildings a realistic possibility. Due to the inherent difficulty of manually removing objects in the real world or within an already constructed 3D model, this research enables faster, simpler modeling of the interiors of buildings, contributing to better models for general conservation and structural modeling in particular. This in turn has implications for the ability to protect and cherish our cultural heritage for the long run. There are multiple extensions and areas of improvement for this approach, including more complete inpainting of the removed areas and more precise underlying modeling with NeRFs, Gaussian splats, or other linguistic projection techniques. Regardless, language-embedded radiance fields and their technological descendants will undoubtedly have an important impact on the field of cultural heritage preservation in years to come.

Figure 1. Historical photos of the old hospital of Sion [20]. (a) The hospital as photographed in the 1920s. (b) The situation of the hospital in the town of Sion (building with the bell tower on the bottom right).

Figure 3. A manually created structural model of our test case building before and after an earthquake simulation. Note that the structure is missing floors, many interior walls, and much of the facade detailing [17].

Figure 4. Automatic generation of LOD3 models using the framework of Pantoja-Rosero et al. [1]. (a) Point cloud generated with structure-from-motion; (b) clustered point cloud in planar primitives; (c) candidate faces produced by the intersection of planar primitives; (d) LOD2 model; (e) semantic segmentation of openings; (f) 2D coordinates of opening corners inferred from semantic segmentation; (g) 3D openings projecting 2D corner coordinates with camera poses; (h) LOD3 model merging LOD2 model and openings.

Figure 6. An overview of the full pipeline from captured images to automatically cleaned point cloud.

Figure 7. Point cloud of the 3D model of the empty room seen from a corner of the room.

Figure 8. Point cloud model of the room seen from above.

Figure 9. Point cloud of the filled room seen looking directly at some of the furniture (a desk with amp on the left, piano in the center, drumset and keyboard on the right).

Figure 10. LERF-generated point cloud of the filled room without automatic removal seen looking from the corner of the room (music stands, microphone, and bongos in the foreground with desks and chairs on the left side).

Figure 11. LERF-generated point cloud of the filled room with automatic removal from the same angle as Figure 9. The majority of points that made up the tables, speakers, and musical instruments have been removed. This is the same room which can be seen in Figure 12.

Figure 12. LERF-generated point cloud of the filled room with automatic removal from the same angle as Figure 10. The majority of points that made up the tables, speakers, and musical instruments have been removed. This is the same room which can be seen in Figure 11.

Table 1. The two canonical dictionaries used for our modeling.

Table 2. The Chamfer distance from the 3D model of the empty room to either the LERF-generated 3D model of the filled room or the LERF-generated 3D model of the automatically cleaned room. The bold values represent the superior metric between compared approaches, and the arrows indicate that lower values of the metric are desirable.

Table 3. The earth mover's distance from the 3D model of the empty room to either the LERF-generated 3D model of the filled room or the LERF-generated 3D model of the automatically cleaned room.