Article

Open-Vocabulary Segmentation of Aerial Point Clouds

1 3D Optical Metrology (3DOM) Unit, Bruno Kessler Foundation (FBK), 38123 Trento, Italy
2 Department of Information Engineering and Computer Science, University of Trento, 38123 Trento, Italy
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(4), 572; https://doi.org/10.3390/rs18040572
Submission received: 30 December 2025 / Revised: 6 February 2026 / Accepted: 10 February 2026 / Published: 12 February 2026
(This article belongs to the Special Issue GeoAI for Urban Understanding: Fusing Multi-Source Geospatial Data)

Highlights

  • A novel annotation-free open-vocabulary method for the 3D classification of large-scale aerial point clouds.
  • Comparison with state-of-the-art 3D deep learning (DL) methods.
What are the main findings?
  • Prior class definitions or annotated data are no longer necessary.
  • Metrics are almost on par with traditional DL methods.
What are the implications of the main findings?
  • Semantic understanding of 3D urban environments is feasible without training data.
  • Municipalities can classify their 3D urban environments into the desired classes using OV methods.

Abstract

The growing diversity and dynamics of urban environments demand 3D semantic segmentation methods that can recognize a wide range of objects without relying on predefined classes or time-consuming labelled training data. As urban scenes evolve and application requirements vary across locations, flexible, annotation-free 3D segmentation methods are becoming increasingly desirable for large-scale 3D analytics. This work presents the first training-free, open-vocabulary (OV) method for 3D aerial point cloud classification and benchmarks it against state-of-the-art supervised 3D neural networks for the semantic enrichment of these geospatial data. The proposed approach leverages open-vocabulary object recognition across multiple 2D images and subsequently projects and refines these detections in 3D space, enabling semantic labelling without prior class definitions or annotated data. In contrast, the supervised baselines are trained on labelled datasets and restricted to a fixed set of object categories. We evaluate all methods with quantitative metrics and qualitative analysis, highlighting their respective strengths, limitations and suitability for scalable urban 3D mapping. By removing the dependency on annotated data and fixed taxonomies, this work represents a key step toward adaptive, scalable and semantic understanding of 3D urban environments.

1. Introduction

3D scene understanding has recently gained considerable importance across applications including robotics, autonomous driving, urban planning, forestry and digital heritage site preservation [1,2,3,4,5,6,7,8,9,10]. Specifically, the fundamental task of semantic segmentation, which involves labelling 3D point cloud data to extract meaningful, object-level classifications, persists as one of the main challenges in the vision and geospatial communities. While the latest breakthroughs in supervised deep learning (DL) models have produced promising results on benchmark datasets like Semantic3D and KITTI [11,12,13,14,15,16,17,18], a significant and persistent bottleneck still exists in achieving real-world deployment robustness.
The primary limitation of supervised DL models is their innate dependency on massive, extensively annotated datasets for training purposes. For large-scale outdoor environments, such as urban scenes captured by LiDAR or aerial photogrammetry, this annotation procedure is exceptionally costly due to the total data volume, irregular data density, and fine-grained variety of object categories (e.g., distinguishing different types of street furniture or signage) [19,20,21]. More critically, DL models inherently fail to generalize beyond their predefined, closed set of trained classes. This leads to a severe degradation in performance, often referred to as the domain gap, when the model encounters novel scenes or previously unseen objects in a deployment setting. This generalization failure makes them unsuitable for dynamic, open-world applications where new classes appear frequently [22,23,24].
Recent research efforts are increasingly addressing this limitation by shifting towards open-vocabulary (OV) and zero-shot segmentation pipelines [25,26,27,28]. These methods aim to create models that are class-agnostic and can understand arbitrary concepts not seen during training. This paradigm shift is largely enabled by the rise of Vision-Language Models (VLMs), such as CLIP [29], which are pre-trained on vast amounts of image-text pairs [30,31,32,33]. By leveraging the comprehensive, open-set knowledge embedded within the VLM’s language space, researchers can transfer general object recognition capabilities from the 2D domain to 3D space. Most current implementations of this approach have focused on controlled, dense indoor scenes (e.g., ScanNet or Matterport3D), where point density is high and objects are relatively small [26,27,34,35,36,37,38,39].
However, the effective application of these training-free open-vocabulary methodologies to large-scale, unevenly distributed and complex urban point clouds remains largely unexplored and highly challenging. Urban point cloud data, acquired with LiDAR sensors or generated with photogrammetric methods, presents unique issues, including significant class imbalance (e.g., many more ground points than cars or pedestrians), varying levels of point density and large scale [40]. Therefore, it is non-trivial to directly adapt an indoor-focused OV pipeline to this challenging outdoor domain.
Inspired by these recent OV developments and the recognized limitations of current urban segmentation methods [41,42], this work explicitly addresses the need for annotation-less 3D classification of large-scale aerial point clouds by presenting a training-free, open-vocabulary pipeline that utilizes 2D image detection and subsequent 3D projection and refinement (Section 3.2). This method is compared to conventional, supervised 3D neural networks (Section 3.1). Our resulting analyses and extensive experiments (Section 5) on challenging urban datasets (Section 4) explore the advantages and inherent drawbacks of both supervised and OV approaches, ultimately highlighting the substantial potential of the latter to enhance segmentation performance across a significantly wider and more diverse range of urban classes. Qualitative results of the proposed OV pipeline across four urban scenes are shown in Figure 1.
The core contributions of this work are focused on bridging the gap towards annotation-less 3D classification in aerial urban environments and can be summarized as follows:
  • OV pipeline: We present a complete, training-free OV pipeline that leverages open-set object detection from multiple oriented 2D aerial images for robust semantic predictions. These are then triangulated and projected onto the point cloud, including a refinement procedure. The pipeline includes a detection and a segmentation model.
  • Comparative analysis: We report the first direct comparison between the proposed training-free OV approach and traditional, supervised 3D DL methods, evaluating performance, generalization capabilities and real-world utility for large-scale aerial photogrammetric point clouds over four different datasets.

2. Related Works

2.1. Three-Dimensional Deep Learning Segmentation Methods

In a nutshell, the most effective approaches for point cloud classification can be grouped into four categories [16]: multi-view-based, voxel-based, point cloud-based, and polymorphic fusion-based approaches (Figure 2).
From the first developments in deep learning for 3D point cloud segmentation, the essential challenges of unstructured 3D data were addressed: the lack of point ordering (requiring permutation-invariant operations) and inconsistent point density. The performance of the methods presented in the literature depends on the network architecture, the underlying assumptions and the given classes. Determining a single outstanding method is challenging due to factors such as dataset size, number and type of classes, object type and sensor characteristics; these variables should be considered when assessing the performance of different methods and the quality of their outcomes.
PointNet [43] succeeded by using symmetric functions to aggregate global features and achieve permutation invariance. PointNet++ [44] then followed, introducing a critical hierarchical approach for efficiently capturing local features through multi-scale grouping and set abstraction.
Subsequent research focused on adapting convolutions to point clouds: by applying convolutions with learnable kernel points over local neighborhoods, KPConv [45] demonstrated robustness to varying point densities. Additional methods restructure the data for efficiency or rely on alternative representations: the Minkowski Engine [46] provides efficient sparse 3D convolutions, and VoxelNeXt [47] adopts volumetric representations (voxelization). Other highly effective approaches transform the point cloud into a graph structure for local context encoding, such as the Dynamic Graph Convolutional Neural Network (DGCNN) [48], which employs dynamic graph construction, and subsequent graph-based methods like GTNet [49]. More recent studies overcome the limitations of purely local kernels by combining transformer architectures and self-attention mechanisms to capture long-range dependencies. By modelling intricate global interactions, architectures such as Point Transformer [50], Superpoint Transformer [51], and Mask3D [52] achieve state-of-the-art results. In large-scale applications, techniques like RandLA-Net [53], which typically serve as effective and reliable supervised baselines, combine random point sampling with lightweight feature aggregation. Recently, transformer-based methods like SAPFormer [54] were proposed to flexibly capture the semantic information of point clouds in geometric space and effectively extract contextual geometric information.

2.2. Zero-Shot and Open-Vocabulary Segmentation Methods

Supervised DL methods are limited by the need for large amounts of costly annotated training data, which is particularly challenging with 3D data such as point clouds. This dependency challenge is not only about annotation costs but also limits the models’ performance to the trained labels, preventing generalization to new classes. To overcome this, zero-shot and open-vocabulary methods were introduced. In computer vision, these models connect vision and language to recognize a wide array of classes and concepts unseen during training. In the image domain, this capability is often realized by training on paired text and image data. CLIP [29] is the foundational example, utilizing dual encoders to align text and vision embeddings in a shared space (a vision-language model). The ensuing proliferation of VLMs [55,56] applies this linkage across tasks, including detection and segmentation. Specifically, detection models like GLIP [57], OV-DETR [58], and Grounding DINO [59] now locate objects based on text queries. For segmentation, models like Mask DINO [60], OV3D [28], or Sa2VA [61] enable open-vocabulary segmentation, processing novel classes simply via a text input, with some methods, like OpenSeg [62] and LSeg [63], achieving this through the integration of per-pixel CLIP embeddings.
In the 3D domain, recent classification methods are also moving toward Open-Vocabulary (OV)-based segmentation. Due to the limited availability of 3D data compared to 2D, and the fact that these models are trained on an extremely large amount of data (internet scale), 3D methods are highly reliant on the power of previously mentioned 2D models. OpenScene [26] leverages the OV 2D segmentation model OpenSeg [62] for 3D scene segmentation. Similarly, ConceptFusion [64] combines SAM (Segment Anything Model) [65] with CLIP [29] to segment 3D scenes. Some methods use a class-agnostic 3D instance segmentor to generate masks, subsequently applying OV 2D models. OpenMask3D [36] is an example that uses Mask3D [52] to extract 3D masks. These masks are then projected onto the images from the scene, and 2D models are used to extract CLIP embeddings for each 3D mask, which allows for later querying and segmenting of the desired objects in the 3D scene. Search3D [66] builds a hierarchical OV 3D scene representation, enabling the search for entities at varying levels of granularity: fine-grained object parts, entire objects, or regions described by attributes like materials. Alami and Remondino [67] presented a training-free and flexible method for indoor 3D point cloud segmentation using 2D OV models and geometric features. The method detects queried objects in images using 2D detectors such as YOLO-World [68] and Grounding DINO, projects the masks to 3D, and refines them with XGBoost-guided region growing. Crucially, it does not use dataset-specific training and operates directly on the surveyed scene.
In urban and aerial contexts, the applicability of OV methods is also rapidly expanding. HAECity [69] generates RGB images from 3D scenes and applies OpenSeg [62] to extract CLIP features, which are then projected back to 3D. These features are used to create pseudo-labels for training a hierarchical vocabulary-agnostic Expert Clustering model, built on a superpoint graph clustering–based Mixture of Experts, enabling efficient OV understanding of large-scale point clouds. Another noteworthy work, OpenCity3D [8], integrates models such as SigLIP [56] and SAM to extend urban scene understanding beyond conventional segmentation tasks. In addition to extracting standard classes like buildings, it demonstrates the capability to infer higher-level attributes, such as estimating building age. Furthermore, other approaches apply combinations of 2D OV models, such as Grounding DINO and SAM, to LiDAR data for open-vocabulary object detection [70], highlighting the growing applicability of OV methods in large-scale geospatial analysis.
Despite recent progress, these urban OV approaches are often not fully training-free or primarily focus on large structural classes rather than diverse object types. This leaves a gap for a purely inference-based pipeline capable of flexible, fine-grained segmentation across complex aerial environments.

3. Semantic Segmentation Methods

3.1. Three-Dimensional Deep Learning Segmentation Methods

Due to their outstanding performance in various works and on multiple datasets [40,45,50,51,71,72,73,74,75], the following 3D DL methods are selected:
  • KPConv (https://github.com/HuguesTHOMAS/KPConv, accessed on 9 February 2026): The architecture utilized 15 kernel points with an input radius of 18.0 m. The initial subsampling distance was 0.3 m, and the convolution radius was 2.5 m. Training was conducted for 250 epochs with a batch size of 3. Weighted Cross-Entropy Loss was used. The Stochastic Gradient Descent (SGD) optimizer was used with an initial learning rate of 1 × 10−3, and momentum was 0.98. The learning rate followed an exponential decay schedule, reducing the rate by a factor of 10 every 100 epochs.
  • PointNet++ (PN++) (https://github.com/yanx27/Pointnet_Pointnet2_pytorch, accessed on 9 February 2026): Scenes were tiled into 6 × 6 m2 to 10 × 10 m2 sections (4096 points per tile). Coordinates were normalized (x, y to unit square; z shifted to tile minimum). The model was trained for 100 epochs using the Adam optimizer. A cyclic learning rate schedule was applied, ranging between 1 × 10−6 and 1× 10−3 (step size 1000), with cyclic momentum disabled. Weighted Cross-Entropy was used as the loss function. The epoch with the highest Intersection over Union (IoU) on the validation set was selected for testing.
  • Point Transformer (https://github.com/Pointcept/Pointcept, accessed on 9 February 2026) V1 (PTv1) and V3 (PTv3): The learning rate was initially set to 5 × 10−3, with a momentum set to 0.9, and weight decay set to 1 × 10−4 using the Adam optimizer. Weighted Cross-Entropy Loss was used. A warmup phase was employed for the first 5 epochs. The learning rate then followed a step-wise decay schedule based on training progress after warmup, decaying by a factor of 0.5, 0.25, and 0.1 at 50%, 70%, and 90% of the remaining total epochs, respectively (a minimal sketch of this schedule is given after this list). Model selection was based on the highest IoU on the validation set.
  • Superpoint Transformer (SPT) (https://github.com/drprojects/superpoint_transformer, accessed on 9 February 2026): Scenes were partitioned using superpoints over tiles of approximately 300 × 400 m2. Training ran for 2000 epochs with a batch size of 2. Optimization used AdamW with an initial learning rate of 5 × 10−3. Weighted Cross-Entropy Loss was applied. The final model was selected based on the best validation IoU.
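As an illustration of the warmup and step-wise decay schedule described above for PTv1/PTv3, the following minimal Python sketch computes the per-epoch learning-rate multiplier. It is not the exact Pointcept configuration; the total epoch count and the commented PyTorch usage are illustrative assumptions.

```python
# Illustrative sketch (not the exact Pointcept configuration) of the
# warmup + step-wise decay schedule described above for PTv1/PTv3.
def lr_factor(epoch, total_epochs=100, warmup_epochs=5):
    """Return the multiplier applied to the base learning rate (5e-3)."""
    if epoch < warmup_epochs:
        # Linear warmup towards the base learning rate.
        return (epoch + 1) / warmup_epochs
    # Progress through the epochs remaining after warmup.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    if progress < 0.5:
        return 1.0
    elif progress < 0.7:
        return 0.5
    elif progress < 0.9:
        return 0.25
    return 0.1

# Usage with a PyTorch optimizer (hypothetical optimizer object):
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
```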
These methods are later used in a supervised manner (Section 5) to benchmark the proposed annotation-free methodology (Section 3.2). Note that the reported hyperparameters were selected using standard or default values for each architecture, with adjustments made due to computational resource constraints (e.g., batch size and data partitioning). The primary objective of this work was to establish a baseline for comparison, not to achieve peak performance through exhaustive hyperparameter tuning.

3.2. Open-Vocabulary 3D Segmentation Methods

The methodology proposed in Alami and Remondino [67] for indoor scenarios is adapted and modified for processing urban aerial point clouds derived from high-resolution nadir and oblique imagery. To adapt the method for large-scale urban environments, three key modifications are introduced:
  • Tiling strategy (Section 3.2.1): Given the large resolution of aerial imagery, we apply an image partitioning strategy. This adapts the data for standard model architectures, while preserving the resolution necessary to detect small urban objects.
  • Adaptive thresholding (Section 3.2.2): A fixed confidence threshold often suppresses rare or difficult classes, which naturally receive lower scores. We implement a class-adaptive strategy that dynamically adjusts the limit: it maintains a high threshold for common, high-confidence objects (like buildings) to reduce noise, while lowering the threshold for rarer objects to preserve their detections.
  • Optimized projection and refinement (Section 3.2.3, Section 3.2.4 and Section 3.2.5): We modified the projection step specifically to increase accuracy in complex urban geometries, followed by a multi-view voting and refinement step to reduce error and noise. Because the original refinement method is computationally too expensive for massive point clouds, we replaced it with a simpler, lightweight noise reduction step that efficiently cleans the final results.
Through these steps and modifications, we establish a fully training-free pipeline capable of targeting a wide spectrum of urban classes across diverse scenes.
The method assumes a dense point cloud and the corresponding aerial images are accurately co-registered, i.e., with known interior and exterior camera parameters. The method uses a modular structure (Figure 3) to allow two types of 2D open-vocabulary (OV) models to be applied, i.e., a detection-based model like Grounding DINO (G-DINO) and a segmentation-based model like Sa2VA.

3.2.1. Tiling Strategy

Due to the extremely large scale of urban aerial imagery and its centimeter-level ground sample distance (GSD), as well as the computational cost of processing full images, an image partitioning strategy is used. Images are subdivided into appropriate-sized tiles, processed separately, and then integrated. This tiling approach is critical for capturing small features that might be lost during downsampling while remaining computationally feasible. Choosing the right tile dimension is critical and scene-dependent, as too small tiles can fragment large structures, potentially causing errors (e.g., sections of a building facade being misidentified as a street). To find the optimal size, we balanced the need to keep small objects large enough to be recognized by the model after resizing, while ensuring that larger structures remain mostly intact within a single tile for better context. While an object-specific adaptive tiling strategy, such as using smaller tiles for smaller classes like vehicles and larger dimensions for larger classes like buildings, could potentially enhance fine-grained detections, in this work we employed a fixed tile size. This selection was refined through empirical testing to ensure the best detection performance across different object scales. When tiles are created, queries for specific classes are performed using dedicated prompt engineering.
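A minimal sketch of the fixed-size tiling with overlap is given below, assuming NumPy image arrays; the tile size and overlap values are illustrative (Section 5.1 reports 1024 × 1024 px tiles with a 256 px overlap for the small classes).

```python
import numpy as np

def tile_image(image: np.ndarray, tile: int = 1024, overlap: int = 256):
    """Split an aerial image (H, W, C) into overlapping tiles.

    Returns a list of (tile_array, (row_offset, col_offset)) so that 2D
    detections can later be mapped back to full-image pixel coordinates.
    """
    h, w = image.shape[:2]
    step = tile - overlap
    tiles = []
    for r in range(0, max(h - overlap, 1), step):
        for c in range(0, max(w - overlap, 1), step):
            # Clamp the last tiles so they stay inside the image.
            r0, c0 = min(r, max(h - tile, 0)), min(c, max(w - tile, 0))
            tiles.append((image[r0:r0 + tile, c0:c0 + tile], (r0, c0)))
    return tiles
```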

3.2.2. Adaptive Threshold

Once images and class queries (e.g., buildings, trees, cars, etc.) are given, adaptive thresholding based on the distribution of raw prediction scores is applied. This mechanism ensures that only the most confident and contextually relevant OV predictions are retained, preventing the misclassification of ambiguous or irrelevant scene elements in the complex urban environment. To maximize recall, especially for detection models, the initial confidence threshold is set quite low or removed entirely to capture all possible detections. The raw prediction scores Si,c for all instances i belonging to a specific class c are accumulated across all processed tiles and stored in a detection bank (see Figure 3). A class-specific threshold τc is then computed based on the statistical distribution of these accumulated scores. In our case, the mean score is used and provides a robust, per-class filter (see also the ablation study in Section 7.2); however, other values can be used, such as the median:
τc = E[S | C = c] = (1 / |Dc|) Σi∈Dc Si,c
where Dc is the set of all raw predictions for class c retrieved from the detection bank. A prediction is accepted as valid if and only if its confidence score is greater than or equal to the class-specific threshold (τc): Si,c ≥ τc.
This filtering step is applied to the predictions from both model categories:
  • Detection model: The filtered bounding boxes and corresponding image regions are then fed to SAM to generate precise segmentation masks.
  • Segmentation model: Sa2VA is modified by explicitly removing its built-in thresholding step, obtaining continuous per-pixel scores Si,c. The adaptive thresholding process is then applied to these scores. This adaptive approach ensures that points below the threshold are classified as unknown/other, preventing the forced classification of the entire scene into queried classes when irrelevant objects are present.
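The following minimal sketch illustrates the class-adaptive filtering over the detection bank using the mean score as τc; the dictionary-based detection format is an assumption made only for illustration.

```python
from collections import defaultdict
import numpy as np

def adaptive_filter(detections):
    """Keep only predictions whose score reaches the per-class mean threshold.

    `detections` is a list of dicts like {"class": "car", "score": 0.31, ...}
    accumulated over all processed tiles (the detection bank).
    """
    bank = defaultdict(list)
    for det in detections:
        bank[det["class"]].append(det["score"])
    # tau_c = mean of the raw scores of class c (the median could be used instead).
    tau = {c: float(np.mean(scores)) for c, scores in bank.items()}
    return [det for det in detections if det["score"] >= tau[det["class"]]]
```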

3.2.3. Projection

The segmented images are then projected onto the point cloud using the known camera parameters of the photogrammetric dataset. While in Alami and Remondino [67] the point cloud is voxelized and ray casting is applied, here we modified the process to better handle large-scale urban scenes. In these larger environments, the necessary use of wider voxels (e.g., 20–50 cm) can cause occlusion artifacts and projection inaccuracies that were less noticeable in smaller, denser indoor datasets. To address this, a coarse voxelization and ray casting are first used to identify visible regions. Then, the points within these voxels, along with their local spatial neighbors, are selected and projected back onto the segmented images. This adaptation establishes a direct pixel-to-point connection for label assignment.
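A minimal sketch of the pixel-to-point label transfer for one segmented image is given below, assuming a pinhole camera model with known intrinsics K and world-to-camera pose (R, t) and ignoring lens distortion; it covers only the back-projection step, not the coarse voxel visibility check.

```python
import numpy as np

def project_points(points, K, R, t, label_map):
    """Assign 2D labels to candidate 3D points visible in one segmented image.

    points:    (N, 3) candidate points from the visible coarse voxels
    K:         (3, 3) camera intrinsics
    R, t:      world-to-camera rotation (3, 3) and translation (3,)
    label_map: (H, W) per-pixel class ids from the 2D OV model
    Returns an (N,) array of labels (-1 where a point falls outside the image
    or behind the camera).
    """
    cam = points @ R.T + t                     # world -> camera coordinates
    labels = np.full(len(points), -1, dtype=int)
    in_front = cam[:, 2] > 0
    uvw = cam[in_front] @ K.T                  # camera -> homogeneous pixel coordinates
    uv = (uvw[:, :2] / uvw[:, 2:3]).round().astype(int)
    h, w = label_map.shape
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    idx = np.flatnonzero(in_front)[valid]
    labels[idx] = label_map[uv[valid, 1], uv[valid, 0]]
    return labels
```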

3.2.4. Multi-View Voting

Since a photogrammetric reconstruction ensures that each 3D point appears in multiple images, the final label assignment employs a voting scheme based on the most frequently occurring label across all projections. In this approach, the final label is determined by a majority consensus; for example, if a point is identified as ‘street’ in three images but ‘tree’ or ‘grass’ in two others, it is assigned the ‘street’ label. This voting process helps correct errors caused by either 2D model misdetections or small projection offsets. The effectiveness of this multi-view consistency is further demonstrated in our ablation study (Section 7).
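A minimal sketch of the majority vote is shown below, assuming each 3D point has already accumulated one candidate label per image in which it is visible, with -1 marking invalid projections.

```python
from collections import Counter

def majority_vote(per_point_labels):
    """per_point_labels: one list of candidate labels per 3D point.

    Returns the most frequent label per point, or -1 (unknown) when a point
    received no valid projection.
    """
    final = []
    for candidates in per_point_labels:
        candidates = [c for c in candidates if c != -1]
        final.append(Counter(candidates).most_common(1)[0][0] if candidates else -1)
    return final
```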

3.2.5. Refinement

The final refinement step focuses on noise reduction of potential misclassifications to produce the final semantically enriched aerial point cloud. This step is grounded in the geometric consistency assumption: We assert that a point and its neighbors that share the same or similar geometry likely belong to the same semantic group. For computational efficiency in dense aerial urban datasets, we avoid the computationally intensive calculation of geometric features at multiple radii as used in Alami and Remondino [67]. Although their method improves results, it is computationally expensive for large-scale urban scenes. Additionally, their approach assumes a 3D scene with limited images, whereas our photogrammetric data ensures that points are visible from multiple images. Therefore, instead of complex processing, we employ a lightweight smoothing step: For each point, we compare its label with those of its K-nearest neighbours while examining a minimal set of local geometric features (linearity, verticality, and surface change). Points whose labels differ from geometrically similar neighbours are reassigned to the majority neighbour label; otherwise, labels remain unchanged. This geometry-informed voting stabilizes the final 3D classification.
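A minimal sketch of this geometry-informed smoothing is shown below, assuming the per-point geometric features are precomputed and using scikit-learn's KDTree for the neighbour search; the neighbourhood size and feature tolerance are illustrative values.

```python
import numpy as np
from sklearn.neighbors import KDTree

def refine_labels(xyz, labels, features, k=15, feat_tol=0.2):
    """Reassign a point's label to the neighbourhood majority when its
    geometrically similar neighbours disagree with it.

    xyz:      (N, 3) point coordinates
    labels:   (N,) labels after multi-view voting
    features: (N, F) local geometric features (linearity, verticality, ...)
    """
    tree = KDTree(xyz)
    _, nn = tree.query(xyz, k=k + 1)           # first neighbour is the point itself
    refined = labels.copy()
    for i, neigh in enumerate(nn[:, 1:]):
        # Keep only neighbours with similar local geometry.
        similar = neigh[np.linalg.norm(features[neigh] - features[i], axis=1) < feat_tol]
        if similar.size == 0:
            continue
        vals, counts = np.unique(labels[similar], return_counts=True)
        majority = vals[np.argmax(counts)]
        if majority != labels[i] and counts.max() > similar.size / 2:
            refined[i] = majority
    return refined
```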

4. Datasets, Classes and Metrics

For the testing and validation of the proposed procedure (Section 3.2) in comparison to traditional supervised methods (Section 3.1), the following datasets are used (Table 1):
  • STPLS3D [71], in particular two sub-sets:
    -
    University of Southern California (USC)
    -
    Residential area (RA)
  • Hessigheim 3D [76]
  • Graz [77]
For the STPLS3D University of Southern California (USC) and STPLS3D Residential Area (RA) scenes, the following classes are used: Vehicle, Pole, Tree, Building, Impervious Surface (road), Fence, and Grass/Dirt. Unlabeled points are classified as Others by Open-Vocabulary methods, whereas traditional DL models typically use the Clutter class. Our proposed method, based on Sa2VA, is not intended to identify Clutter or Others, so its results for this category are excluded from results in Table 2 and Table 3.
In the Hessigheim 3D dataset, we predicted Low Vegetation, Impervious Surface (road), Vehicle, and Soil/Gravel classes. Facade, Roof, and Chimney are combined into a single Building category, Shrub is merged with Tree, and Vertical Surface, Urban Furniture, and Unknown points are categorized as Others. This grouping is necessary because of the nadir perspective of the available imagery, which provides poor visibility of vertical elements (such as facades and chimneys), making them difficult to detect using Open-Vocabulary methods. It is worth noting that the nadir images and the point cloud over the Hessigheim area were acquired approximately one year apart.
For the Graz dataset, the considered classes include Facade, Roof, Tree, Grass, Vehicle, and Impervious Surface (streets/pavements).
In all datasets, to provide a thorough and reliable evaluation, we use the Intersection-over-Union (IoU) and F-1 Score metrics, which measure the spatial correspondence between predictions and ground truth annotations and capture the balance of precision and recall, respectively.
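Per class, these metrics follow directly from the confusion counts, IoU = TP / (TP + FP + FN) and F-1 = 2TP / (2TP + FP + FN); a minimal sketch for computing both from ground-truth and predicted label arrays is given below.

```python
import numpy as np

def per_class_metrics(gt, pred, num_classes):
    """Return (iou, f1) arrays of length num_classes from ground-truth and
    predicted per-point label arrays."""
    iou, f1 = np.zeros(num_classes), np.zeros(num_classes)
    for c in range(num_classes):
        tp = np.sum((gt == c) & (pred == c))
        fp = np.sum((gt != c) & (pred == c))
        fn = np.sum((gt == c) & (pred != c))
        iou[c] = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        f1[c] = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return iou, f1
```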

5. Results

5.1. STPLS3D—University of Southern California

Table 2 and Figure 4 present the results for the University of Southern California (STPLS3D-USC) dataset. For Grounding DINO, we queried terms including vehicle, pole, tree, building, road, dirt, grass, and fence. In contrast, the Sa2VA-based method employed more detailed descriptions and prompts without requiring additional fine-tuning. The conventional DL models were trained on other scenes from the STPLS3D dataset, specifically the Residential Area (RA), Orange County Convention Center (OCCC), and Wrigley Marine Science Center (WMSC), and evaluated on the USC testing area.
Table 2. Metric results on STPLS3D—USC dataset. (*) merged; (**): clutter and unknown. Each cell reports IoU / F-1.

| Method | Building | Vegetation | Vehicle | Poles | Fence | Road/Imp. Surface | Grass/Dirt (*) | Other (**) |
|---|---|---|---|---|---|---|---|---|
| KPConv | 86.42 / 92.71 | 79.59 / 88.63 | 31.40 / 47.79 | 41.08 / 58.24 | 14.03 / 24.61 | 61.92 / 76.48 | 23.82 / 38.48 | 9.69 / 17.67 |
| SPT | 69.46 / 81.98 | 66.96 / 66.96 | 9.61 / 17.54 | 3.74 / 7.21 | 5.37 / 10.18 | 57.35 / 72.89 | 27.97 / 43.71 | 12.65 / 22.46 |
| PN++ | 73.15 / 84.49 | 76.16 / 86.46 | 14.15 / 24.79 | 19.41 / 32.51 | 6.60 / 12.38 | 71.96 / 83.69 | 65.08 / 78.84 | 15.34 / 26.61 |
| PTv1 | 78.66 / 88.06 | 67.51 / 80.61 | 10.37 / 18.80 | 8.61 / 15.86 | 3.59 / 6.93 | 61.37 / 76.06 | 23.63 / 38.23 | 11.03 / 19.87 |
| PTv3 | 85.05 / 91.92 | 79.28 / 88.45 | 18.89 / 31.78 | 12.85 / 22.78 | 4.11 / 7.90 | 63.88 / 77.96 | 41.44 / 58.60 | 13.27 / 23.43 |
| Ours (G-DINO) | 75.14 / 85.80 | 64.90 / 78.71 | 9.36 / 17.12 | 3.46 / 6.68 | 3.97 / 7.64 | 43.88 / 61.04 | 23.13 / 37.57 | 1.58 / 3.11 |
| Ours (Sa2VA) | 82.43 / 90.37 | 70.55 / 82.73 | 28.46 / 44.31 | 6.16 / 11.60 | 6.61 / 12.39 | 53.08 / 69.35 | 35.62 / 52.52 | – / – |
Results indicate that Sa2VA performed noticeably better than the popular Grounding DINO + SAM configuration, particularly in classes with wide, continuous surfaces like grass and streets. Unlike standard open-vocabulary 2D models that use bounding boxes to identify these classes before refining with the Segment Anything model, Sa2VA generates per-pixel segmentations directly from text prompts. Furthermore, Sa2VA leverages vision-language models (LLaVA), which enhance its scene comprehension capabilities.
To better detect poles and fences, we employed a tiling strategy with Sa2VA (1024 × 1024 pixels with 256-pixel overlap) due to their small size relative to the large image dimensions. Since Grounding DINO’s detection-based architecture can identify multiple objects with a single query and naturally generates precise bounding boxes for individual objects, even when they are small, this tiling approach is not required for that model. Our analysis revealed that Sa2VA’s internal image resizing can cause poles to disappear completely, while fences often merge with adjacent classes like buildings, impervious surfaces, or grass. Consequently, we apply tiling exclusively for these two challenging classes. Larger objects such as buildings, trees, and vehicles remain clearly visible in full-resolution images, where tiling actually proves counterproductive. Partial building segments in small tiles are frequently misclassified as impervious surfaces or other categories. The results clearly indicate that open-vocabulary approach performance strongly depends on the selected 2D model. Accuracy varies according to both object size and prompt phrasing. Class definitions significantly impact outcomes. For instance, querying “road” to detect impervious surfaces may fail to identify pavements, while using “pavement” might overlook actual road areas. The term “impervious surface” itself often creates ambiguity for the model, resulting in decreased performance. Furthermore, discrepancies arise between the OV model’s visual interpretation and the point cloud annotation definitions. The OV models occasionally identify visually reasonable features, such as rooftop protective walls or balconies as “fence,” or palm tree trunks as “pole”, which are actually labeled as “building” or “tree” in the ground truth annotations (see Figure 3 and Figure 4).

5.2. STPLS3D—Residential Area

In the STPLS3D-RA scene, the same process as STPLS3D-USC was applied for the OV methods. For the conventional DL models, these were trained on the USC, OCCC, and WMSC scenes and tested on RA. As Table 3 and Figure 5 show, the Sa2VA method outperforms Grounding DINO; however, it still does not surpass the DL models. One reason is the discrepancy between the point cloud and images of the scene. For moving objects like cars, there are instances where vehicles appear in the street images but are absent from the point cloud. Additionally, in this point cloud scene, most of the vegetation (particularly palm tree foliage) is not reconstructed. Therefore, while models can detect vegetation in the images, these detections are projected onto incorrect locations in the point cloud due to the missing geometry. Moreover, in this RA scene, unlike USC, the OV methods did not accidentally detect small structural features like rooftop walls or balconies as “fence,” reflecting the different building style. However, the model did make a detection error when some bushes and vegetation in front of the houses were incorrectly identified as fence but are labeled as vegetation in the dataset.
Table 3. Metric results on STPLS3D—RA dataset. (*) merged; (**): clutter and unknown. Each cell reports IoU / F-1.

| Method | Building | Vegetation | Vehicle | Poles | Fence | Road/Imp. Surface | Grass/Dirt (*) | Other (**) |
|---|---|---|---|---|---|---|---|---|
| KPConv | 87.84 / 93.52 | 75.41 / 85.98 | 75.46 / 86.02 | 44.63 / 61.72 | 24.94 / 39.92 | 82.99 / 90.70 | 34.60 / 51.40 | 5.83 / 11.01 |
| SPT | 85.84 / 92.38 | 72.46 / 84.03 | 65.44 / 79.11 | 40.12 / 57.27 | 14.36 / 25.12 | 82.80 / 90.59 | 56.42 / 72.14 | 5.38 / 10.20 |
| PN++ | 47.18 / 64.11 | 66.02 / 79.53 | 57.81 / 73.26 | 13.35 / 23.56 | 3.36 / 6.49 | 85.04 / 91.92 | 48.40 / 65.23 | 1.87 / 3.67 |
| PTv1 | 57.14 / 72.73 | 59.76 / 74.81 | 44.07 / 61.18 | 25.91 / 41.16 | 10.97 / 19.77 | 70.45 / 82.66 | 30.96 / 47.28 | 1.52 / 3.00 |
| PTv3 | 73.18 / 84.51 | 72.55 / 84.09 | 67.40 / 80.53 | 37.98 / 55.05 | 18.28 / 30.91 | 74.38 / 85.31 | 36.54 / 53.52 | 1.93 / 3.79 |
| Ours (G-DINO) | 81.08 / 89.55 | 52.63 / 68.96 | 38.11 / 55.19 | 13.91 / 24.42 | 13.38 / 23.60 | 64.14 / 78.15 | 20.59 / 34.15 | 0.46 / 0.92 |
| Ours (Sa2VA) | 83.94 / 91.27 | 59.97 / 74.97 | 40.20 / 57.34 | 6.62 / 12.42 | 31.51 / 47.92 | 66.88 / 80.15 | 41.89 / 59.04 | – / – |

5.3. Hessigheim 3D

For Hessigheim 3D, results are consistent with those on STPLS3D (Table 4 and Figure 6), with the 3D DL models achieving generally higher performance than the proposed OV method across all classes. For this dataset, the available nadir images are downscaled to 1362 × 1024 pixels; the original images are too large to process at full resolution, and tiling was avoided to preserve context. Indeed, unlike STPLS3D, where downscaling can make some classes harder to detect due to a more complex scene, in Hessigheim 3D the classes remain visible and easily detectable after downscaling, making tiling unnecessary.
Moreover, some labels are inherently difficult for 2D models to interpret from the available images. For example, detecting building façades from nadir imagery is challenging, and abstract classes such as “urban furniture” can encompass a wide range of objects. Prompting for every possible item in such a category is both time-consuming and computationally expensive, while using a general prompt like “urban furniture” results in low detection rates and frequent failures. In general, for such cluttered environments, the most visually distinct and semantically simple classes are well detected by the 2D model.
Unlike the STPLS3D dataset, where deep models are trained on different scenes and tested on unique scenes (introducing a domain gap, as models are trained on diverse scenes like a convention center (OCCC) and a marine facility (WMSC) but tested on a completely different one, such as a university campus (USC) or a residential area (RA)), the Hessigheim dataset uses the same large-scale scene separated for training, validation, and testing. This consistency improves the DL methods’ performance. Additionally, the dataset contains ambiguous class labels (as mentioned above), which makes detection harder for OV methods. The temporal differences between the images and 3D scene acquisition, combined with the nadir viewing angle, add further obstacles for OV approaches.

5.4. Graz

Results for the Graz dataset are reported in Table 5 and Figure 7. Similar to other experiments, the OV approach is ideal for clear, unambiguous classifications, while 3D DL produces clean and precise predictions for most classes. Notably, Graz features a different architectural style compared to other datasets. Due to the historical town design, buildings are continuous rather than separated. Unlike the other datasets where buildings appear as distinct, separate cubic shapes, Graz’s connected building structures present additional challenges for detection. Furthermore, in this dataset, we queried for façades and roofs separately rather than just “building.” As shown in the results, Grounding DINO tends to classify most building structures as façades and struggles with roof detection. On the other hand, the segmentation based on Sa2VA achieved better visual and metric results, almost on par with traditional DL methods.

6. Discussion

For the traditional DL methods, we utilized effective hyperparameter configurations that are well established in the literature for 3D urban segmentation. While an exhaustive search for the optimal parameters could potentially improve performance, it was beyond the scope of this paper, and the current settings represent a strong and standard baseline. This mirrors our approach with the OV method, where we similarly avoided extensive, scene-specific prompt engineering to maintain a realistic and generalizable comparison. The primary performance gap observed is effectively structural rather than parametric. A significant downside of DL models is their reliance on subsampling the data, which can reduce the scene's resolution, whereas the proposed OV-based method, which only projects the classes inferred in image space onto the point cloud, does not inherently require this. This subsampling, and the equivalent tiling strategy in deep models, can cause them to miss fine-grained details, potentially leading to an increase in unclassified points, as observed in the USC and RA scenes.
Moreover, class imbalance plays an important role. While techniques like weighted or focal loss can mitigate this issue, under-represented classes remain difficult for traditional DL models to detect and segment accurately [72,78,79]. Models like SuperPoint Transformer, which rely on an initial superpoint generation stage, are also highly dependent on this initial clustering. This requires significant testing and parameter adjustment, with no guarantee of generalization to new scenes. In this respect, OV methods are simpler as they can be applied to a variety of classes without extensive parameter tuning or the need for scene-specific training data.
On the other hand, the biggest challenges for OV-based methods are prompt engineering, very precise camera parameters and the absence of moving objects (depicted differently in the images and point cloud). The quality of the textual query can have a significant impact on the segmentation results. Small errors in the camera poses are reflected in the projected 2D inferences, which will match incorrect 3D points. Temporal differences between image and point cloud acquisition directly introduce errors into the final predictions. Figure 8 shows an example of these errors in the Hessigheim data, where cars visible in the images but absent from the point cloud resulted in false detections. Furthermore, issues also occur with points like vegetation: Missing leaves due to smoothing in the photogrammetric process often cause points belonging to structures or paved surfaces to be incorrectly labelled as trees after the 2D detections are projected onto the point cloud.
Finally, the main advantage of OV methods is the elimination of labor-intensive annotation (Figure 9). While prompt engineering could be seen as a costly operation, it typically requires only a few minutes. Moreover, this effort depends heavily on the employed OV model and scene to be queried. Recent architectures perform well with almost no extra tuning. In fact, in our experiments, we did not perform complex prompt engineering and kept the queries as simple as possible. On the other hand, the OV pipeline can be computationally expensive and time-consuming for detecting the queried object, whereas the inference of a trained DL model is generally faster. However, the trade-off is in favour of OV when flexibility is required: if a new class appears in the scene, a traditional supervised DL approach requires re-annotation and re-training, whereas with OV, we simply query the desired object.

7. Ablation Study

7.1. Projection and Multiview Voting

To evaluate our projection method, a dataset with perfect alignment between camera parameters and ground truth labels is needed. Since existing urban datasets do not meet these requirements, a synthetic benchmark is created using Blender and a part of the Hessigheim point cloud. This controlled environment allowed us to isolate and measure the accuracy of our projection method without external data errors. We compared two ways of assigning labels to the 3D space:
  • Single-View Projection: the accuracy of every individual 2D-to-3D projection is measured. Instead of averaging results per image, every projection instance is treated as a unique data point to see how well the 2D labels transfer to the 3D geometry.
  • Majority Voting: this strategy is used to handle the “real-world” reality of noisy 2D detections. When a 3D point is observed in multiple frames, it receives multiple candidate labels. The final label is chosen based on which one appears most frequently. This is designed to filter out transient errors from the 2D detector or slight camera misalignments.
To test how much the majority voting actually helps, we intentionally introduced noise into the 2D labels (ranging from 0% to 50%) to verify that the voting system can recover accuracy when the input data is of low quality. As shown in Table 6, at 0% noise, the proposed single-view projection is highly accurate (96.60%), confirming that the projection geometry is sound. However, as the data becomes noisier, the value of the fusion method becomes clear. Under extreme conditions, such as 50% noise, the single-view accuracy drops significantly, but the majority voting recovers over 13% of that lost accuracy. This shows that even if the initial voxel-based projection encounters noisy 2D detections, the multi-view voting can successfully recover the correct label by aggregating information across views.
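This recovery behaviour can be reproduced with a toy simulation; the assumption of independent, uniformly random label noise per view is made only for illustration and does not reflect the exact noise model of the benchmark.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def simulate(num_points=10_000, num_views=5, num_classes=6, noise=0.5):
    """Compare single-view accuracy against majority voting under synthetic label noise."""
    gt = rng.integers(0, num_classes, num_points)
    # Each point is observed in several views; each observation is corrupted
    # with probability `noise` by a random (possibly wrong) class.
    views = np.tile(gt, (num_views, 1))
    flip = rng.random(views.shape) < noise
    views[flip] = rng.integers(0, num_classes, flip.sum())
    single_view_acc = (views == gt).mean()
    voted = np.array([Counter(views[:, i]).most_common(1)[0][0] for i in range(num_points)])
    return single_view_acc, (voted == gt).mean()

print(simulate(noise=0.5))  # majority voting recovers part of the lost accuracy
```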

7.2. Threshold and Refinement Analysis

The impact of various statistical thresholds (τc) is analyzed to determine the most effective filtering strategy for open-vocabulary predictions. The STPLS3D–RA dataset (Section 5.2) is used to compare Mean, Median and Standard Deviation offsets (Mean ± 0.5 × σ) against a fixed default threshold of 0.5, which is the default threshold for Sa2VA. As shown in Table 7, the performance metrics across all statistical thresholds remain remarkably consistent. This stability suggests that the system is not overly sensitive to the specific choice of a statistical metric. We ultimately selected the Mean as our standard parameter; while it performs similarly to other measures, it provides a reliable, data-driven baseline that adapts to different class score distributions without requiring manual, scene-specific tuning.
The most significant performance improvement occurs after the final Refinement stage (Section 3.2.5). Although the multi-view voting projection reduces noise and misclassifications from the 2D detections, some errors still appear on the point cloud. To address this, a lightweight smoothing step reassigns labels using K-nearest neighbors and local geometric features, such as linearity and verticality. Table 7 shows that when the Mean baseline is compared to the Refinement results, both IoU and F-1 scores increase consistently across all categories. This shows that by using geometric features and a simple refinement step, better final results can be achieved.

7.3. Two-Dimensional OV Models Performance on Images

To validate the proposed architectural and overall methodology (Section 3.2), we conducted a comparative ablation study against other state-of-the-art Open-Vocabulary baselines, including Grounding DINO, Owl-ViT [55], and the recently released SAM3 [80]. The results, presented in Table 8, reveal a distinct trade-off between object-centric and scene-centric performance. While SAM3 achieves a marginally higher mean IoU (+1.7%) driven by its superior distinct object segmentation (e.g., Vehicles, Buildings), it struggles significantly with continuous semantic surfaces, dropping over 23% in accuracy on Road/Impervious surfaces compared to Sa2VA (73.0%). In the context of large-scale urban mapping, the accurate delineation of continuous classes (roads, terrain) is as critical as distinct objects. Furthermore, Sa2VA integrates the LLaVA Vision-Language Model, allowing it to segment regions directly from descriptive text prompts. This linguistic capability, combined with its superior performance on continuous surfaces like roads, made it the optimal choice for our methodology. However, thanks to our modular design, the 2D backbone is interchangeable and can be upgraded to newer architectures as they emerge.

8. Conclusions

This research showed that Open-Vocabulary (OV) segmentation methods are a feasible choice for aerial 3D point cloud classification over urban areas, especially when labelled training data are limited or unavailable. The presented pipeline can be applied to a photogrammetrically derived or to an aerial LiDAR point cloud that features accurately co-registered images.
The OV-based approach performs satisfactorily, closely matching conventional deep learning (DL) metrics for visually clear and unambiguous object classes. It demonstrated this capability by achieving satisfactory classification results across the majority of datasets and classes. This outcome supports the potential for identifying any object via a simple text query, provided the target is clearly visible in multiple aerial images. However, OV methods struggle with highly specific classes (such as urban furniture) and small objects that are difficult to detect in aerial images, including powerline cables or poles. Consequently, when dealing with complex and cluttered environments seen from airplane or drone cameras, supervised 3D DL models still deliver superior segmentation performance. Despite these challenges, the zero-shot nature of the OV approach remains a key advantage for large-scale point cloud classification, as it eliminates the need for training or manual annotations, allowing classes to be defined and detected instantly regardless of data density or scene type. Given the rapid advancement of multimodal and vision-language models, it is highly likely that 2D OV methods will soon match or even exceed the segmentation power of fully trained 3D DL approaches in certain contexts or for specific classes.
Therefore, the recommended future strategy is the hybrid integration. For example, OV methods could automatically generate labels for datasets, which would then be used for training a supervised 3D DL model (known as pseudo-labelling). The exploration of combining OV methods with unsupervised clustering techniques on the point cloud is also warranted. This combined approach utilizes the complementary strengths of 2D vision-language processing and native 3D DL, promising improved results for a greater variety of urban objects and datasets.

Author Contributions

Conceptualization, A.A. and F.R.; methodology, A.A.; software, A.A.; validation, A.A.; investigation, A.A. and F.R.; resources, F.R.; data curation, A.A. and F.R.; writing—original draft preparation, A.A.; writing—review and editing, A.A. and F.R.; visualization, A.A. and F.R.; supervision, F.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Datasets employed in the paper are available on the reported benchmark websites, except for the Graz dataset, which is available on demand for research purposes.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Grilli, E.; Remondino, F. Machine learning generalisation across different 3D architectural heritage. ISPRS Int. J. Geo-Inf. 2020, 9, 379. [Google Scholar] [CrossRef]
  2. Berrett, B.E.; Vernon, C.A.; Beckstrand, H.; Pollei, M.; Markert, K.; Franke, K.W.; Hedengren, J.D. Large-scale reality modeling of a university campus using combined UAV and terrestrial photogrammetry for historical preservation and practical use. Drones 2021, 5, 136. [Google Scholar] [CrossRef]
  3. Özdemir, E.; Remondino, F.; Golkar, A. An Efficient and General Framework for Aerial Point Cloud Classification in Urban Scenarios. Remote Sens. 2021, 13, 1985. [Google Scholar] [CrossRef]
  4. Zamanakos, G.; Tsochatzidis, L.; Amanatiadis, A.; Pratikakis, I. A comprehensive survey of LiDAR-based 3D object detection methods with deep learning for autonomous driving. Comput. Graph. 2021, 99, 153–181. [Google Scholar] [CrossRef]
  5. Cappellazzo, M.; Baldo, M.; Sammartano, G.; Spanò, A. Integrated Airborne LiDAR-UAV methods for archaeological mapping in vegetation-covered areas. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, 48, 357–364. [Google Scholar] [CrossRef]
  6. Mao, J.; Shi, S.; Wang, X.; Li, H. 3D object detection for autonomous driving: A comprehensive survey. Int. J. Comput. Vis. 2023, 131, 1909–1963. [Google Scholar] [CrossRef]
  7. Yang, S.; Hou, M.; Li, S. Three-Dimensional Point Cloud Semantic Segmentation for Cultural Heritage: A Comprehensive Review. Remote Sens. 2023, 15, 548. [Google Scholar]
  8. Bieri, V.; Zamboni, M.; Blumer, N.S.; Chen, Q.; Engelmann, F. OpenCity3D: 3D Urban Scene Understanding with Vision-Language Models. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025. [Google Scholar]
  9. Ruoppa, L.; Oinonen, O.; Taher, J.; Lehtomaki, M.; Takhtkeshha, N.; Kukko, A.; Kaartinen, H.; Hyyppa, J. Unsupervised deep learning for semantic segmentation of multispectral LiDAR forest point clouds. ISPRS J. Photogramm. Remote Sens. 2025, 228, 694–722. [Google Scholar] [CrossRef]
  10. de Gelis, I.; Saha, S.; Shahzad, M.; Corpetti, T.; Lefevre, S.; Zhu, X. Deep unsupervised learning for 3D ALS point clouds change detection. ISPRS Open J. Photogramm. Remote Sens. 2023, 9, 100044. [Google Scholar]
  11. Grilli, E.; Daniele, A.; Bassier, M.; Remondino, F.; Serafini, L. Knowledge enhanced neural networks for point cloud semantic segmentation. Remote Sens. 2023, 15, 2590. [Google Scholar] [CrossRef]
  12. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  13. Hackel, T.; Savinov, N.; Ladicky, L.; Wegner, J.D.; Schindler, K.; Pollefeys, M. SEMANTIC3D.NET: A new large-scale point cloud classification benchmark. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2017, IV-1/W1, 91–98. [Google Scholar] [CrossRef]
  14. Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep learning for 3D point clouds: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4338–4364. [Google Scholar] [CrossRef] [PubMed]
  15. Zeng, J.; Wang, D.; Chen, P. A Survey on Transformers for Point Cloud Processing: An Updated Overview. IEEE Access 2022, 10, 86510–86527. [Google Scholar] [CrossRef]
  16. Zhang, R.; Wu, Y.; Jin, W.; Meng, X. Deep-learning-based point cloud semantic segmentation: A survey. Electronics 2023, 12, 3642. [Google Scholar] [CrossRef]
  17. Sun, Y.; Zhang, X.; Miao, Y. A review of point cloud segmentation for understanding 3D indoor scenes. Vis. Intell. 2024, 2, 14. [Google Scholar] [CrossRef]
  18. Betsas, T.; Georgopoulos, A.; Doulamis, A.; Grussenmeyer, P. Deep learning on 3D semantic segmentation: A detailed review. Remote Sens. 2025, 17, 298. [Google Scholar] [CrossRef]
  19. Das, A.; Xian, Y.; He, Y.; Akata, Z.; Schiele, B. Urban scene semantic segmentation with low-cost coarse annotation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 5978–5987. [Google Scholar]
  20. Chen, Z.; Xu, H.; Chen, W.; Zhou, Z.; Xiao, H.; Sun, B.; Xie, X.; Kang, W. PointDC: Unsupervised Semantic Segmentation of 3D Point Clouds via Cross-modal Distillation and Super-Voxel Clustering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023. [Google Scholar]
  21. Liu, J.; Yu, Z.; Breckon, T.; Shum, H.P.H. U3DS3: Unsupervised 3D Semantic Scene Segmentation. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024. [Google Scholar]
  22. Yi, L.; Gong, B.; Funkhouser, T. Complete & label: A domain adaptation approach to semantic segmentation of LiDAR point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 15363–15373. [Google Scholar]
Figure 1. Visual results of the proposed open-vocabulary classification pipeline on different photogrammetric urban scenes: STPLS3D-RA (a,b), Hessigheim (c,d), Graz (e,f) and STPLS3D-USC (g,h).
Figure 2. Schematic timeline of the most important 3D deep learning classification methods applied to point clouds.
Figure 3. The proposed adaptive Open-Vocabulary segmentation pipeline. OVD: Open-Vocabulary Detections.
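The pipeline in Figure 3 relies on transferring Open-Vocabulary Detections (OVD) obtained in the 2D images onto the 3D point cloud. As an illustration only, the snippet below sketches one common way to perform such 2D-to-3D label transfer with a pinhole projection matrix; occlusion handling is omitted, and all names (transfer_mask_label, P, mask) are hypothetical assumptions rather than the paper's implementation.

```python
import numpy as np

def transfer_mask_label(points_xyz, P, mask, label):
    """Label the 3D points whose projection falls inside a 2D open-vocabulary mask.

    points_xyz : (N, 3) point coordinates in the same frame as the camera
    P          : (3, 4) pinhole projection matrix K[R | t] of one aerial image
    mask       : (H, W) boolean segmentation mask predicted by the 2D OV model
    label      : semantic label string attached to that mask
    Returns an (N,) object array holding `label` for hit points and None elsewhere.
    """
    n = points_xyz.shape[0]
    homog = np.hstack([points_xyz, np.ones((n, 1))])      # homogeneous coordinates
    uvw = homog @ P.T                                      # project into the image plane
    in_front = uvw[:, 2] > 0                               # discard points behind the camera
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-9, None)     # perspective division
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    h, w = mask.shape
    inside = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    hit = inside.copy()
    hit[inside] = mask[v[inside], u[inside]]               # pixel lies on the detected object
    view_labels = np.full(n, None, dtype=object)
    view_labels[hit] = label
    return view_labels
```

Per-view label arrays produced this way can then be fused across the overlapping aerial images, for instance with the majority voting illustrated after Table 6.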
Figure 4. Qualitative results of OV methods (Ours) vs. deep models on the STPLS3D-USC scene (metrics in Table 2).
Figure 5. Qualitative results of OV methods (Ours) and deep models on the STPLS3D-RA scene (metrics in Table 3).
Figure 6. Qualitative results of OV methods (Ours) and deep models on Hessigheim data (metrics in Table 4).
Figure 7. Qualitative results of OV methods (Ours) and deep models on Graz data (metrics in Table 5).
Figure 8. Open-Vocabulary (OV) misclassification caused by an image–point cloud temporal misalignment (ca 1 year). Moving objects such as cars create problems when they are detected in the images but are not present in the point cloud; the OV prediction therefore generates false detections for cars absent in the 3D data.
Figure 9. Estimated effort and processing time comparison. Unlike supervised methods (bottom), which rely on extensive manual annotation (red), the proposed pipeline (top) shifts the workload to automated machine computation (blue), restricting human effort to a lightweight prompt-tuning phase if necessary.
Table 1. Employed datasets. The original number of classes is reported in parentheses next to the number of classes used.
Dataset | USC (STPLS3D) | RA (STPLS3D) | Hessigheim 3D | Graz
Source | Photogrammetry | Photogrammetry | Photogrammetry | Photogrammetry
Platform | Drone | Drone | Drone | Aircraft
# Images | ca 4500 | ca 1900 | ca 1000 | ca 50
Image type | Oblique | Oblique | Nadir | Nadir + Oblique
Image size (px) | 4864 × 3648 | 4864 × 3648 | 14,204 × 10,652 | 14,144 × 10,560
Avg GSD | 1–2 cm | 1–2 cm | 2–3 cm | 5 cm (nadir)
# Classes | 8 (9) | 8 (9) | 7 (11) | 7
Area size (km²) | 0.2 | 0.06 | 0.1 | 1.6
# 3D points (million) | 29.3 | 6.8 | 82 | 107
Table 4. Metric results on the Hessigheim 3D dataset. (*): merged facade, roof and chimney. (**): unlabelled points, vertical surface and urban furniture.
Method | Low Vegetation | Impervious Surface | Vehicle | Soil/Gravel | Building (*) | Tree/Shrub | Unknown/Other (**)
(each cell: IoU / F-1)
KPConv | 80.24 / 89.04 | 77.55 / 87.36 | 82.71 / 90.54 | 0 / 0 | 94.22 / 97.03 | 87.38 / 93.27 | 46.72 / 63.68
SPT | 55.05 / 71.01 | 56.54 / 72.24 | 36.65 / 53.64 | 14.78 / 25.76 | 67.48 / 80.58 | 76.63 / 86.77 | 25.15 / 40.2
PN++ | 77.23 / 87.15 | 78.76 / 88.12 | 33.93 / 50.67 | 24.5 / 39.36 | 83.67 / 91.11 | 58.33 / 73.68 | 26.33 / 41.69
PTv1 | 69.58 / 82.06 | 65.09 / 78.86 | 10.18 / 18.47 | 0.19 / 0.37 | 83.3 / 90.89 | 58.41 / 73.74 | 18.5 / 31.22
PTv3 | 72.06 / 83.76 | 64.99 / 78.78 | 53.83 / 69.99 | 14.72 / 25.66 | 87.48 / 93.32 | 86.05 / 92.5 | 38.98 / 56.09
Ours (G-DINO) | 31.11 / 47.45 | 33.08 / 49.72 | 8.46 / 15.59 | 5.9 / 11.15 | 61.01 / 75.78 | 21.72 / 35.68 | 6.1 / 11.51
Ours (Sa2VA) | 47.11 / 64.05 | 46.67 / 63.64 | 13.7 / 24.09 | 7.13 / 13.32 | 76.19 / 86.49 | 28.28 / 44.09 | 6.57 / 12.33
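As a reading aid for Tables 4, 5 and 7, the per-class scores follow the standard definitions below. Assuming the reported F-1 is the Dice score computed from the same per-class true and false positives/negatives, it is a monotone function of the IoU; for example, KPConv's low-vegetation IoU of 80.24 yields F-1 = 2 × 80.24 / 180.24 ≈ 89.0, matching the table.

```latex
\mathrm{IoU}_c=\frac{TP_c}{TP_c+FP_c+FN_c},
\qquad
\mathrm{F1}_c=\frac{2\,TP_c}{2\,TP_c+FP_c+FN_c}=\frac{2\,\mathrm{IoU}_c}{1+\mathrm{IoU}_c}.
```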
Table 5. Metric results for the Graz dataset.
Method | Grass | Tree | Roof | Car | Imp. Surface | Facade | Other
(each cell: IoU / F-1)
KPConv | 18.25 / 30.87 | 82.64 / 90.49 | 85.21 / 92.01 | 54.65 / 70.68 | 53.9 / 70.04 | 80.7 / 89.32 | 10.42 / 18.87
SPT | 64.09 / 78.11 | 82.15 / 90.2 | 81.95 / 90.08 | 49.9 / 66.58 | 67.35 / 80.49 | 73.73 / 84.88 | 14.41 / 25.19
PN++ | 61.88 / 76.45 | 73.75 / 84.9 | 75.79 / 86.23 | 37.87 / 54.93 | 67.73 / 80.76 | 49.61 / 66.32 | 9.14 / 16.75
PTv1 | 49.47 / 66.19 | 77.45 / 87.29 | 81.99 / 90.1 | 42.8 / 59.94 | 64.78 / 78.63 | 75.02 / 85.73 | 12.99 / 23
PTv3 | 60.11 / 75.08 | 82.56 / 90.44 | 84.51 / 91.61 | 54.65 / 70.68 | 70.3 / 82.56 | 79.57 / 88.63 | 13.49 / 23.77
Ours (G-DINO) | 36.97 / 53.98 | 57.34 / 72.89 | 13.58 / 23.92 | 34.31 / 51.09 | 21.95 / 36.01 | 27.68 / 43.35 | 1.62 / 3.18
Ours (Sa2VA) | 58.41 / 73.74 | 75.6 / 86.11 | 75.49 / 86.03 | 44.61 / 61.7 | 59.96 / 74.97 | 68.22 / 81.10 | – / –
Table 6. Comparison of Single-View vs. Majority Voting accuracy across different noise percentages, showing the system’s ability to recover correct semantic labels through multi-view redundancy.
Noise % | Single-View | Majority Vote | Recovery
0% | 96.60% | 97.30% | +0.70%
5% | 91.80% | 95.10% | +3.30%
10% | 87.00% | 92.70% | +5.80%
15% | 82.10% | 90.10% | +7.90%
20% | 77.30% | 87.10% | +9.80%
25% | 72.50% | 83.90% | +11.40%
30% | 67.60% | 80.30% | +12.60%
40% | 58.00% | 71.80% | +13.90%
50% | 48.30% | 61.80% | +13.40%
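Table 6 indicates that fusing per-view predictions by majority vote makes point labels considerably more robust to single-view errors. A minimal sketch of such per-point majority voting is given below; it assumes each view contributes a sequence of candidate labels with None where a point is not observed, and the function name and data layout are illustrative rather than the paper's code.

```python
from collections import Counter

def majority_vote(labels_per_view):
    """Fuse per-view semantic labels into one label per 3D point by majority voting.

    labels_per_view : list of per-view sequences; each gives a label (or None when the
                      point is unobserved in that view) for every point.
    Returns a list with the most frequent observed label per point (None if never seen).
    """
    n_points = len(labels_per_view[0])
    fused = []
    for i in range(n_points):
        votes = Counter(view[i] for view in labels_per_view if view[i] is not None)
        fused.append(votes.most_common(1)[0][0] if votes else None)
    return fused

# Toy usage: the views disagree on points 1 and 2; the vote recovers the majority label.
views = [["building", "car", None],
         ["building", "car", "tree"],
         ["building", "road", "tree"]]
print(majority_vote(views))   # ['building', 'car', 'tree']
```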
Table 7. Evaluation of statistical thresholds (τc) and the impact of geometric refinement on the RA dataset. (*) denotes merged classes; (**) denotes clutter and unknown categories. The results illustrate the stability of the adaptive thresholding and the consistent performance increase provided by the refinement step.
Threshold (τc) | Building | Vegetation | Vehicle | Poles | Fence | Road/Imp. Surface | Grass/Dirt (*) | Other (**)
(each cell: IoU / F-1)
Mean | 79.96 / 88.86 | 56.12 / 71.9 | 37.02 / 54.04 | 4.46 / 8.54 | 26.48 / 41.87 | 65.13 / 78.88 | 40.69 / 57.84 | 0.16 / 0.32
Median | 80.05 / 88.92 | 56.19 / 71.95 | 37.06 / 54.07 | 4.49 / 8.59 | 26.54 / 41.95 | 65.09 / 78.85 | 40.75 / 57.9 | 0.15 / 0.3
Mean − 0.5 × σ | 80.04 / 88.92 | 56.19 / 71.95 | 37.06 / 54.08 | 4.49 / 8.59 | 26.54 / 41.95 | 65.08 / 78.85 | 40.75 / 57.91 | 0.15 / 0.3
Mean + 0.5 × σ | 79.54 / 88.6 | 55.93 / 71.74 | 36.97 / 53.98 | 4.38 / 8.39 | 26.26 / 41.6 | 65.1 / 78.86 | 40.53 / 57.68 | 0.19 / 0.38
0.5 | 77.08 / 87.06 | 53.12 / 69.39 | 36.18 / 53.13 | 5.1 / 9.72 | 27.34 / 42.94 | 63.86 / 77.95 | 40.02 / 57.16 | 0.69 / 1.37
Refinement | 83.94 / 91.27 | 59.97 / 74.97 | 40.2 / 57.34 | 6.62 / 12.42 | 31.51 / 47.92 | 66.88 / 80.15 | 41.89 / 59.04 | – / –
Table 8. Quantitative comparison of Open-Vocabulary models (Grounding DINO, Owl-ViT, Sa2VA, SAM3) on a manually annotated subset of the STPLS3D dataset.
Model | Building | Vehicle | Fence | Grass/Dirt | Poles | Road/Imp. Surface | Vegetation | mIoU
Grounding DINO | 49.2 | 68.00 | 5.40 | 45.70 | 16.90 | 63.00 | 63.10 | 44.50
Owl-ViT | 34.30 | 64.00 | 4.70 | 9.90 | 15.90 | 34.70 | 55.90 | 31.30
Sa2VA | 68.00 | 75.10 | 14.90 | 64.30 | 23.90 | 73.00 | 73.10 | 56.00
SAM3 | 79.50 | 85.20 | 24.50 | 55.40 | 36.30 | 49.60 | 73.50 | 57.70
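For completeness, the mIoU column of Table 8 is consistent with the unweighted mean of the seven per-class IoU values; for Sa2VA, for instance, (68.00 + 75.10 + 14.90 + 64.30 + 23.90 + 73.00 + 73.10) / 7 ≈ 56.0.

```latex
\mathrm{mIoU}=\frac{1}{C}\sum_{c=1}^{C}\mathrm{IoU}_c, \qquad C = 7.
```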