Article

MambaMeshSeg-Net: A Large-Scale Urban Mesh Semantic Segmentation Method Using a State Space Model with a Hybrid Scanning Strategy

by Wenjie Zi, Hao Chen, Jun Li and Jiangjiang Wu *
College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(9), 1653; https://doi.org/10.3390/rs17091653
Submission received: 2 April 2025 / Revised: 30 April 2025 / Accepted: 5 May 2025 / Published: 7 May 2025

Abstract
Semantic segmentation of urban meshes plays an increasingly crucial role in the analysis and understanding of 3D environments. Most existing large-scale urban mesh semantic segmentation methods focus on integrating multi-scale local features but struggle to model long-range dependencies across facets effectively. Furthermore, owing to high computational complexity or excessive pre-processing operations, these methods lack the capability for the efficient semantic segmentation of large-scale urban meshes. Inspired by Mamba, we propose MambaMeshSeg-Net, a novel 3D urban mesh semantic segmentation method based on the State Space Model (SSM). The proposed method incorporates a hybrid scanning strategy that adaptively scans 3D urban meshes to extract long-range dependencies across facets, enhancing semantic segmentation performance. Moreover, our model exhibits faster performance in both inference and pre-processing compared to other mainstream models. In comparison with existing state-of-the-art (SOTA) methods, our model demonstrates superior performance on two widely utilized open urban mesh datasets.

1. Introduction

Three-dimensional urban meshes are large-scale textured triangular mesh data that provide highly realistic visual representations, intuitiveness, and a wealth of geometric information. They consist of numerous triangular facets of varying shapes and sizes, each associated with texture images as well as spatial coordinates, texture coordinates, normal vectors, colors, and other pertinent information. Urban meshes exhibit continuous surfaces and photorealistic effects, attracting increasing interest across diverse fields.
Three-dimensional urban mesh semantic segmentation is a fundamental and pivotal task in 3D data analysis. It aims to classify and label complex 3D mesh data by assigning distinct semantic categories, such as buildings, roads, and bodies of water, to each triangular facet. Three-dimensional urban mesh semantic segmentation is widely applied in domains such as 3D geographic information systems (GIS) [1,2,3], path planning [4,5,6], urban planning [7,8,9,10], traffic management [11,12,13,14], and autonomous driving [15,16,17], enhancing the intelligent analysis and detailed comprehension of urban environments.
This study seeks to develop an efficient and scalable semantic segmentation framework for large-scale 3D urban meshes by addressing challenges related to computational complexity, irregular mesh structures, and spatial information preservation. Accordingly, this study proposes MambaMeshSeg-Net, a novel segmentation network based on the State Space Model (SSM), integrated with an octree-based Z-order serialization mechanism and a hybrid scanning strategy, to enhance both segmentation accuracy and scalability. It is hypothesized that the method is applicable under conditions of fully supervised training with clean annotations and adequate spatial coverage.
In recent years, significant advancements have been made in 3D mesh data semantic segmentation, driven by rapid progress in feature extraction and representation using deep learning models. However, the inherent complexity and irregularity of urban meshes present considerable challenges for 3D urban mesh semantic segmentation. Existing 3D urban mesh semantic segmentation methods can be broadly divided into three categories: superfacet-based methods [18,19,20], global parameterization methods [21,22,23], and multi-modal methods [19,24,25,26]. Superfacet-based methods usually involve a two-step process: an over-segmentation step followed by a superfacet classification step. However, the over-segmentation step depends on non-differentiable clustering, preventing global optimization. Global parameterization methods map the 3D triangular facets into a 2D domain, enabling the application of existing 2D semantic segmentation approaches. Although these methods can employ a wide range of CNN-based 2D semantic segmentation approaches [27], the reparameterization process may cause distortions, discretization errors, and occlusions in the parameterized domain, negatively impacting 3D urban mesh semantic segmentation accuracy. Multi-modal methods combine additional data modalities (e.g., images, point clouds, and center of gravity (CoG)) with mesh data, enabling the extraction of rich 3D spatial features and improving the performance of 3D urban mesh semantic segmentation. However, multi-modal approaches that integrate image and 3D mesh data exhibit reduced performance in distorted regions, where misalignment between image and mesh data leads to feature confusion. Methods incorporating urban meshes with a CoG primarily extract local features, which may insufficiently capture the overall structural features of the urban meshes. In addition, all the mentioned methods are time-consuming and computationally intensive.
Superfacet-based methods generally suffer from limited scalability due to the computational burden introduced by non-differentiable over-segmentation processes, which become increasingly inefficient on large-scale urban meshes. Additionally, their resilience is constrained by the dependence on the quality of initial segmentation, making them vulnerable to noise, irregularities, and topological inconsistencies. Global parameterization methods demonstrate relatively strong scalability, as the transformation to a 2D domain allows the use of mature and efficient 2D convolutional networks. However, resilience remains a challenge, as parameterization often introduces distortions and occlusions that degrade performance, particularly in highly complex or distorted regions. Multi-modal methods enhance resilience by integrating complementary data modalities, which helps to alleviate the adverse effects of local mesh defects and variations. Nevertheless, their scalability is often compromised by the substantial computational and memory demands associated with processing and fusing multiple sources of information. As the octree-based Z-order encoding mechanism can effectively capture the geometric features of point clouds and urban meshes, the proposed model exhibits strong scalability. MambaMeshSeg-Net, based on the SSM, exhibits linear computational complexity; however, its scalability is constrained by the necessity of processing diverse data features.
Inspired by Mamba [28], we propose a novel 3D urban mesh semantic segmentation method based on the State Space Model (SSM), named MambaMeshSeg-Net. MambaMeshSeg-Net not only captures long-range dependencies across facets of urban meshes but also achieves high efficiency in both inference and pre-processing. Since SSM-based models lack the capability to directly capture features of 3D mesh data, we design an octree-based ordering mechanism with Z-order curves to convert irregular urban meshes into 1D mesh sequences. In addition, we propose a novel hybrid scanning strategy that can extract forward and backward spatial features, as well as non-continuous spatial features with a learnable stride size, from 1D mesh sequences, enhancing 3D urban mesh semantic segmentation performance.
The contributions of our work are as follows:
  • Inspired by Mamba, we propose MambaMeshSeg-Net, a novel large-scale 3D urban mesh semantic segmentation method based on the SSM, which has linear computational complexity, and is especially suitable for large-scale 3D urban scenes. To the best of our knowledge, this is the first work that leverages the SSM for 3D urban mesh semantic segmentation.
  • We design an octree-based ordering mechanism with Z-order curves for the 1D serialization of 3D urban meshes and a novel non-continuous scanning approach with a learnable stride size, enabling our SSM-based model to effectively capture 3D urban meshes’ spatial features and their dependencies.
  • Our proposed method demonstrates high computational efficiency, outperforming other methods in both data pre-processing and inference time.
  • We validate the proposed method through extensive experiments on two real-world datasets. Our model surpasses several state-of-the-art (SOTA) methods in mean F1 score, overall accuracy, and mIoU.

2. Related Work

2.1. Urban Mesh Semantic Segmentation

Substantial progress has been achieved in the semantic segmentation of urban meshes, driven by advancements in deep learning-based feature extraction techniques. Urban mesh semantic segmentation is typically classified into three primary methodologies: superfacet-based, global parameterization, and multi-modal approaches [18,19,20,21,22,23,24,25,29]. RF-MRF [18] utilizes random forests to extract abstract features from raw urban meshes, yielding preliminary semantic information, which is subsequently refined through Markov random field smoothing. SUM-RF [19], similar to RF-MRF but lacking the MRF component, combines point cloud data with mesh geometric features to facilitate semantic segmentation. PSSNet [20] outperforms SUM-RF in multiple aspects, leveraging random forests and Markov random fields for superfacet clustering and utilizing PointNet and gated graph sequences for classification. Furthermore, PSSNet employs SPG [30] to extract point cloud features, thereby improving superfacet classification. Superfacet-based approaches enable effective local feature extraction but rely on non-differentiable clustering, thereby impeding global optimization. TextureNet [21], a leading global parameterization technique, utilizes QuadriFlow [31] to parameterize 3D mesh scenes into a four-way rotationally symmetric field, thus producing a uniform directional field for sampled points. However, the parameterization process incurs directional ambiguities as an inherent trade-off. Multi-modal approaches are categorized into two main types: those integrating mesh and point cloud data and those incorporating mesh data with center of gravity (CoG) features. CoG-based approaches utilize the facet centroid and its associated features, including geometric and radiometric attributes. However, since the number of points generally corresponds to the number of mesh facets, CoG-based methods are limited in capturing fine-grained textural details of the mesh. Point cloud-based methods offer a viable alternative to mitigate the limitations of superfacet-based approaches, by integrating sampled point clouds to enhance feature representation and elevate semantic segmentation performance for urban meshes. However, these approaches entail substantial computational costs, rendering them impractical for the efficient semantic segmentation of large-scale urban meshes. The proposed model combines mesh and point cloud features to achieve semantic segmentation of large-scale urban meshes. It employs a novel hybrid scanning strategy to adaptively extract comprehensive spatial features, thus improving semantic segmentation performance. Moreover, by exploiting the linear complexity of the State Space Model (SSM), the proposed method facilitates the efficient semantic segmentation of large-scale urban meshes.

2.2. Mamba and Its Applications

Mamba, an emerging deep learning architecture, provides an efficient sequence modeling framework by leveraging the strengths of the SSM. Compared to traditional Transformer architectures, Mamba achieves near-linear time complexity when processing long-sequence data, establishing its advantage in large-scale data processing applications. In natural language processing, extensive research has explored Mamba’s potential across diverse downstream tasks [28,32]. In video processing, Mamba exhibits superior capability in distinguishing short-term actions and interpreting extended temporal dependencies. For instance, VideoMamba [33] utilizes 3D convolutions to partition input videos into non-overlapping spatiotemporal patches and encodes these patches into vector representations via stacked bi-directional Mamba blocks. For unordered data processing, Mamba-based vision models have been developed to address computational and memory constraints while attaining competitive performance [34,35,36]. Vision Mamba [34,37,38,39,40], for example, conducts global visual-semantic modeling by integrating bi-directional SSMs with positional embeddings to facilitate position-aware visual understanding. Unlike attention-based mechanisms, Vision Mamba reduces computational complexity to sub-quadratic levels while maintaining linear memory efficiency, achieving performance that is comparable to Vision Transformers.
Mamba-based models have not yet been applied to the analysis and processing of 3D urban meshes. To bridge this research gap, we propose MambaMeshSeg-Net, which incorporates a hybrid scanning strategy. By utilizing the hybrid scanning strategy, our method can capture diverse spatial features, facilitating precise and robust urban mesh semantic segmentation.

3. Method

This section offers a comprehensive overview of the MambaMeshSeg-Net framework, and the following subsections provide a systematic exposition of its architectural components.

3.1. Overview

As illustrated in Figure 1, we propose MambaMeshSeg-Net, a U-shaped network that integrates contextual spatial features and long-range dependencies across facets while preserving original features and enhancing semantic segmentation performance. The proposed model comprises an octree-based ordering module, multiple MambaMesh modules, several Feature Evaluation Fusion (FEF) modules, and multi-layer perceptrons (MLPs).
First, we employ an octree-based ordering mechanism with Z-order curves to convert 3D urban meshes into a sequential representation while preserving spatial adjacency within the 3D urban meshes. Second, during the down-sampling phase, the serialized mesh data are processed sequentially through three MambaMesh modules (MambaMesh 256, MambaMesh 128, and MambaMesh 32). These modules progressively extract hierarchical features, reducing the feature dimensions from 256 to 128 and subsequently to 32. In MambaMesh, the Strided State Space Model (SSM) optimizes the interval N, which corresponds to receptive fields of varying scales, during the training phase to produce features across multiple scales. Finally, during the up-sampling phase, features starting from MambaMesh 32 are gradually up-sampled and merged with the corresponding down-sampled features to progressively restore higher feature dimensions (32→128→256). Additionally, at the skip connections between the down-sampling and up-sampling stages, FEF modules are employed to merge the corresponding features. These modules integrate features extracted at different stages and enhance feature representation through weighted additive fusion.
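To make the data flow concrete, the following is a minimal PyTorch sketch of the U-shaped pipeline described above. The PlaceholderBlock and UShapedSketch classes, the plain additive skip fusion, and the example tensor sizes are illustrative assumptions rather than the authors' implementation; only the 256→128→32→128→256 dimension flow follows the text.

```python
import torch
import torch.nn as nn

class PlaceholderBlock(nn.Module):
    """Stand-in for a MambaMesh module: maps one feature width to another."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim_in, dim_out), nn.GELU())

    def forward(self, x):              # x: (batch, sequence_length, dim_in)
        return self.proj(x)

class UShapedSketch(nn.Module):
    def __init__(self, in_dim=256, num_classes=6):
        super().__init__()
        # Down-sampling path: 256 -> 128 -> 32
        self.down1 = PlaceholderBlock(in_dim, 128)
        self.down2 = PlaceholderBlock(128, 32)
        # Up-sampling path: 32 -> 128 -> 256, fused with skip features
        self.up1 = PlaceholderBlock(32, 128)
        self.up2 = PlaceholderBlock(128, 256)
        self.head = nn.Linear(256, num_classes)

    def forward(self, x):
        f256 = x                        # serialized facet features (B, L, 256)
        f128 = self.down1(f256)
        f32 = self.down2(f128)
        u128 = self.up1(f32) + f128     # additive skip fusion (FEF greatly simplified)
        u256 = self.up2(u128) + f256
        return self.head(u256)          # per-facet class logits

logits = UShapedSketch()(torch.randn(2, 1024, 256))
print(logits.shape)                     # torch.Size([2, 1024, 6])
```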

3.2. Octree-Based Ordering Mechanism

Since MambaMesh requires a one-dimensional serialized data sequence as input, urban meshes must be converted into a corresponding one-dimensional format. Inspired by recent research on octree encoding [41,42], we discovered that employing 3D Z-order ordering based on octree structures effectively preserves spatial adjacency in 3D urban mesh analysis. In this approach, mesh data are serialized using space-filling curves that maintain spatial locality, ensuring that facets adjacent in 3D space remain contiguous in the serialized sequence. For each facet, its spatial coordinates are represented by the centroid, and we employ the parallel algorithm implemented by OctFormer [42] to construct the octree on the GPU. After constructing the octree, nodes at the same depth are sorted using shuffled keys. The shuffled key is defined in its binary representation as follows:
$\mathrm{Key}(x, y, z) = x_1 y_1 z_1 \, x_2 y_2 z_2 \cdots x_d y_d z_d$
where $x_i$, $y_i$, and $z_i$ denote the $i$-th binary digits of the coordinates of the point (the centroid of an urban mesh facet) within the octree node, and $d$ denotes the octree's depth. The value of Key(x, y, z) represents the point's position along the 3D Z-curve. The position encoding of the 3D Z-curve preserves spatial adjacency relationships, thereby facilitating the capture of causal dependencies and geometric information.
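The following is a minimal sketch of this shuffled-key (Morton/Z-order) encoding, assuming facet centroids have already been quantized to integer grid coordinates at the given octree depth. The function name z_order_key is illustrative and is not taken from the authors' code or from OctFormer.

```python
def z_order_key(x: int, y: int, z: int, depth: int) -> int:
    """Interleave the coordinate bits as x1 y1 z1 x2 y2 z2 ... xd yd zd."""
    key = 0
    for i in range(depth):
        bit = depth - 1 - i  # take bits from the most significant downwards
        key = (key << 3) | (((x >> bit) & 1) << 2) | (((y >> bit) & 1) << 1) | ((z >> bit) & 1)
    return key

# Facets whose centroids are close in 3D receive nearby keys, so sorting by
# key yields a 1D sequence that preserves spatial adjacency.
centroids = [(1, 2, 3), (1, 2, 2), (7, 0, 5)]
order = sorted(range(len(centroids)), key=lambda i: z_order_key(*centroids[i], depth=3))
print(order)
```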

3.3. MambaMesh

As illustrated in Figure 2, the proposed MambaMesh module consists of a hybrid scanning strategy, a spatial feature extractor, Depthwise Convolution (DWConv), MLP, and Root Mean Square Normalization (RMSNorm). MambaMesh applies RMSNorm to stabilize the training process and ensure spatial feature consistency. To address the spatial feature extraction limitations of the SSM, we propose a spatial feature extractor designed to enhance the representation of urban meshes. This extractor incorporates DWConv and MLP, with DWConv providing superior computational efficiency over traditional convolutions, while the simplified MLP structure ensures high computational performance. Our spatial feature extractor facilitates the extraction of rich local spatial features while maintaining model efficiency. Urban meshes represent complex, irregular 3D data, posing challenges for the original uni-directional Mamba in capturing global dependencies across facets. To effectively model the global features of urban meshes, we propose a hybrid scanning strategy that extracts forward and backward spatial features, as well as non-continuous spatial features with a learnable stride size N, from 1D mesh sequences. The hybrid scanning strategy captures non-contiguous features through strided sampling, expanding the model's receptive field and enhancing feature diversity. The adaptive feature extraction of our hybrid scanning strategy provides robust support for the semantic segmentation of urban meshes.
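The sketch below illustrates the three scan directions, assuming the serialized facet features form a tensor of shape (L, D). The ssm argument stands in for a single selective-scan pass, and the fixed stride and averaging fusion are simplifying assumptions; in the actual model the stride is learnable and the fusion is part of the MambaMesh module.

```python
import torch

def hybrid_scan(features: torch.Tensor, ssm, stride: int = 5) -> torch.Tensor:
    # Forward scan over the 1D mesh sequence.
    forward = ssm(features)
    # Backward scan: reverse the sequence, scan, then restore the order.
    backward = ssm(features.flip(0)).flip(0)
    # Strided (non-continuous) scan: gather every stride-th facet, scan the
    # sub-sequence, and scatter the results back to their original slots.
    strided = torch.zeros_like(features)
    for offset in range(stride):
        idx = torch.arange(offset, features.shape[0], stride)
        strided[idx] = ssm(features[idx])
    # Fuse the three views (simple averaging here, purely for illustration).
    return (forward + backward + strided) / 3.0

# Example with an identity "SSM" just to exercise the control flow.
out = hybrid_scan(torch.randn(16, 8), ssm=lambda x: x, stride=5)
print(out.shape)  # torch.Size([16, 8])
```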

3.4. Feature Evaluation Fusion

Our proposed model can capture spatial features from 3D urban meshes. However, some of these features remain ambiguous, particularly in regions encompassing distorted vehicles and ground surfaces, impeding the model's ability to learn them effectively. To address this issue, we design a feature evaluation fusion module that reinforces low-uncertainty features across multiple scales at skip connections within the distorted regions. The $i$-th feature $f^i \in \mathbb{R}^{D_i \times W_i}$ denotes the feature produced at the $i$-th down-sampling stage, i.e., the output of MambaMesh 256, MambaMesh 128, or MambaMesh 32. Its uncertainty is expressed as:
$u_i = -\bar{f}^i \log \bar{f}^i$
$\bar{f}^i = \sigma\!\left( \frac{1}{C} \sum_{c=1}^{C} f_c^i \right)$
where $u_i$ is the uncertainty value of feature $f^i$, the sigmoid function $\sigma$ is used to normalize the features, $C$ is the number of categories, and $f_c^i$ represents the features of class $c$ within the feature $f^i$. To enhance the feature representation of irregular regions, the features integrated at the previous stage are represented as follows:
$\tilde{f}^i = f^i \cdot (1 - u_i)$
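A minimal sketch of this re-weighting step is given below, assuming a per-stage feature tensor with a leading class dimension of size C. The function name and tensor layout are illustrative, and the exact arrangement in the paper may differ.

```python
import torch

def fef_reweight(f_i: torch.Tensor) -> torch.Tensor:
    # f_i: (C, W) class-wise features for one down-sampling stage.
    f_bar = torch.sigmoid(f_i.mean(dim=0))          # average over the class dimension, then normalize
    u = -f_bar * torch.log(f_bar.clamp_min(1e-8))   # entropy-style per-position uncertainty
    return f_i * (1.0 - u)                          # down-weight high-uncertainty positions

f = torch.randn(6, 1024)   # 6 classes, 1024 facets (illustrative sizes)
print(fef_reweight(f).shape)
```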

3.5. Loss Function

In this work, Focal Loss is employed due to the class imbalance that is inherent in real-world urban meshes. Focal Loss is based on cross-entropy loss. The expression for Focal Loss [43] is as follows:
$L_{\mathrm{FL}} = -\sum_{c=1}^{C} \alpha_c \left(1 - p_c\right)^{\gamma} \log p_c$
where $C$ denotes the total number of classes and $p_c$ is the predicted probability of the model for class $c$. The weighting factor for class $c$, represented by $\alpha_c$, addresses class imbalance by assigning larger weights to less frequent classes. The focusing parameter, denoted by $\gamma$, adjusts the weight of samples that are easily and less easily classified. It reduces the loss contribution from the easily classified samples and directs the model's focus toward harder-to-classify samples. In this experiment, we configure $\alpha_c$ as the inverse square root of the frequency of class $c$, and we set $\gamma$ to 2.
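The following is a minimal focal-loss sketch consistent with the description above, with alpha set to the inverse square root of the class frequency and gamma = 2. It is an illustrative re-implementation rather than the authors' code, and the logits/labels shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, class_freq, gamma=2.0):
    # logits: (B, C) per-facet class scores; labels: (B,) integer class ids.
    alpha = class_freq.float().clamp_min(1).rsqrt()               # 1 / sqrt(frequency)
    log_p = F.log_softmax(logits, dim=-1)                          # (B, C)
    log_p_t = log_p.gather(1, labels.unsqueeze(1)).squeeze(1)      # log prob of true class
    p_t = log_p_t.exp()
    weight = alpha[labels] * (1.0 - p_t) ** gamma                  # class weight x focusing term
    return -(weight * log_p_t).mean()

logits = torch.randn(8, 6)
labels = torch.randint(0, 6, (8,))
freq = torch.tensor([1000, 50, 300, 800, 20, 10])                  # illustrative class counts
print(focal_loss(logits, labels, freq))
```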

4. Experiment

4.1. Dataset

In this experiment, the proposed method is evaluated using two publicly available real-world datasets: SUM [19] and semanticMetropolis (seMet) [44]. The SUM dataset covers about 4 km² and includes six semantic classes: terrain, vehicle, water, building, high vegetation, and boat. The dataset contains 64 tiles, with each tile covering an area of 250 m × 250 m. Following the methodology outlined in [19], 40 tiles are allocated for training, 12 for testing, and 12 for validation.
The seMet dataset consists of 19 tiles with approximately 20 million facets, each covering an area of 450 m² in Hong Kong, China. It includes four semantic classes: terrain, building, high vegetation, and car. Of these, 12 tiles are designated for training, 4 for testing, and 3 for validation. The seMet dataset serves as a robust benchmark for evaluating urban mesh segmentation models.

4.2. Implementation Details

All experiments are conducted on a Linux server configured with a 2.4 GHz CPU and an NVIDIA RTX 4090 GPU. The model is trained using a batch size of 2 and the ADAM optimizer [45]. Weight decay is set to 0.05. The initial learning rate is set at 0.0015 and is subsequently decreased tenfold after the 360th and 560th epochs using the MultiStepLR scheduler [46]. All meshes are encoded into octrees with a depth of 6 levels.
Each facet is represented by coordinates (xyz), colors (RGB), and the face normal vector of the corresponding face in the sampled point cloud. Specifically, urban meshes undergo Poisson disk sampling (PDS) [47] with a density parameter of 0.4 to generate point clouds, which serve as input for these semantic segmentation methods. Finally, the sampled point clouds are fed into our network.
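The optimizer and learning-rate schedule described above can be set up as in the short PyTorch sketch below. The placeholder model and the total epoch count are illustrative assumptions, and only the schedule-related calls are shown.

```python
import torch

# Adam with weight decay 0.05, initial learning rate 0.0015, decayed tenfold
# after the 360th and 560th epochs via MultiStepLR, as stated in the text.
model = torch.nn.Linear(256, 6)   # placeholder, not the real network
optimizer = torch.optim.Adam(model.parameters(), lr=0.0015, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[360, 560], gamma=0.1)

num_epochs = 600                  # total epoch count is illustrative, not from the paper
for epoch in range(num_epochs):
    # ... iterate over batches: forward pass, focal loss, loss.backward(), optimizer.step() ...
    scheduler.step()              # advance the learning-rate schedule once per epoch
```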

4.3. Baselines and Evaluation Metrics

To evaluate the effectiveness of the proposed model, several SOTA models are selected for comparison. Given that point cloud semantic segmentation methods have matured considerably and are well-suited for application to urban meshes [19,48], we selected several representative techniques as baselines. Based on the benchmark settings outlined in Reference [19], the following models are selected as baseline models: PointNet++ [49], RF-MRF [18], SUM-RF [19], PTV2 [50], PSSNet [20], UrbanSegNet [44], and Point Transformer V3 (PTV3) [51]. Among the compared baseline models, PointNet++, PTV2, and PTV3 all use point cloud data and the face normals of the mesh as input, while RF-MRF, SUM-RF, PSSNet, UrbanSegNet, and the proposed model process both urban mesh and point cloud data. The point cloud data are randomly sampled from the urban mesh datasets using the Poisson disk sampling method [47] with a minimum interval of 0.4 m. MambaMeshSeg-Net is a multi-modal approach that utilizes both point cloud data and mesh geometry information.
  • PointNet++ [49] is a pioneering point cloud semantic segmentation method and is used here as a multi-modal baseline. It learns hierarchical features over increasing scales of context and uses PointNet as the local feature learner.
  • RF-MRF [18] is a superfacet-based method that includes region growing and MRF-based classification models. RF-MRF leverages both geometric and radiometric features.
  • SUM-RF [19] is a superfacet-based model containing region growing and random forest classification models.
  • PTV2 [50] is a multi-modal method that integrates Grouped Vector Attention, a Position Encoding Multiplier, and Partition-based Pooling, enabling efficient and comprehensive data understanding.
  • PSSNet [20] is the first deep-learning model for urban meshes with a two-step pipeline and is a superfacet-based model. It uses planarity-sensible features to over-segment urban meshes and adopts a GNN-based model to encode local and photometric features.
  • UrbanSegNet [44] is an end-to-end multi-modal model that incorporates diffusion perceptron blocks and a vertex spatial attention mechanism.
  • PTV3 [51] is a multi-modal model that offers a broader receptive field, improved performance, and a faster processing speed compared to PTV2.
We apply several evaluation metrics to compare performance with the baselines: mean intersection over union (mIoU), overall accuracy (OA), and mean F1 score (mF1).
$\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i + FN_i}$
$\mathrm{OA} = \frac{TP + TN}{TP + FP + TN + FN}$
$\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i}$
$\mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i}$
$\mathrm{mF1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}$
In the above equations, TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. In addition, i denotes a category, and the total number of categories is denoted by N.
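For clarity, the sketch below computes these metrics from a class confusion matrix. The helper name and the NumPy-based formulation are illustrative, not the evaluation code used in the paper.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    # conf[i, j] counts facets of true class i predicted as class j.
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp          # predicted as class i but belonging elsewhere
    fn = conf.sum(axis=1) - tp          # belonging to class i but predicted elsewhere
    iou = tp / np.maximum(tp + fp + fn, 1)
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-8)
    oa = tp.sum() / conf.sum()
    return {"mIoU": iou.mean(), "OA": oa, "mF1": f1.mean()}

conf = np.array([[50, 2, 1],
                 [3, 40, 4],
                 [0, 5, 45]])
print(segmentation_metrics(conf))
```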

4.4. Experimental Result Analysis

The performance of MambaMeshSeg-Net is evaluated on the SUM [19] and seMet [44] datasets based on both quantitative results and qualitative visualizations for in-depth analysis. This evaluation offers a comprehensive assessment of the proposed method.

4.4.1. Evaluation and Analysis on SUM

As shown in Table 1, our method outperforms existing SOTA models across multiple metrics, demonstrating the enhanced semantic segmentation performance of our method. While PointNet++ adapts to data at various scales through hierarchical adjustments, MambaMeshSeg-Net outperforms PointNet++ across all metrics. This performance difference can be attributed to our method's ability to capture spatial long-range dependencies across facets of urban meshes through a hybrid scanning strategy. Furthermore, our method preserves the original features effectively. RF-MRF, SUM-RF, and PSSNet are superfacet-based, two-step methods that limit global optimization, thereby reducing segmentation accuracy. However, these superfacet-based methods significantly reduce computational complexity during the super-facet classification stage. RF-MRF and SUM-RF rely solely on local spatial features, which are limited in their ability to process scenes with complex global geometric relationships, as they fail to fully capture global context information, thereby affecting the generalization of semantic segmentation. In contrast, our proposed method emphasizes the fusion of global features and explores spatial features across various levels. MambaMeshSeg-Net surpasses PSSNet by 4.82% in the mIoU metric, highlighting the advantages of our method, which captures a broader range of local features. To address Mamba's limitations in capturing spatial features of 3D data, MambaMeshSeg-Net introduces a Spatial Feature Extractor to enrich spatial features, improving the model's stability and accuracy. PTV2 and PTV3 are both Transformer-based models. Compared to PTV2, PTV3 introduces efficient point sampling and feature extraction methods, along with a more refined attention mechanism, resulting in enhanced semantic segmentation performance. In contrast with PTV2 and PTV3, our model is based on the Mamba framework and effectively handles long sequences. By incorporating the HiPPO matrix [52] and the U-shaped network architecture, our model preserves earlier information while integrating recent data, avoiding quadratic complexity. Consequently, our model can more efficiently and accurately capture the spatial representations of urban meshes, which improves its generalization capability. UrbanSegNet employs diffusion-aware blocks and vertex space attention mechanisms, but it performs worse than our model in both the mIoU and mF1 metrics. This is because our MambaMesh module captures the spatial representations of urban meshes more effectively, enhancing the model's performance. However, our method performs worse than UrbanSegNet for the boat data, as the spatial features of a boat in the water are susceptible to distortion, which poses a challenge for MambaMeshSeg-Net in accurately capturing these distorted spatial features.
As illustrated in Figure 3, our method outperforms all baseline models on the SUM dataset. Specifically, the results indicate that our model achieves superior semantic segmentation performance across categories, including buildings, high vegetation, vehicles, and water. This improvement is primarily due to the MambaMesh module in MambaMeshSeg-Net, which effectively captures spatial representations and models long-range dependencies within urban meshes. Furthermore, the hybrid scanning strategy adaptively integrates multi-scale features from urban meshes, improving the model's semantic segmentation accuracy. The proposed method also demonstrates enhanced performance in segmenting vehicles and terrain, owing to its improved ability to preserve geometric and spectral features. Notably, the proposed approach outperforms competing models in the water category, which can be attributed to the FEF mechanism's ability to effectively integrate low-level geometric features with high-level semantic representations in a hierarchical manner, thereby enhancing the model's ability to distinguish complex spatial patterns.

4.4.2. Evaluation and Analysis on seMet Dataset

Table 2 illustrates that our method outperforms existing SOTA models across multiple metrics, highlighting its capacity to learn and precisely capture the characteristics of urban meshes. PointNet++ achieves an IoU of 0 for the high-vegetation and vehicle categories, indicating its inability to distinguish between terrain and vehicles or buildings and high vegetation. Consequently, its performance on both mIoU and mF1 metrics is significantly limited. SUM-RF outperforms RF-MRF across all metrics because it utilizes a region-growing algorithm, which enhances the over-segmentation process. MambaMeshSeg-Net surpasses SUM-RF and RF-MRF in terms of the mIoU, OA, and mF1 metrics because the MambaMesh module effectively captures the spatial representations of urban meshes. Additionally, our FEF module component effectively integrates raw features with high-level abstract features. Our model exceeds other SOTA models by at least 1.7% and 2.2% in OA and mIoU metrics, respectively, highlighting its superior global segmentation performance and detail recognition capabilities. MambaMeshSeg-Net achieves IoU improvements of 6.47%, 6.0%, 6.05%, and 5.14% for terrain, high vegetation, buildings, and vehicles, respectively, compared to SPT and PSSNet. Although these methods utilize local and global spatial features, our approach more effectively retains the original long-range feature representations and leverages a U-shaped architecture, thereby capturing the spatial details of urban meshes and enhancing semantic segmentation performance. Both PTV3 and PTV2 are transformer-based models, but PTV3 incorporates more sophisticated feature extraction techniques and a more refined attention mechanism, resulting in improved performance across all metrics and a 10% improvement in OA over PTV2. MambaMeshSeg-Net, built upon the Mamba framework, is well-suited for handling long-sequence data and extracting spatial features. Compared to UrbanSegNet, our proposed method not only effectively preserves the original features but also leverages the MambaMesh module, thereby enhancing the model’s semantic segmentation performance.
Figure 4 presents qualitative comparison results for our proposed model and the baseline architectures. The superior prediction accuracy of our model on the seMet dataset, compared to other SOTA methods, can be attributed to the integration of a Mamba architecture with a triple-scan mechanism and a U-shaped network designed to preserve structural details. MambaMeshSeg-Net demonstrates superior performance to SUM-RF and RF-MRF for ground and vehicle segmentation, as our method utilizes a more comprehensive abstract spatial representation. Furthermore, our Feature Evaluation Fusion mechanism efficiently integrates raw features with high-level semantic abstractions. The proposed model surpasses PointNet++ and PTV2 across diverse categories, demonstrating its robust global segmentation and fine-grained detail recognition capabilities. MambaMeshSeg-Net achieves SOTA semantic segmentation results for terrain, building, and vehicle classes compared to UrbanSegNet, PSSNet, and PTV3. While existing methods exploit local and global spatial features, our framework preserves intricate geometric structures in urban meshes through a U-shaped network and retains raw feature fidelity, resulting in significantly improved semantic segmentation accuracy. PTV3 and PTV2 are both Transformer-based architectures; however, PTV3 incorporates optimized feature extraction and a refined multi-head attention mechanism, leading PTV3 to achieve superior visual segmentation performance across all evaluated categories. MambaMeshSeg-Net, built upon the Mamba architecture, excels in processing long-sequence data and extracting discriminative spatial features. In contrast to UrbanSegNet, our framework not only preserves raw feature fidelity but also extracts enhanced spatial features via the MambaMesh module, further advancing segmentation performance.

4.4.3. Efficiency Analysis

As shown in Table 3, whether on the SUM dataset or the seMet dataset, our proposed model achieves the shortest pre-processing and inference times, exhibiting both high segmentation accuracy and fast operation, thereby highlighting its practicality. The pre-processing time for RF-MRF, SUM-RF, PSSNet, and UrbanSegNet is considerable due to the significant computational demands during pre-processing, including feature sampling, gradient calculation across various directions, and color sampling, among others. Moreover, the inference time for PSSNet and UrbanSegNet is also notably high due to the need for global feature computation and local feature extraction in both methods. The pre-processing time for PTV2 and PTV3 is identical, as both models utilize the same input data. However, the inference time for PTV3 is considerably shorter than that of PTV2, enhancing model efficiency due to PTV3’s incorporation of a novel approach and a more sophisticated attention mechanism. On the SUM dataset, MambaMeshSeg-Net reduces inference time by at least 1.7 s relative to other SOTA models, as our model is based on the Mamba framework and employs a selective approach within the State Space Model. This not only accelerates inference but also maintains linear scalability concerning sequence length. Furthermore, the designed modules are both simple and effective, resulting in a minimal increase in computational complexity. On the seMet dataset, MambaMeshSeg-Net reduces inference time by at least 3.25 s relative to other SOTA models, further underscoring the speed advantage of our model and highlighting its practicality.

4.5. Ablation Study

4.5.1. Analysis of Hybrid Scanning Strategy

The details of the hybrid scanning strategy are shown in the right panel of Figure 4. To validate the effectiveness of our proposed adaptive hybrid scanning strategy, we analyze specific hyperparameters. Specifically, the parameter N in the strided SSM is set to 5, 10, 15, and 20. In addition, we conduct experiments with uni-directional (forward and backward) and bi-directional scanning mechanisms to further assess the effectiveness of the adaptive tri-scanning mechanism. Subsequently, the experiments are conducted using the SUM and seMet datasets, with the results presented in Table 4. In Table 4, forward and backward represent MambaMesh modules based on the uni-directional scanning mechanism, while bi-directions represents MambaMesh modules that perform both forward and backward scans. All other scanning methods adopt the hybrid scanning strategy. Hybrid (N = 5), Hybrid (N = 10), Hybrid (N = 15), and Hybrid (N = 20) indicate the hybrid scanning strategy with a stride size of 5, 10, 15, and 20, respectively.
The experimental analysis presented in Table 4 and Figure 5 demonstrates the notable advantages of our proposed dynamically learnable hybrid scanning strategy in spatial feature extraction. In both the SUM and seMet datasets, uni-directional scanning (i.e., forward or backward) demonstrates relatively stable performance regarding mIoU, OA, and mF1; however, its overall performance remains comparatively limited. By simultaneously leveraging both forward and backward information, the bi-directional scanning approach attains enhanced accuracy and robustness in recognition and classification. When the hybrid scanning strategy employing a fixed stride size (e.g., 5, 10, 15, or 20) is implemented, the model exhibits further improvements across various datasets and evaluation metrics; however, the optimal stride may vary according to data distribution and task requirements. Conversely, the adaptive hybrid scanning strategy typically attains the highest mIoU and OA in most scenarios. In comparison to fixed-stride scanning strategies, this mechanism effectively integrates multi-scale spatial contextual information by adaptively adjusting scanning parameters, thereby improving the model’s capacity to represent and generalize urban meshes. Comparative experiments on the SUM dataset show that the dynamic hybrid scanning strategy achieves notable improvements of 1.57% in mIoU and 3.84% in mF1, confirming its superiority over fixed-stride scanning methods. The advantages are further substantiated through evaluations on the seMet dataset, which consistently yields performance gains across all evaluation metrics. These improvements result from the dynamic scanning mechanism’s dual optimization effects: first, its learnable parametric design facilitates adaptive feature extraction from heterogeneous urban meshes’ spatial structures; second, the multi-scale feature fusion mechanism facilitates the seamless integration of local and global representations. Experimental validation confirms that this innovative architecture effectively mitigates the feature extraction limitations that are inherent in fixed scanning stride, paving the way for an advanced technical framework for 3D semantic segmentation in complex urban environments.

4.5.2. Octree Depth Analysis

To investigate the impact of different octree depths on model performance, five octree depths, 8, 7, 6, 5, and 4, are selected. As expected, larger data volumes associated with greater depths result in longer processing times. The octree depth has a substantial impact on the semantic segmentation accuracy of the proposed model. Experiments were conducted on the SUM and seMet datasets, and the results are presented in Table 5. For the SUM dataset, the model demonstrated consistent improvements in performance with increasing depth, reaching a peak at depth 6 before declining. In contrast, optimal performance was achieved at depths 6 or 7 for the seMet dataset. Therefore, the depth must be carefully selected to avoid being excessively large or insufficient. A greater depth enhances the model’s ability to capture semantic details. However, for urban meshes, both fine-grained (local) and global spatial features must be extracted. Consequently, an appropriate octree depth is crucial for effectively balancing local and global feature extraction. In our work, a depth of 6 was found to be optimal.

4.5.3. Analysis of Each Component

To validate the effectiveness of each module in the proposed model, such as the hybrid scanning strategy (HSS), Feature Evaluation Fusion (FEF), and Spatial Feature Extractor (SFE), we conducted a series of ablation experiments, as detailed below:
  • In our model, the hybrid scanning strategy is utilized to capture the spatial features of urban meshes. To verify its effectiveness, we removed the hybrid scanning strategy and retained only the dual-scanning mechanism, followed by conducting ablation experiments.
  • To validate the effectiveness of the Feature Evaluation Fusion module, we removed this module and performed ablation experiments.
  • To demonstrate that the Spatial Feature Extractor can effectively compensate for the limitations of the Mamba model in extracting 3D spatial features, we removed the Spatial Feature Extractor and conducted ablation experiments.
As shown in Table 6, the implementation of FEF enhances both mIoU and OA metrics, which arises from the effective integration of low-level features with high-level abstract representations. The integration of SFE significantly enhances mIoU and OA across both SUM and seMet datasets through the precise extraction of urban mesh spatial features. The HSS enables the acquisition of multi-spatial features through varied scanning intervals, facilitating improved spatial representation learning in urban meshes, thereby improving segmentation accuracy. Quantitative analysis shows baseline improvements of 0.19% mIoU and 1.19% OA on the SUM dataset, with more substantial enhancements of 5.86% mIoU and 2.73% OA observed on the seMet dataset. The combined framework significantly enhances model efficacy through three principal mechanisms: (1) HSS captures multi-spatial patterns, (2) SFE optimizes spatial feature extraction, and (3) FEF enables hierarchical feature fusion. This integrated approach improves the model’s spatial representation learning for urban mesh analysis, achieving superior segmentation performance across heterogeneous datasets.

4.6. Limitations

Although our model introduces the FEF and MambaMesh modules to enhance local feature representation, it still struggles with the long-tail problem, leading to suboptimal segmentation performance for classes with limited training samples. In some cases, the segmentation results exhibit isolated misclassifications within otherwise correctly predicted regions, suggesting that there remains room for improvement in local feature discrimination. Moreover, the robustness of the proposed model is relatively limited: the presence of erroneous labels in the training data can lead to a significant degradation in segmentation performance. These cases indicate that further work is needed to improve both the local feature representation and the robustness of the model against label noise.

5. Conclusions

This paper proposes MambaMeshSeg-Net, an efficient semantic segmentation model for urban meshes leveraging the SSM. The proposed model incorporates two key novel components: a MambaMesh module and a feature evaluation fusion module. The MambaMesh module integrates a hybrid scanning strategy in conjunction with a spatial feature extractor, simultaneously capturing local geometric details and global semantic correlations among facets. The proposed model can optimize multi-level feature integration via an adaptive weight allocation strategy to effectively balance geometric precision and semantic abstraction. Our method demonstrates superior performance compared to existing approaches in terms of both pre-processing and inference time, thereby offering a viable solution for the rapid semantic segmentation of large-scale urban meshes.

Author Contributions

Conceptualization, W.Z. and H.C.; methodology, W.Z.; software, W.Z.; validation, H.C. and J.W.; investigation, H.C.; resources, W.Z.; data curation, H.C.; writing—original draft preparation, W.Z. and H.C.; writing—review and editing, W.Z., H.C., and J.W.; funding acquisition, H.C. and J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grants No. 42471403, No. 42101435, No. 42101432, and No. 62106276.

Data Availability Statement

The original SUM dataset presented in the study is openly available at https://3d.bk.tudelft.nl/projects/meshannotation/ (accessed on 3 August 2021). The seMet dataset presented in this article is openly available at https://pan.baidu.com/s/1QjRoT3MKd-FfTS1DqYrsuQ?pwd=tbcv (accessed on 29 March 2025). In addition, the source code of MambaMeshSeg-Net can be found at https://github.com/ziwenjie/MambaMeshSeg-Net (accessed on 29 March 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shan, P.; Sun, W. Research on 3D urban landscape design and evaluation based on geographic information system. Environ. Earth Sci. 2021, 80, 597. [Google Scholar] [CrossRef]
  2. Boeing, G. Spatial information and the legibility of urban form: Big data in urban morphology. Int. J. Inf. Manag. 2021, 56, 102013. [Google Scholar] [CrossRef]
  3. Ahmad, N.; Khan, S.; Ehsan, M.; Rehman, F.U.; Al-Shuhail, A. Estimating the total volume of running water bodies using geographic information system (GIS): A case study of Peshawar Basin (Pakistan). Sustainability 2022, 14, 3754. [Google Scholar] [CrossRef]
  4. Mazaheri, H.; Goli, S.; Nourollah, A. A Survey of 3D Space Path-Planning Methods and Algorithms. ACM Comput. Surv. 2024, 57, 1–32. [Google Scholar] [CrossRef]
  5. Lin, L.; Liu, Y.; Hu, Y.; Yan, X.; Xie, K.; Huang, H. Capturing, reconstructing, and simulating: The urbanscene3d dataset. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 93–109. [Google Scholar]
  6. Zhao, Y.; Lu, B.; Alipour, M. Optimized structural inspection path planning for automated unmanned aerial systems. Autom. Constr. 2024, 168, 105764. [Google Scholar] [CrossRef]
  7. Zhu, L.; Shen, S.; Hu, L.; Hu, Z. Variational building modeling from urban MVS meshes. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 318–326. [Google Scholar]
  8. Li, M.; Nan, L. Feature-preserving 3D mesh simplification for urban buildings. ISPRS J. Photogramm. Remote Sens. 2021, 173, 135–150. [Google Scholar] [CrossRef]
  9. Fan, X.; Zhou, B.; Wang, H.H. Urban landscape ecological design and stereo vision based on 3D mesh simplification algorithm and artificial intelligence. Neural Process. Lett. 2021, 53, 2421–2437. [Google Scholar] [CrossRef]
  10. Yang, G.; Xue, F.; Zhang, Q.; Xie, K.; Fu, C.W.; Huang, H. UrbanBIS: A large-scale benchmark for fine-grained urban building instance segmentation. In Proceedings of the ACM SIGGRAPH 2023 Conference Proceedings, Los Angeles, CA, USA, 6–10 August 2023; pp. 1–11. [Google Scholar]
  11. Zhao, W.J.; Liu, E.X.; Poh, H.J.; Wang, B.; Gao, S.P.; Png, C.E.; Li, K.W.; Chong, S.H. 3D traffic noise mapping using unstructured surface mesh representation of buildings and roads. Appl. Acoust. 2017, 127, 297–304. [Google Scholar] [CrossRef]
  12. Wang, S.; Cao, J.; Philip, S.Y. Deep learning for spatio-temporal data mining: A survey. IEEE Trans. Knowl. Data Eng. 2020, 34, 3681–3700. [Google Scholar] [CrossRef]
  13. Lauwers, D. Functional road categorization: New concepts and challenges related to traffic safety, traffic managment and urban design: Reflections based on practices in Belgium confronted with some Eastern European cases. In Proceedings of the Transportation and Land Use Interaction, Bucharest, Romania, 23–25 October 2008; Politechnica Press: Rome, Italy, 2008; pp. 149–164. [Google Scholar]
  14. Galvão, G.; Vieira, M.; Louro, P.; Vieira, M.A.; Véstias, M.; Vieira, P. Visible Light Communication at Urban Intersections to Improve Traffic Signaling and Cooperative Trajectories. In Proceedings of the 2023 7th International Young Engineers Forum (YEF-ECE), Lisbon, Portugal, 7 July 2023; pp. 60–65. [Google Scholar]
  15. Herb, M.; Weiherer, T.; Navab, N.; Tombari, F. Lightweight semantic mesh mapping for autonomous vehicles. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi'an, China, 30 May–5 June 2021; pp. 6732–6738. [Google Scholar]
  16. Shen, S.; Kerofsky, L.; Kumar, V.R.; Yogamani, S. Neural Rendering based Urban Scene Reconstruction for Autonomous Driving. arXiv 2024, arXiv:2402.06826. [Google Scholar] [CrossRef]
  17. Lu, F.; Xu, Y.; Chen, G.; Li, H.; Lin, K.Y.; Jiang, C. Urban radiance field representation with deformable neural mesh primitives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2023; pp. 465–476. [Google Scholar]
  18. Rouhani, M.; Lafarge, F.; Alliez, P. Semantic segmentation of 3D textured meshes for urban scene analysis. ISPRS J. Photogramm. Remote Sens. 2017, 123, 124–139. [Google Scholar] [CrossRef]
  19. Gao, W.; Nan, L.; Boom, B.; Ledoux, H. SUM: A benchmark dataset of semantic urban meshes. ISPRS J. Photogramm. Remote Sens. 2021, 179, 108–120. [Google Scholar] [CrossRef]
  20. Weixiao, G.; Nan, L.; Boom, B.; Ledoux, H. PSSNet: Planarity-sensible semantic segmentation of large-scale urban meshes. ISPRS J. Photogramm. Remote Sens. 2023, 196, 32–44. [Google Scholar]
  21. Huang, J.; Zhang, H.; Yi, L.; Funkhouser, T.; Niessner, M.; Guibas, L.J. TextureNet: Consistent Local Parametrizations for Learning From High-Resolution Signals on Meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  22. Yang, Y.; Liu, S.; Pan, H.; Liu, Y.; Tong, X. PFCNN: Convolutional neural networks on 3D surfaces using parallel frames. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13578–13587. [Google Scholar]
  23. Lei, H.; Akhtar, N.; Shah, M.; Mian, A. Geometric feature learning for 3D meshes. arXiv 2021, arXiv:2112.01801. [Google Scholar]
  24. Tang, R.; Xia, M.; Yang, Y.; Zhang, C. A deep-learning model for semantic segmentation of meshes from UAV oblique images. Int. J. Remote Sens. 2022, 43, 4774–4792. [Google Scholar] [CrossRef]
  25. Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420. [Google Scholar]
  26. Lee, E.; Kwon, Y.; Kim, C.; Choi, W.; Sohn, H.G. Multi-source point cloud registration for urban areas using a coarse-to-fine approach. GIScience Remote Sens. 2024, 61, 2341557. [Google Scholar] [CrossRef]
  27. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  28. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  29. Witharana, C.; Ouimet, W.B.; Johnson, K.M. Using LiDAR and GEOBIA for automated extraction of eighteenth–late nineteenth century relict charcoal hearths in southern New England. GIScience Remote Sens. 2018, 55, 183–204. [Google Scholar] [CrossRef]
  30. Landrieu, L.; Simonovsky, M. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4558–4567. [Google Scholar]
  31. Huang, J.; Zhou, Y.; Niessner, M.; Shewchuk, J.R.; Guibas, L.J. Quadriflow: A scalable and robust method for quadrangulation. In Proceedings of the Computer Graphics Forum, Delft, Netherlands, 16–20 April 2018; Wiley Online Library: Hoboken, NJ, USA, 2018; Volume 37, pp. 147–160. [Google Scholar]
  32. Yue, L.; Xing, S.; Lu, Y.; Fu, T. Biomamba: A pre-trained biomedical language representation model leveraging mamba. arXiv 2024, arXiv:2408.02600. [Google Scholar]
  33. Chen, G.; Huang, Y.; Xu, J.; Pei, B.; Chen, Z.; Li, Z.; Wang, J.; Li, K.; Lu, T.; Wang, L. Video mamba suite: State space model as a versatile alternative for video understanding. arXiv 2024, arXiv:2403.09626. [Google Scholar]
  34. Liu, X.; Zhang, C.; Zhang, L. Vision mamba: A comprehensive survey and taxonomy. arXiv 2024, arXiv:2405.04404. [Google Scholar]
  35. Liu, J.; Yang, H.; Zhou, H.Y.; Yu, L.; Liang, Y.; Yu, Y.; Zhang, S.; Zheng, H.; Wang, S. Swin-UMamba†: Adapting Mamba-based vision foundation models for medical image segmentation. IEEE Trans. Med. Imaging 2024. early access. [Google Scholar] [CrossRef]
  36. Hatamizadeh, A.; Kautz, J. Mambavision: A hybrid mamba-transformer vision backbone. arXiv 2024, arXiv:2407.08083. [Google Scholar]
  37. Shi, Y.; Kissling, W.D. Performance, effectiveness and computational efficiency of powerline extraction methods for quantifying ecosystem structure from light detection and ranging. GIScience Remote Sens. 2023, 60, 2260637. [Google Scholar] [CrossRef]
  38. Ma, X.; Zhang, X.; Pun, M.O. RS3Mamba: Visual state space model for remote sensing image semantic segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6011405. [Google Scholar] [CrossRef]
  39. Ding, H.; Xia, B.; Liu, W.; Zhang, Z.; Zhang, J.; Wang, X.; Xu, S. A novel mamba architecture with a semantic transformer for efficient real-time remote sensing semantic segmentation. Remote Sens. 2024, 16, 2620. [Google Scholar] [CrossRef]
  40. Zhang, J.; Chen, R.; Liu, F.; Liu, H.; Zheng, B.; Hu, C. DC-Mamba: A Novel Network for Enhanced Remote Sensing Change Detection in Difficult Cases. Remote Sens. 2024, 16, 4186. [Google Scholar] [CrossRef]
  41. Fu, C.; Li, G.; Song, R.; Gao, W.; Liu, S. Octattention: Octree-based large-scale contexts model for point cloud compression. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 24–27 February 2022; Volume 36, pp. 625–633. [Google Scholar]
  42. Cui, M.; Long, J.; Feng, M.; Li, B.; Kai, H. OctFormer: Efficient octree-based transformer for point cloud compression with local enhancement. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 470–478. [Google Scholar]
  43. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  44. Zi, W.; Li, J.; Chen, H.; Chen, L.; Du, C. UrbanSegNet: An urban meshes semantic segmentation network using diffusion perceptron and vertex spatial attention. Int. J. Appl. Earth Obs. Geoinf. 2024, 129, 103841. [Google Scholar] [CrossRef]
  45. Kingma, D.P. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  46. Wei, J.; Zhang, X.; Zhuo, Z.; Ji, Z.; Wei, Z.; Li, J.; Li, Q. Leader population learning rate schedule. Inf. Sci. 2023, 623, 455–468. [Google Scholar] [CrossRef]
  47. Ebeida, M.S.; Davidson, A.A.; Patney, A.; Knupp, P.M.; Mitchell, S.A.; Owens, J.D. Efficient maximal Poisson-disk sampling. ACM Trans. Graph. (TOG) 2011, 30, 1–12. [Google Scholar] [CrossRef]
  48. Adam, J.M.; Liu, W.; Zang, Y.; Afzal, M.K.; Bello, S.A.; Muhammad, A.U.; Wang, C.; Li, J. Deep learning-based semantic segmentation of urban-scale 3D meshes in remote sensing: A survey. Int. J. Appl. Earth Obs. Geoinf. 2023, 121, 103365. [Google Scholar] [CrossRef]
  49. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  50. Wu, X.; Lao, Y.; Jiang, L.; Liu, X.; Zhao, H. Point transformer v2: Grouped vector attention and partition-based pooling. Adv. Neural Inf. Process. Syst. 2022, 35, 33330–33342. [Google Scholar]
  51. Wu, X.; Jiang, L.; Wang, P.S.; Liu, Z.; Liu, X.; Qiao, Y.; Ouyang, W.; He, T.; Zhao, H. Point Transformer V3: Simpler Faster Stronger. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4840–4851. [Google Scholar]
  52. Gu, A.; Johnson, I.; Timalsina, A.; Rudra, A.; Ré, C. How to train your hippo: State space models with generalized orthogonal basis projections. arXiv 2022, arXiv:2206.12037. [Google Scholar]
Figure 1. Overview of our proposed model. MambaMesh 256, MambaMesh 128, and MambaMesh 32 represent the input and output feature dimensions of 256, 128, and 32, respectively. FEF denotes feature evaluation fusion modules. ⨁ indicates additive fusion. Linear represents a linear layer.
Figure 2. The framework of the MambaMesh module. RMSNorm represents Root Mean Square Normalization. DWConv refers to Depthwise Convolution. σ denotes the sigmoid activation function. ⨁ indicates additive fusion. Linear represents a linear layer.
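For readers who want a concrete picture of the block described in Figure 2, the following is a minimal PyTorch-style sketch assembled only from the components named in the caption (RMSNorm, DWConv, a sigmoid gate, linear projections, and additive fusion). The ordering of operations and the internals of the selective-scan (SSM) core are assumptions, so the SSM is left as a placeholder; this is an illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root Mean Square normalization, as named in Figure 2."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class MambaMeshBlockSketch(nn.Module):
    """Hypothetical MambaMesh block: component names follow Figure 2, but the
    operation order and the SSM core are assumptions."""
    def __init__(self, dim: int, conv_kernel: int = 4):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.in_proj = nn.Linear(dim, 2 * dim)            # main path + gate
        self.dwconv = nn.Conv1d(dim, dim, conv_kernel,
                                padding=conv_kernel - 1, groups=dim)
        self.ssm = nn.Identity()                          # placeholder for the selective scan
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                                 # x: (B, L, C), L = serialized facets
        residual = x
        x = self.norm(x)
        main, gate = self.in_proj(x).chunk(2, dim=-1)
        main = self.dwconv(main.transpose(1, 2))[..., :x.shape[1]].transpose(1, 2)
        main = self.ssm(main)
        out = self.out_proj(main * torch.sigmoid(gate))   # sigmoid gating (σ)
        return out + residual                             # additive fusion (⨁)
```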
Figure 3. Visualization of error prediction maps on the SUM dataset. Green areas indicate correct predictions, and red areas indicate prediction errors.
Figure 4. Visualization of error prediction maps on the seMet dataset. Green areas indicate correct predictions, and red areas indicate prediction errors.
Figure 5. Performance of the hybrid scanning strategy with different interval values N on SUM and seMet.
Table 1. Comparison of SOTA semantic segmentation models on the SUM dataset. Terra. denotes terrain; H-veg. denotes high vegetation; Build. denotes buildings; Vehi. denotes vehicles. Columns report per-class IoU (%), mean IoU (mIoU, %), overall accuracy (OA, %), and mean F1 score (mF1, %). (Bold: best.)

| Method | Terra. | H-veg. | Build. | Water | Vehi. | Boat | mIoU | OA | mF1 |
|---|---|---|---|---|---|---|---|---|---|
| PointNet++ [49] | 52.15 | 55.63 | 75.45 | 67.40 | 0 | 1.04 | 41.90 | 78.41 | 53.62 |
| RF-MRF [18] | 26.25 | 72.13 | 68.93 | 6.29 | 10.72 | 0 | 30.71 | 72.19 | 23.10 |
| SUM-RF [19] | 76.68 | 90.31 | 92.22 | 35.69 | 43.69 | 0.91 | 57.95 | 90.35 | 67.96 |
| PTV2 [50] | 82.84 | 89.62 | 93.59 | 87.42 | 19.86 | 18.75 | 65.35 | 93.10 | 70.80 |
| PSSNet [20] | 86.54 | 95.37 | 93.53 | 82.23 | 59.63 | 28.02 | 74.22 | 94.69 | 82.30 |
| UrbanSegNet [44] | 75.72 | 90.65 | 94.78 | 86.20 | 62.03 | **63.21** | 78.75 | 93.86 | 86.89 |
| PTV3 [51] | 87.32 | 92.25 | 94.21 | 88.48 | 49.38 | 32.42 | 74.01 | 94.69 | 79.36 |
| MambaMeshSeg-Net | **89.27** | **94.31** | **95.21** | **88.83** | **65.10** | 41.54 | **79.04** | **95.61** | **86.97** |
Table 2. Comparison of SOTA semantic segmentation models on the seMet dataset. Terra. denotes terrain; H-veg. denotes high vegetation; Build. denotes buildings; Vehi. denotes vehicles. Columns report per-class IoU (%), mean IoU (mIoU, %), overall accuracy (OA, %), and mean F1 score (mF1, %). (Bold: best.)

| Method | Terra. | H-veg. | Build. | Vehi. | mIoU | OA | mF1 |
|---|---|---|---|---|---|---|---|
| PointNet++ [49] | 59.97 | 0 | 91.80 | 0 | 37.94 | 91.12 | 44.22 |
| RF-MRF [18] | 23.02 | 51.08 | 85.92 | 4.86 | 41.21 | 85.82 | 51.68 |
| SUM-RF [19] | 85.99 | 65.06 | 95.86 | 47.44 | 73.59 | 95.85 | 83.38 |
| PTV2 [50] | 53.88 | 34.94 | 84.71 | 14.44 | 46.99 | 85.15 | 63.77 |
| PSSNet [20] | 79.13 | 71.81 | 95.53 | 25.06 | 67.88 | 95.78 | 82.30 |
| UrbanSegNet [44] | 73.48 | 81.82 | 96.89 | **47.50** | 74.93 | 95.86 | 84.37 |
| PTV3 [51] | 84.20 | 45.25 | 95.46 | 41.46 | 66.59 | 95.38 | 70.51 |
| MambaMeshSeg-Net | **85.63** | **77.84** | **98.10** | 46.9 | **77.10** | **97.60** | **84.42** |
Table 3. Comparison of pre-process time and inference time for SOTA semantic segmentation models. (Bold: best.)

| Dataset | Model | Pre-Process Time (s) | Inference Time (s) |
|---|---|---|---|
| SUM | PointNet++ [49] | 81.17 | 4.58 |
| | RF-MRF [18] | 44.83 | 6.17 |
| | SUM-RF [19] | 30.91 | 3.42 |
| | PTV2 [50] | 0.77 | 83.75 |
| | PSSNet [20] | 520.21 | 102.96 |
| | UrbanSegNet [44] | 231.63 | 36.57 |
| | PTV3 [51] | 0.77 | 23.17 |
| | MambaMeshSeg-Net | **0.73** | **1.72** |
| seMet | PointNet++ [49] | 44.25 | 5.75 |
| | RF-MRF [18] | 184.25 | 20.25 |
| | SUM-RF [19] | 78.25 | 9.00 |
| | PTV2 [50] | 1.59 | 138.25 |
| | PSSNet [20] | 1083.25 | 234.03 |
| | UrbanSegNet [44] | 326.40 | 53.72 |
| | PTV3 [51] | 1.59 | 33.00 |
| | MambaMeshSeg-Net | **1.43** | **2.50** |
Table 4. Performance of the hybrid scanning strategy with different interval values N on SUM and seMet. Metrics are mean IoU (mIoU, %), overall accuracy (OA, %), and mean F1 score (mF1, %). (Bold: best.)

| Dataset | Scanning Mode | mIoU | OA | mF1 |
|---|---|---|---|---|
| SUM | forward | 68.43 | 91.63 | 73.62 |
| | backward | 70.83 | 91.86 | 74.93 |
| | bi-directional | 78.82 | 93.28 | 79.36 |
| | Hybrid N = 5 | 77.47 | 94.82 | 83.13 |
| | Hybrid N = 10 | 74.05 | 94.62 | 81.80 |
| | Hybrid N = 15 | 76.87 | 95.01 | 83.06 |
| | Hybrid N = 20 | 75.64 | 93.86 | 81.24 |
| | adaptive | **79.04** | **95.61** | **86.97** |
| seMet | forward | 71.58 | 93.68 | 74.59 |
| | backward | 71.34 | 93.24 | 73.88 |
| | bi-directional | 71.24 | 94.87 | 75.32 |
| | Hybrid N = 5 | 74.09 | 94.34 | 75.70 |
| | Hybrid N = 10 | 75.34 | 95.48 | 82.88 |
| | Hybrid N = 15 | 75.86 | 95.14 | 81.46 |
| | Hybrid N = 20 | 76.30 | 96.28 | 82.68 |
| | adaptive | **77.10** | **97.60** | **84.42** |
The adaptive mode achieving the best results on all three metrics for both datasets demonstrates that the hybrid scanning strategy of our proposed model captures the spatial features of mesh data adaptively rather than at fixed intervals.
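As a rough illustration of the fixed-interval variants compared in Table 4, the sketch below builds a hybrid ordering by reversing the traversal direction every N facets along the serialized sequence. The function name and the exact switching rule are assumptions for illustration only; the adaptive variant instead lets the model determine where the direction changes rather than using a fixed N.

```python
import numpy as np


def fixed_interval_hybrid_order(num_facets: int, interval: int) -> np.ndarray:
    """Illustrative (assumed) fixed-interval hybrid scan: the serialized facet
    sequence is split into chunks of `interval` facets and every other chunk
    is traversed in the reverse direction."""
    order = np.arange(num_facets)
    chunks = [order[i:i + interval] for i in range(0, num_facets, interval)]
    chunks = [c[::-1] if k % 2 == 1 else c for k, c in enumerate(chunks)]
    return np.concatenate(chunks)


# Example with 12 serialized facets and N = 5:
# forward -> [0 1 2 3 4 5 6 7 8 9 10 11]
# hybrid  -> [0 1 2 3 4 9 8 7 6 5 10 11]
print(fixed_interval_hybrid_order(12, 5))
```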
Table 5. Performance with different octree depths on SUM and seMet. Metrics are mean IoU (mIoU, %), overall accuracy (OA, %), and mean F1 score (mF1, %). (Bold: best.)

| Dataset | Octree Depth | mIoU | OA | mF1 |
|---|---|---|---|---|
| SUM | 8 | 73.42 | 94.38 | 81.28 |
| | 7 | 75.73 | 95.14 | 83.29 |
| | 6 | **79.04** | **95.61** | **86.97** |
| | 5 | 56.41 | 89.73 | 66.50 |
| | 4 | 58.94 | 91.23 | 67.78 |
| seMet | 8 | 64.43 | 93.33 | 76.30 |
| | 7 | **79.93** | 97.10 | **87.71** |
| | 6 | 77.10 | **97.60** | 84.42 |
| | 5 | 75.14 | 96.23 | 83.76 |
| | 4 | 68.53 | 96.02 | 76.14 |
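Table 5 varies the depth of the octree used to serialize facets before scanning. To make the role of depth concrete, the following is a small, self-contained sketch of a depth-dependent Z-order (Morton) key for facet centroids normalized to [0, 1); the function and the centroid values are illustrative assumptions rather than the paper's code, but they show why a deeper octree yields a finer-grained serialization order.

```python
def morton_key(x: float, y: float, z: float, depth: int) -> int:
    """Interleave the top `depth` bits of each normalized coordinate into a
    single Z-order key; a larger depth distinguishes finer spatial cells."""
    xi = min(int(x * (1 << depth)), (1 << depth) - 1)
    yi = min(int(y * (1 << depth)), (1 << depth) - 1)
    zi = min(int(z * (1 << depth)), (1 << depth) - 1)
    key = 0
    for level in range(depth - 1, -1, -1):
        key = (key << 3) | (((xi >> level) & 1) << 2) \
                         | (((yi >> level) & 1) << 1) \
                         | ((zi >> level) & 1)
    return key


# Hypothetical normalized facet centroids; sorting by the key gives the
# depth-dependent serialization order used before scanning.
centroids = [(0.12, 0.80, 0.33), (0.13, 0.81, 0.35), (0.90, 0.05, 0.60)]
order = sorted(range(len(centroids)), key=lambda i: morton_key(*centroids[i], 6))
print(order)
```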
Table 6. Performance of the ablation study on SUM and seMet. Metrics are mean IoU (mIoU, %), overall accuracy (OA, %), and mean F1 score (mF1, %). (Bold: best.)

| Dataset | HSS | SFE | FEF | mIoU | OA | mF1 |
|---|---|---|---|---|---|---|
| SUM | | | | 74.83 | 92.37 | 77.82 |
| | | | | 78.80 | 94.42 | 80.61 |
| | | | | 78.82 | 93.28 | 79.36 |
| | | | | **79.01** | **95.61** | **86.97** |
| seMet | | | | 70.04 | 94.83 | 75.19 |
| | | | | 68.17 | 94.46 | 73.24 |
| | | | | 71.24 | 94.87 | 75.32 |
| | | | | **77.10** | **97.60** | **84.42** |