Article

Spatiotemporal Contextual 3D Semantic Segmentation for Intelligent Outdoor Mining

1 School of Computer Science and Technology, North University of China, Taiyuan 030051, China
2 Shanxi Key Laboratory of Machine Vision and Virtual Reality, Taiyuan 030051, China
3 Shanxi TZCO Intelligent Mining Equipment Technology Co., Ltd., Taiyuan 030032, China
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(7), 383; https://doi.org/10.3390/a18070383
Submission received: 31 May 2025 / Revised: 21 June 2025 / Accepted: 23 June 2025 / Published: 24 June 2025

Abstract

Three-dimensional semantic segmentation plays a crucial role in accurately identifying terrain features and objects by effectively extracting 3D spatial information from the environment. However, the inherent sparsity of point clouds and unclear terrain boundaries in outdoor mining environments significantly complicate the recognition process. To address these challenges, we propose a novel 3D semantic segmentation network that incorporates spatiotemporal feature aggregation. Specifically, we introduce the Gated Spatiotemporal Clue Encoder, which extracts spatiotemporal context from historical multi-frame point cloud data and combines it with the current scan to enhance feature representation. Additionally, the Spatiotemporal Feature State Space Module is proposed to efficiently model long-term spatiotemporal features while minimizing computational and memory overhead. Experimental results show that the proposed method outperforms the baseline model, achieving a 2.1% improvement in mIoU on the self-constructed TZMD_NUC outdoor mining dataset and an average mIoU improvement of 1.9% across distance ranges on the public SemanticKITTI dataset. At the same time, the method improves computational efficiency, making it well suited to real-time applications. These results validate the effectiveness of the proposed approach and offer a promising solution for 3D semantic segmentation in complex, real-world mining environments, where both accuracy and computational efficiency are critical.

1. Introduction

The intelligent construction of mining areas [1] has developed rapidly in recent years, with artificial intelligence technologies such as autonomous driving being integrated into mining operations [2]. These innovations have shown tremendous potential in ensuring safety, boosting production efficiency, and improving operational stability, gradually positioning intelligent mining as a key driver of technological progress in the mining industry [3]. For example, autonomous trucks and excavation robots have significantly reduced human labor costs and minimized safety risks in hazardous environments. In particular, the use of 3D point cloud data [4], which provides comprehensive spatial coordinate information, has become central to the intelligent construction of mining areas. By accurately capturing the geometric features of the terrain and mining scenes [5], point cloud data is vital for autonomous vehicles and robotic systems operating in mining environments, enabling them to navigate, detect obstacles, and plan optimal paths [6].
Three-dimensional point cloud semantic segmentation plays a crucial role in this process by providing pixel-level analysis of the environment. It helps to identify critical information, such as terrain structure and object distribution, which is essential for autonomous systems to make real-time decisions, including path planning, obstacle avoidance, and operational optimization. Therefore, achieving accurate 3D semantic segmentation is pivotal for the successful implementation of intelligent mining systems.
The goal of 3D semantic segmentation is to assign a unique category label to each point in a point cloud, providing more accurate and detailed environmental information. As artificial intelligence continues to develop, numerous point cloud semantic segmentation algorithms have been proposed [7,8]. These methods can be broadly categorized into two types: projection-based and point-based methods. Projection-based approaches [9,10] project 3D point cloud data onto a 2D plane, then apply traditional algorithms to label the semantic categories of the projected points. In contrast, point-based methods [11,12] directly extract features from the 3D point cloud, allowing them to better preserve spatial geometric features and achieve superior performance in evaluation metrics. For instance, PointNet [13,14], a pioneering method in this domain, addressed the challenges of point cloud rotation and unordered data using spatial transformation networks and multi-layer perceptrons (MLPs). More recent advancements, such as SphereFormer [15], leverage optimized attention mechanisms to enhance segmentation accuracy, especially for distant points. Point-based methods, due to their ability to preserve more feature information, are the focus of this paper.
However, the inherent sparsity of point cloud data remains a significant challenge for accurate semantic segmentation [16], particularly in outdoor mining environments where the boundary features between ground and material piles are often ambiguous [2]. This sparsity can blur the boundaries of terrain features, making segmentation even more difficult. One promising approach to mitigate this issue is to aggregate temporal data, thereby leveraging long-term spatiotemporal features [17]. Recent advances have shown that integrating spatiotemporal AI architectures with resource-aware learning can significantly enhance recognition tasks under sparse or noisy sensor input conditions, particularly in structural health monitoring and environmental analytics [18,19]. Several studies [20,21,22] have explored the stacking of historical frame data to capture these spatiotemporal features, but these methods often come with substantial increases in memory and computational costs. Other approaches, such as the introduction of new loss functions [23] or knowledge distillation techniques [24], have also been explored, but they tend to focus on specific tasks and fail to fully address the spatiotemporal relationships between historical and current data.
In human recognition processes, object identification typically begins with a preliminary observation, followed by the integration of historical knowledge and current sensory input to form a more accurate understanding through spatiotemporal memory construction [25]. This process does not rely on a single type of information but on the dynamic integration of multiple sources. Inspired by this mechanism, this study proposes a novel spatiotemporal clue-gated encoder that efficiently aggregates historical and current point cloud data, thereby improving segmentation accuracy. The proposed encoder consists of a spatiotemporal clue-embedding layer that encodes serialized multi-frame point clouds to extract historical knowledge and spatiotemporal cues, while current frame data is processed using 3D convolutions to capture the latest features. A gating unit dynamically adjusts the relative importance of historical and current data, facilitating the aggregation of long-term spatiotemporal information and enhancing the model’s ability to represent spatiotemporal features.
Furthermore, a spatiotemporal feature select state-space module is designed to fully utilize the aggregated long-term spatiotemporal features. This module consists of an encoder and decoder based on the U-Net structure. Previous works have integrated Transformers into U-Net-based point cloud segmentation networks to enhance performance through a global attention mechanism [26,27]. SphereFormer [15] further optimizes attention mechanisms, improving segmentation accuracy by addressing issues of sparse point cloud information and expanding the receptive field. However, these solutions come with quadratic computational complexity, which limits their efficiency and scalability. Recently, state-space models such as Mamba [28] have achieved groundbreaking advancements by providing linear complexity with selective mechanisms and hardware optimizations, making them highly suitable for the modeling of long-term spatiotemporal information in an efficient manner.
Mamba has been successfully migrated from natural language processing to computer vision [29,30], achieving competitive results while reducing memory consumption. However, its potential for handling complex, unstructured 3D point cloud data remains underexplored [31,32]. To address this issue, this paper embeds Mamba blocks into the spatiotemporal feature select state-space module. The module’s encoder, consisting of Mamba blocks and 3D sparse convolutions, encodes the input features, while the decoder, consisting of Mamba blocks and 3D sparse inverse convolutions, decodes the output features. Skip connections between the encoder and decoder facilitate information exchange at the spatial level. This design reduces computational costs while maintaining high segmentation accuracy, enhancing the model’s suitability for practical deployment in mining operations. The contributions of this paper are summarized as follows:
(1)
A spatiotemporal clue-gated encoder is proposed, efficiently aggregating long-term spatiotemporal features through the spatiotemporal clue-embedding layer. The gating unit dynamically balances the contributions of historical and current data, improving model accuracy.
(2)
A spatiotemporal feature select state-space module is designed, combining Mamba’s global context modeling capability with its linear complexity advantage. This leads to both higher accuracy and lower computational costs.
(3)
The spatiotemporal feature aggregation network outperforms existing methods on our self-constructed TZMD_NUC dataset (TZ Group Co., Ltd., Taiyuan, China, Mining Dataset by North University of China) and the public SemanticKITTI dataset. Ablation experiments further demonstrate that the proposed approach surpasses Transformer-based methods and significantly reduces both temporal and spatial costs, making it better suited for the practical needs of mining operations.

2. Methodology

The overall architecture of the spatiotemporal aggregation network is illustrated in Figure 1. The input data is first processed by the point cloud input unit and the spatiotemporal clue-gated encoder, after which the processed data is passed through the U-Net network, which embeds multiple spatiotemporal feature select state-space modules. The final results are output through the semantic segmentation head.
Specifically, the point cloud input unit divides the input flow into two branches: one for serialized historical frame data and the other for current frame data. The sequence data consists of multiple frames of point clouds, providing an input format that is well-suited for subsequent feature extraction and learning.
The spatiotemporal clue-gated encoder first performs preliminary feature encoding on the serialized sequence data and current frame data using 3D sparse convolution layers. For the encoded sequence features, the spatiotemporal clue-embedding layer captures the spatiotemporal clues hidden within the multiple historical frame features, resulting in long-term spatiotemporal features. These long-term features are then dynamically balanced with the current features through the gating unit, which aggregates the features to produce effectively combined spatiotemporal representations.
Finally, the spatiotemporal feature select state-space module takes the aggregated spatiotemporal features as input and processes them through multiple 3D sparse convolution layers and Mamba blocks to perform global context modeling of the spatiotemporal features. Simultaneously, it leverages the typical encoder–decoder structure of U-Net to generate feature representations that are beneficial for semantic segmentation.
The following sections (Section 2.1, Section 2.2 and Section 2.3) provide a detailed introduction to the three components mentioned above.

2.1. Point Cloud Input

The point cloud input unit preprocesses the point cloud data to adapt it for the feature extraction methods used in different branches of the network. The architecture adopts a two-branch structure: one branch is dedicated to serialized historical frame data input, while the other handles the current frame data input. For the historical frame data, a series of continuous point cloud frames within a specified time window undergoes standard preprocessing to ensure proper formatting for subsequent processing. These sequence frame data and current frame data are expressed as follows:
$$X_t^s = \mathrm{concat}(S_t, S_{t-1}, \ldots, S_{t-i}), \qquad X_t = S_t$$
Here, $S_{t-i}$ represents the point cloud of the $(t-i)$-th frame, and $\mathrm{concat}(\cdot)$ denotes the concatenation operation. Specifically, the superscript $s$ in $X_t^s$ is not a point index but a symbolic indicator distinguishing the concatenated multi-frame point cloud sequence from the single-frame point cloud $X_t$. On the right-hand side of the equation, $\mathrm{concat}(S_t, S_{t-1}, \ldots, S_{t-i})$ represents the temporal concatenation of $(i+1)$ consecutive full-frame point clouds from time steps $t$ to $t-i$. Each $S_{t-j}$ contains a set of 3D points with coordinate and intensity attributes. The $\mathrm{concat}(\cdot)$ function stacks these frames along the temporal axis to form a unified spatiotemporal point cloud while preserving the per-frame spatial structure. This representation enables the model to extract temporal cues across multiple observations. Each point cloud frame at a given time $t$ typically consists of $N_t$ points, where each point $p_k = [x_k, y_k, z_k, r_k]^T$ is characterized by its 3D coordinates and the LiDAR reflectance intensity $r_k$. The sequence frame data $X_t^s$ is constructed by concatenating and aggregating the continuous point cloud frames from the $t$-th frame $S_t$ to the $(t-i)$-th frame $S_{t-i}$, where $i$ is typically determined based on a comprehensive evaluation of hardware conditions and model performance. The current frame data $X_t$ is simply $S_t$.
In our current implementation, the frames are evenly spaced within a fixed-length temporal window. This design choice ensures consistent temporal coverage while maintaining computational simplicity and efficiency. Specifically, when using a window of N frames, they are uniformly sampled from a predefined time span (e.g., 1–2 s), thereby capturing both recent and slightly earlier contextual information.
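To make this input format concrete, the following minimal sketch (our illustrative code, not the original implementation; function and variable names are assumptions) uniformly samples a fixed number of historical scans from a time window and stacks them alongside the current scan:

```python
# Minimal sketch of the two-branch point cloud input: uniformly sample a few
# historical scans from a fixed time window and stack them along the temporal
# axis, keeping the current scan as a separate branch.
import numpy as np

def build_inputs(scans, timestamps, t_now, num_frames=3, window_s=2.0):
    """scans: list of (N_i, 4) arrays [x, y, z, intensity]; timestamps: seconds."""
    in_window = [s for s, ts in zip(scans, timestamps) if t_now - window_s <= ts <= t_now]
    # uniformly pick num_frames scans across the window (most recent scan last)
    idx = np.linspace(0, len(in_window) - 1, num=num_frames).round().astype(int)
    selected = [in_window[i] for i in idx]
    # X_t^s: concatenated multi-frame sequence; a frame-index column marks the time step
    seq = np.concatenate(
        [np.hstack([s, np.full((len(s), 1), k, dtype=s.dtype)]) for k, s in enumerate(selected)],
        axis=0)
    current = in_window[-1]          # X_t: the current single frame
    return seq, current

# Example with synthetic 10 Hz scans
scans = [np.random.rand(1000, 4).astype(np.float32) for _ in range(20)]
times = [0.1 * i for i in range(20)]
x_seq, x_cur = build_inputs(scans, times, t_now=times[-1])
print(x_seq.shape, x_cur.shape)      # (3000, 5) (1000, 4)
```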

2.2. Spatiotemporal Clue-Gated Encoder

Living organisms perceive, analyze, and make decisions about external objects using spatiotemporal information. For example, when humans observe an object, they combine visual input with existing experiences and memories to make a comprehensive judgment about the object’s shape, category, and potential behavior, thereby enabling an appropriate response [25]. This ability relies on the efficient integration of historical information and the precise capture of current information. Moreover, humans can flexibly adjust the importance of the information they rely on based on specific circumstances, thereby improving the accuracy of their decisions. For instance, when encountering familiar objects, humans tend to rely more on memory, while for unfamiliar new objects, they tend to rely more on real-time perception. Similarly, in the field of AI, fully mining and leveraging spatiotemporal information can emulate the decision-making capabilities of living organisms, achieving more efficient and accurate responses in complex environments.
Based on these observations and analyses, we propose the spatiotemporal clue-gated encoder, as shown in Figure 2, which processes both the serialized sequence frame data ( X t s ) and the current frame data ( X t ) obtained from the point cloud input unit. These data are passed through separate convolutional layers for preliminary encoding. The encoding process is expressed as follows:
$$Y = \sigma(\mathrm{Norm}(\mathrm{SpC}(X)))$$
Here, $\mathrm{SpC}$ represents 3D sparse convolution; $\mathrm{Norm}$ refers to batch normalization; and $\sigma$ is the activation function, for which the ReLU function is used. Three-dimensional sparse convolution is advantageous for extracting point cloud features while maintaining a relatively low computational cost. After encoding, the sequence features $X_{SF}$ and the current features $X_{CF}$ are obtained.
To capture the spatiotemporal clues hidden in the sequence features $X_{SF}$, a spatiotemporal clue-embedding layer is designed to further process these features. The processing is described by the following equations:
$$X_{SF}^{in} = \sigma(\mathrm{Norm}(\mathrm{Linear}(X_{SF})))$$
$$X_{STF} = \mathrm{MambaBlock}(X_{SF}^{in})$$
Here, $\mathrm{Linear}$ represents a linear transformation layer, and $\sigma$ uses the GELU activation function. For the sequence features $X_{SF}$, the linear layer first correlates the features within the sequence and maps them to a high-dimensional space to facilitate the interaction of long-term dependency information. This process helps effectively relate the spatiotemporal clues embedded in the sequence, providing the foundation for subsequent clue extraction. Then, the features processed by the linear layer are input into the Mamba block (further details on the Mamba block can be found in Section 2.3) to further mine and capture the deeper spatiotemporal clues.
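A minimal sketch of this embedding step is given below; the module and dimension names are our assumptions, and a GRU stands in for the Mamba block purely for illustration:

```python
# Sketch of the spatiotemporal clue-embedding layer: linear projection with
# normalization and GELU, followed by a sequence model over the point tokens.
import torch
import torch.nn as nn

class ClueEmbedding(nn.Module):
    def __init__(self, in_dim=32, hidden_dim=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, hidden_dim),
                                  nn.LayerNorm(hidden_dim), nn.GELU())
        # stand-in for the Mamba block; a real implementation would use a selective SSM
        self.seq_model = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, x_sf):                 # x_sf: (B, N_tokens, in_dim) sequence features
        x_in = self.proj(x_sf)               # X_SF^in
        x_stf, _ = self.seq_model(x_in)      # long-term spatiotemporal features X_STF
        return x_stf

x_stf = ClueEmbedding()(torch.randn(2, 512, 32))
print(x_stf.shape)                           # torch.Size([2, 512, 64])
```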
The long-term spatiotemporal features $X_{STF}$, which are efficiently integrated and encoded, are obtained through the spatiotemporal clue-embedding layer. To enable the model to flexibly adjust the dependence between current and historical information based on specific circumstances, we propose a gating unit to dynamically balance the long-term spatiotemporal features $X_{STF}$ and the current features $X_{CF}$. The process is expressed as follows:
$$G_{aR} = \mathrm{Sigmoid}(\mathrm{Linear}(X_{CF}))$$
$$X_{STAF} = X_{CF} \times G_{aR} + X_{STF} \times (1 - G_{aR})$$
The gating mechanism first generates a weight representation $G_{aR}$ through a linear layer and a Sigmoid activation function. Then, $X_{CF}$ and $X_{STF}$ are multiplied by their respective weights, allowing the model to select the most useful features during the learning process. Finally, these weighted features are fused through element-wise addition to obtain the effectively aggregated long-term spatiotemporal features $X_{STAF}$. This process allows the model to flexibly adjust the importance of features from each branch while retaining key historical spatiotemporal information, achieving a dynamic balance of information dependence.
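The gating computation itself is straightforward; the sketch below (illustrative names and dimensions, not the original implementation) shows one possible realization in PyTorch:

```python
# Sketch of the gating unit: a sigmoid gate computed from the current features
# weights the current branch, and its complement weights the historical branch.
import torch
import torch.nn as nn

class GatingUnit(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.gate = nn.Linear(channels, channels)

    def forward(self, x_cf, x_stf):
        g = torch.sigmoid(self.gate(x_cf))       # G_aR, per-feature weights in (0, 1)
        return x_cf * g + x_stf * (1.0 - g)      # aggregated features X_STAF

x_cf, x_stf = torch.randn(2, 512, 64), torch.randn(2, 512, 64)
print(GatingUnit()(x_cf, x_stf).shape)           # torch.Size([2, 512, 64])
```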

2.3. Spatiotemporal Feature Select State-Space Module

Effectively modeling the long-term dependencies of the aggregated long-term spatiotemporal features and fully utilizing these features during the encoding and decoding processes is crucial for achieving accurate segmentation. Although Transformer models excel at capturing contextual information, their computational cost grows quadratically with the input size. This becomes particularly challenging when dealing with high-dimensional, unstructured data such as point clouds, resulting in significant computational overhead and limiting their efficiency and scalability in practical applications. In contrast, Mamba achieves linear complexity, which significantly reduces computational costs while maintaining the ability to process long sequences and model global context.
Mamba achieves linear time complexity, primarily due to its state-space model formulation, which enables sequence processing through a recurrent structure that is highly parallelizable and hardware-friendly. Unlike Transformers that rely on quadratic-time self-attention mechanisms, Mamba models temporal dependencies using selective kernel-based state updates, decoupling computation from sequence length. We chose Mamba for two main reasons: (1) its computational efficiency makes it particularly well-suited for long spatiotemporal point cloud sequences in real-time applications, and (2) its strong inductive bias toward sequential structures allows it to capture long-range dependencies more naturally, which is advantageous for structured 3D time-series modeling.
Recent studies have applied Mamba to sparse, unstructured data like point clouds, but these works have focused on either optimizing the internal structure of Mamba [28] or leveraging the advantages of multimodal data [32], without fully exploring the potential of the vanilla Mamba in point cloud processing. Therefore, we propose a spatiotemporal feature select state-space module, which leverages the advantages of 3D sparse convolutions in point cloud data. This allows the vanilla Mamba network to fully utilize its ability for global context modeling.
Specifically, as shown in Figure 3, the spatiotemporal feature select state-space module consists of a spatiotemporal encoder and a spatiotemporal decoder. The spatiotemporal encoder is composed of convolutional layers and Mamba blocks. The convolutional layers progressively expand the receptive field, facilitating the interaction of local features with long-term dependencies. After each convolutional layer, the features undergo global context modeling via Mamba blocks. Compared to efficient attention-based Transformer variants such as Performer [33] and Linformer [34], Mamba adopts a fundamentally different modeling paradigm. While Performer approximates softmax attention using random feature mappings and Linformer reduces attention complexity via low-rank projections, Mamba replaces attention entirely with a structured state-space model, enabling direct modeling of temporal dependencies through selective convolutional recurrence. This formulation achieves linear time and space complexity without relying on pairwise token similarity computations. The process is described by the following equation:
$$X_E^{Latent} = \sigma(\mathrm{BatchNorm}(\mathrm{SpC}(X_{STAF})))$$
$$X_{EO} = \mathrm{MambaBlock}(X_E^{Latent})$$
where $\mathrm{SpC}$ represents 3D sparse convolution and $\sigma$ is the ReLU activation function. The Mamba block is built on the basic Mamba model, whose core is the parameterized SelectiveSSM state-space model with a selection mechanism. SelectiveSSM transforms the time-invariant system into a linear time-varying system, which can be represented as follows:
$$\mathrm{SelectiveSSM}(x_t) = y_t$$
$$y_t = C h_t$$
$$h_t = \bar{A} h_{t-1} + \bar{B} x_t$$
Here, $x_t$ is the input representation, which is obtained by applying a linear mapping to $X_{EO}$. $h_t \in \mathbb{R}^N$ represents the latent state updated at time step $t$, with $N$ denoting the state size. $y_t$ is the output representation, and the discrete matrices $\bar{A}$ and $\bar{B}$ are obtained over the step size $\Delta$ according to the Zero-Order Hold (ZOH) rule:
$$\bar{A} = e^{\Delta A}$$
$$\bar{B} = (\Delta A)^{-1}\left(e^{\Delta A} - I\right)\Delta B$$
where $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, and $C \in \mathbb{R}^{1 \times N}$ are the model parameters.
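To make the discretization and recurrence concrete, the following numerical sketch applies the ZOH rule to fixed parameters and runs the linear-time scan over a toy sequence. In Mamba itself, $\Delta$, $B$, and $C$ are produced from the input, which is what makes the SSM selective; the values here are purely illustrative assumptions:

```python
# Numerical sketch of the ZOH discretization and the SSM recurrence
# h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t.
import torch

N = 4                                    # state size
A = -torch.eye(N)                        # stable continuous-time state matrix
B = torch.ones(N, 1)
C = torch.ones(1, N)
delta = torch.tensor(0.1)                # discretization step

A_bar = torch.matrix_exp(delta * A)                                          # exp(ΔA)
B_bar = torch.linalg.inv(delta * A) @ (A_bar - torch.eye(N)) @ (delta * B)   # ZOH input matrix

x = torch.randn(16, 1)                   # a length-16 scalar input sequence
h = torch.zeros(N, 1)
ys = []
for t in range(x.shape[0]):              # linear-time scan over the sequence
    h = A_bar @ h + B_bar @ x[t].reshape(1, 1)
    ys.append((C @ h).item())
print(ys[:4])
```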
The spatiotemporal decoder can be viewed as a mirrored operation of the spatiotemporal encoder, utilizing consecutive inverse convolution layers to reconstruct features while incorporating Mamba blocks to prevent the loss of crucial spatiotemporal context information.
$$X_D^{Latent} = \mathrm{MambaBlock}(X_{EO})$$
$$X_{DO} = \sigma(\mathrm{BatchNorm}(\mathrm{SpInvC}(X_D^{Latent})))$$
where $\mathrm{SpInvC}$ is 3D sparse inverse convolution and $\sigma$ is the ReLU activation function.
The spatiotemporal feature select state-space module focuses on partial dependencies through convolutional layers while utilizing Mamba blocks to provide a global perspective. This approach integrates local information into a comprehensive understanding of long-term dependencies. Additionally, the encoder and decoder use skip connections to blend features from different layers, enhancing spatial details. This design ensures that the model can capture complex temporal sequence features while maintaining computational efficiency.
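For illustration, the sketch below lays out this encoder-decoder design with skip connections, using dense 3D convolutions as stand-ins for the sparse convolutions and a lightweight token-mixing stub in place of the Mamba block. The channel widths follow the five-stage schedule reported in Section 3.2; all other details (kernel sizes, strides, stub internals) are our assumptions:

```python
# Minimal, self-contained sketch of the U-Net-style encoder-decoder with skip
# connections and Mamba-block placeholders described above.
import torch
import torch.nn as nn

class MambaBlockStub(nn.Module):
    """Placeholder for a Mamba block: flattens voxels into a token sequence,
    applies a residual per-token MLP, and restores the spatial layout. A real
    implementation would run a selective state-space scan over the sequence."""
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Sequential(nn.Linear(channels, channels), nn.GELU(),
                                 nn.Linear(channels, channels))

    def forward(self, x):                        # x: (B, C, D, H, W)
        b, c, d, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)       # (B, D*H*W, C)
        seq = seq + self.mix(seq)                # residual token mixing
        return seq.transpose(1, 2).reshape(b, c, d, h, w)

def conv_stage(cin, cout, stride):
    # dense stand-in for the 3D sparse convolution + BatchNorm + ReLU block
    return nn.Sequential(nn.Conv3d(cin, cout, 3, stride=stride, padding=1),
                         nn.BatchNorm3d(cout), nn.ReLU(inplace=True))

class SpatioTemporalUNet(nn.Module):
    def __init__(self, in_ch=4, channels=(32, 64, 128, 256, 256), num_classes=3):
        super().__init__()
        self.enc, self.enc_mamba = nn.ModuleList(), nn.ModuleList()
        prev = in_ch
        for i, c in enumerate(channels):
            self.enc.append(conv_stage(prev, c, stride=1 if i == 0 else 2))
            self.enc_mamba.append(MambaBlockStub(c))
            prev = c
        rev = list(channels[::-1])
        self.dec, self.dec_mamba = nn.ModuleList(), nn.ModuleList()
        for i in range(len(rev) - 1):
            self.dec_mamba.append(MambaBlockStub(rev[i]))
            self.dec.append(nn.Sequential(
                nn.ConvTranspose3d(rev[i], rev[i + 1], 2, stride=2),
                nn.BatchNorm3d(rev[i + 1]), nn.ReLU(inplace=True)))
        self.head = nn.Conv3d(channels[0], num_classes, 1)

    def forward(self, x):
        skips = []
        for conv, mamba in zip(self.enc, self.enc_mamba):
            x = mamba(conv(x))
            skips.append(x)
        skips = skips[:-1][::-1]                 # skip connections, deepest first
        for i, (mamba, deconv) in enumerate(zip(self.dec_mamba, self.dec)):
            x = deconv(mamba(x)) + skips[i]      # fuse decoder output with encoder features
        return self.head(x)

# Example: four input features (x, y, z, intensity) on a small voxel grid.
logits = SpatioTemporalUNet()(torch.randn(1, 4, 32, 32, 32))
print(logits.shape)                              # torch.Size([1, 3, 32, 32, 32])
```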

3. Experiments

This section is structured as follows: Section 3.1 introduces the datasets and evaluation metrics used to assess the proposed method. Section 3.2 details the experimental setup and deployment. Section 3.3 presents the quantitative comparison results for semantic segmentation. Section 3.4 provides an ablation study to evaluate the contribution of each component. Section 3.5 assesses the efficiency of the proposed method, and Section 3.6 offers qualitative visual comparison results.

3.1. Datasets and Evaluation Metrics

To evaluate the proposed method, we performed validation experiments on our self-constructed TZMD_NUC dataset (TZ Group Co., Ltd. Mining Dataset by North University of China) and conducted generalization experiments on the publicly available SemanticKITTI dataset.
TZMD_NUC Outdoor Mining Dataset: This dataset was collected by our research team at several active mining sites, using a Hesai 64-line LiDAR sensor and a visible light camera with a resolution of 1920 × 1200, as shown in Figure 4. The data was recorded during mining operations, with the sensors mounted on a machinery platform. A total of 9734 point cloud samples were collected and split into training, validation, and test sets in a 6:1:3 ratio. Based on the characteristics of the operational environment, the samples were annotated with three classes tailored to the needs of semantic segmentation in mining scenarios: mining trucks, roads, and material piles.
SemanticKITTI dataset: This large-scale, public outdoor point cloud dataset for autonomous driving tasks was collected using a 64-line LiDAR sensor. It contains 22 labeled sequences with over 43,000 frames of labeled scans. In this study, we followed the common experimental division settings for semantic segmentation tasks. Sequences 00 to 07 and 09 to 10 were used as the training set, sequence 08 as the validation set, and sequences 11 to 21 as the test set. Semantic labels were divided following MOS [10] for distance benchmarking.
For performance evaluation, we used the mean Intersection over Union (mIoU) as the primary metric. This metric effectively measures the overall performance of the model in multi-class semantic segmentation tasks and is suitable for unified evaluation across different datasets. The mIoU is calculated by computing the Intersection over Union (IoU) score for each category, then averaging the IoUs of all categories to obtain the final metric value:
$$\mathrm{mIoU} = \frac{1}{C}\sum_{i=1}^{C}\frac{TP_i}{TP_i + FP_i + FN_i}$$
where $TP_i$, $FP_i$, and $FN_i$ represent the true positives, false positives, and false negatives for category $i$, respectively, and $C$ is the total number of categories.
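For reference, the metric can be computed directly from integer prediction and label arrays, as in the short sketch below:

```python
# Sketch of the mIoU computation from per-point predictions and ground truth.
import numpy as np

def mean_iou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom > 0:                    # skip classes absent from both prediction and label
            ious.append(tp / denom)
    return float(np.mean(ious))

pred = np.random.randint(0, 3, size=10000)
gt = np.random.randint(0, 3, size=10000)
print(f"mIoU = {mean_iou(pred, gt, 3):.3f}")
```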

3.2. Experimental Setup

The experiments were conducted using two GeForce RTX 3090 GPUs. The experimental environment was built on Ubuntu 20.04.6 LTS, CUDA 11.6, Python 3.9.19, and PyTorch 1.13. During training, the learning rate was set to 0.006, with a weight decay of 0.01. The AdamW optimizer was used in combination with the Poly learning rate strategy, where the power parameter was set to 0.9. The model was trained for 50 epochs, with a batch size of 8 for both the TZMD_NUC and SemanticKITTI datasets.
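A minimal sketch of this optimizer and learning-rate schedule (with a stand-in model and an assumed step count) is shown below:

```python
# AdamW with a polynomial ("poly") decay: lr = base_lr * (1 - step / max_steps) ** power.
import torch

model = torch.nn.Linear(16, 3)            # stand-in for the segmentation network
optimizer = torch.optim.AdamW(model.parameters(), lr=0.006, weight_decay=0.01)

max_steps, power = 1000, 0.9
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: (1 - step / max_steps) ** power)

for step in range(5):                     # training-loop skeleton
    optimizer.zero_grad()
    loss = model(torch.randn(8, 16)).sum()
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(step, scheduler.get_last_lr())
```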
The hyperparameters in our model were chosen through a combination of empirical evaluation and references to existing literature. For activation functions, ReLU was used in 3D convolutional layers due to its simplicity and efficiency, while GELU was applied within Mamba modules to enhance training stability and representation quality for long-range dependencies. To determine the number of Mamba blocks, we performed a grid search over [1, 2, 3, …] blocks per stage and observed that using two blocks per encoder/decoder stage achieved the best trade-off between segmentation accuracy and computational cost on our deployment device.
During data preprocessing, the input scene range for the TZMD_NUC outdoor mining dataset was limited to [−65 m, −65 m, −8 m] to [65 m, 65 m, 7 m], while for the SemanticKITTI dataset, the input scene range was [−51.2 m, −51.2 m, −4 m] to [51.2 m, 51.2 m, 2.4 m] in order to adapt to the characteristics and scale of each dataset.
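The range limitation amounts to a simple bounding-box crop of each scan, as sketched below with the TZMD_NUC bounds (function and variable names are illustrative):

```python
# Keep only the points inside the per-dataset scene range.
import numpy as np

def crop_to_range(points, pc_min, pc_max):
    """points: (N, 4) array [x, y, z, intensity]; bounds in metres."""
    mask = np.all((points[:, :3] >= pc_min) & (points[:, :3] <= pc_max), axis=1)
    return points[mask]

points = np.random.uniform(-80, 80, size=(5000, 4)).astype(np.float32)
cropped = crop_to_range(points,
                        pc_min=np.array([-65.0, -65.0, -8.0]),
                        pc_max=np.array([65.0, 65.0, 7.0]))
print(points.shape, "->", cropped.shape)
```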
Regarding the network architecture, to ensure a fair evaluation of the proposed method, we removed the attention structure from the SphereFormer [15] network and used the remaining network as the baseline model. Downsampling in the U-Net structure was implemented as a five-stage hierarchy, with the number of feature channels progressively increasing as [32, 64, 128, 256, 256], whereas upsampling used a mirrored structure to gradually reduce the number of channels from 256 back to 32.

3.3. Segmentation Performance

TZMD_NUC dataset. Table 1 shows a comparison of evaluation metrics between the proposed method and several mainstream semantic segmentation methods on our self-constructed TZMD_NUC mining dataset, including PointNet++ [14], Cylinder3D [8], PointNeXt [11], Retro-FPN [12], SphereFormer [15], and Mseg3d [20]. The proposed method outperforms recent methods such as Retro-FPN [12] and Mseg3d [20], indicating that leveraging spatiotemporal features can effectively improve model performance. Compared to SphereFormer [15], the proposed method improves by 1.3%, suggesting that the Mamba network performs better when handling spatiotemporal features than traditional attention mechanisms. Additionally, compared to the Cylinder3D [8] method, which directly stacks historical frame data, the proposed method shows a 3.9% improvement in mIoU, highlighting the superiority of the proposed spatiotemporal cue-encoding mechanism. PointNet++ [14] improves local feature learning but lacks deep modeling of the dependencies between local and global features. In contrast, the proposed method constructs a more comprehensive model of feature dependency from local to global, achieving a 4.0% performance improvement. Furthermore, the improved model shows a 2.1% increase over the baseline, further confirming the effectiveness of long-sequence spatiotemporal information in improving model performance in outdoor mining scenarios.
Different Distances on SemanticKITTI Dataset. Owing to the physical characteristics of LiDAR, point cloud sparsity increases with distance. In mining scenarios with few targets and high real-time demands, distant regions often lack valid point cloud data. The TZMD_NUC dataset focuses on nearby targets, whereas SemanticKITTI includes abundant long-range data. To evaluate the proposed method’s robustness to point cloud sparsity, its performance on SemanticKITTI across different distance ranges is analyzed, as shown in Table 2. The table shows that recognition becomes harder as distance increases, with all methods declining in mIoU, recall, and precision. Compared with the baseline, our method achieves improvements across all metrics, with the mIoU at different distances increasing by an average of 1.9%. The BEV projection-based method LiMoSeg [9] and the range projection-based method MotionSeg3D [21] both perform poorly at long distances. InsMOS [35] improves segmentation performance by fusing spatiotemporal features from multiple point cloud scans, but because it does not model long-term spatiotemporal dependencies, it falls behind our method on several metrics. MF-MOS [36] uses a projection branch in its dual-branch network, which loses some 3D geometric information and limits its performance gains. In contrast, our network is entirely point-based.
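As an illustration of how such a distance-binned evaluation can be implemented (the exact protocol of Table 2, including the class set and per-class averaging, is not reproduced here; names and thresholds are assumptions), points can be grouped by their horizontal range before computing the metric:

```python
# Group points by range from the sensor and compute a per-bin IoU for one class.
import numpy as np

def iou_binary(pred, gt):
    tp = np.sum(pred & gt); fp = np.sum(pred & ~gt); fn = np.sum(~pred & gt)
    return tp / max(tp + fp + fn, 1)

def eval_by_distance(points, pred, gt, bins=((0, 20), (20, 50), (50, np.inf))):
    """points: (N, 3) xyz; pred/gt: (N,) boolean masks for one class."""
    dist = np.linalg.norm(points[:, :2], axis=1)      # horizontal range from the sensor
    return {f"{lo}-{hi} m": iou_binary(pred[(dist >= lo) & (dist < hi)],
                                       gt[(dist >= lo) & (dist < hi)])
            for lo, hi in bins}

pts = np.random.uniform(-60, 60, size=(20000, 3))
gt = np.random.rand(20000) > 0.5
pred = gt ^ (np.random.rand(20000) > 0.9)             # noisy synthetic predictions
print(eval_by_distance(pts, pred, gt))
```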

3.4. Ablation Study

Ablation studies were conducted on each module of the proposed method, with all experiments performed on the TZMD_NUC dataset. As shown in Table 3, adding the spatiotemporal clue-gated encoder resulted in a 0.6% improvement in mIoU. This improvement was achieved by stacking multiple frames of point cloud data and encoding the spatiotemporal features, demonstrating that effectively mining and encoding the potential spatiotemporal clues from historical data can significantly enhance the model’s performance. Further incorporating the spatiotemporal feature select state-space module, which provides more comprehensive context integration of the long-term spatiotemporal features aggregated by the spatiotemporal clue-gated encoder, led to an additional 1.3% increase in mIoU. This result highlights that effectively modeling long-term dependencies within spatiotemporal features enables a more complete representation of spatiotemporal information, thereby further improving the model’s semantic segmentation capability.
Additionally, an ablation experiment was conducted to investigate the performance of the spatiotemporal clue-gated encoder with different numbers of stacked input frames, as shown in Figure 5. As the number of input frames increased, the model’s performance peaked at three frames and then plateaued. This suggests that the spatiotemporal clue-gated encoder can efficiently extract key spatiotemporal clues from a small number of historical frames, making optimal use of limited input information. Compared with the strategy of directly stacking more frames, using the spatiotemporal clue-gated encoder reduces the reliance on a large number of historical frames, saving storage and computational resources while maintaining good performance.

3.5. Model Efficiency Evaluation

Since this study introduces Mamba, the proposed model is able to maintain high computational efficiency even when processing long sequences of spatiotemporal features. To further validate this advantage, we performed an efficiency evaluation to assess the proposed method’s computational resource usage and real-time performance. In this experiment, the proposed spatiotemporal feature select state-space module was replaced with the attention mechanism [15] for comparison, denoted as the BSTA method. The results in Table 4 indicate that the proposed method reduces the number of parameters by 22.0% and the latency by 25.3% compared to the BSTA method. While reducing computational overhead, the mIoU metric for semantic segmentation improved by 1.3%. These results demonstrate that, with the addition of the spatiotemporal feature select state-space module, the model not only maintains computational efficiency but also more effectively mines and models long-term spatiotemporal features, leading to improved semantic segmentation performance.

3.6. Visualization Comparison

Figure 6 shows the qualitative visualization results of the proposed method on the TZMD_NUC dataset. The first column shows the prediction from the baseline model, the second column displays the predictions from the proposed method, and the third column represents the ground-truth labels. Regions where the proposed method outperforms the baseline are marked with red circles, emphasizing the advantages of the proposed approach. It can be observed that the proposed method achieves more accurate segmentation of terrain boundaries, demonstrating that the incorporation of spatiotemporal feature modeling effectively enhances the prediction accuracy of semantic segmentation.

4. Discussion and Insights

In this study, we proposed a spatiotemporal feature aggregation method for 3D semantic segmentation specifically designed to address the challenges posed by sparse point clouds and unclear terrain boundaries in outdoor mining environments. The results of our experiments demonstrate the effectiveness of this approach, and several key insights can be drawn from the findings.

4.1. Importance of Spatiotemporal Information

One of the most significant observations from our experiments is the notable performance improvement achieved by incorporating spatiotemporal information into the segmentation model. As evidenced by the results on both the TZMD_NUC and SemanticKITTI datasets, the proposed method outperforms traditional methods that do not leverage spatiotemporal context, such as PointNet++ and Cylinder3D. This highlights the critical role of temporal cues in enhancing feature representation, especially in dynamic outdoor environments where object and terrain characteristics evolve over time. The spatiotemporal clue-gated encoder is particularly effective in extracting relevant information from historical frames, helping the model better understand long-term dependencies and contextual changes, which ultimately leads to more accurate segmentation.

4.2. Long-Term Dependencies and Computational Efficiency

The ablation study further underscores the importance of modeling long-term spatiotemporal dependencies. The integration of the spatiotemporal feature select state-space module resulted in a significant performance boost, showing that a more comprehensive modeling of the spatiotemporal context is essential for improving segmentation accuracy. Interestingly, the introduction of this module also led to a reduction in computational overhead, with a decrease in both model parameters and latency while still improving segmentation performance. This demonstrates that the proposed method not only enhances the model’s ability to leverage long-term spatiotemporal features but also maintains high computational efficiency, making it suitable for real-time applications in resource-constrained environments like outdoor mining sites.

4.3. Comparison with Multimodal Fusion Methods

Another important insight stems from comparing our method with other multimodal fusion approaches such as Two-streamMOS and LiDAR-IMU-GNSS. While these methods combine point cloud and image data, our method, which focuses on spatiotemporal feature aggregation, achieves better performance by effectively modeling the temporal dependencies of point cloud data. This result suggests that the challenge of aligning multimodal data, particularly the dimensional inconsistency between point clouds and images, leads to the loss of critical feature information during fusion. Our approach, on the other hand, avoids this issue by relying solely on point cloud data, showing that a more focused and effective exploitation of spatiotemporal features can outperform complex multimodal fusion techniques.

4.4. Implications for Practical Applications

The findings of this study have important implications for practical applications in outdoor mining and autonomous driving. In mining scenarios, the ability to accurately segment terrain boundaries and objects such as mining trucks and material piles is crucial for operational safety and efficiency. The proposed method, which combines spatiotemporal feature aggregation and long-term dependency modeling, offers a promising solution to improve semantic segmentation in such challenging environments. Moreover, the demonstrated computational efficiency of our approach makes it suitable for real-time deployment, even in environments with limited computational resources.
However, while the proposed method shows significant improvements over existing models, there are still challenges to address. For instance, the performance of the model can be affected by noise and occlusions in the point cloud data, which are common in real-world mining scenarios. Future work could explore methods for enhancing the robustness of the model to such disturbances, perhaps by incorporating noise reduction techniques or domain adaptation strategies.

4.5. Future Directions

Several directions for future work emerge from this study. First, further optimization of the spatiotemporal feature extraction process could lead to even more efficient models. Exploring more advanced temporal modeling techniques, such as recurrent neural networks or attention-based mechanisms, could provide a deeper understanding of temporal relationships in the data. Additionally, integrating other sensor modalities, such as thermal imaging or radar, could further enhance the model’s robustness and accuracy in complex, real-world environments. While our current approach uses uniformly spaced frames, future work could explore more adaptive frame selection strategies—such as motion-aware sampling, key-frame detection, or content-aware heuristics—which may further enhance efficiency and accuracy by focusing on the most informative temporal segments.
Another potential avenue for improvement is the handling of dynamic elements in the environment, such as moving vehicles or changing terrain conditions. Further research could investigate methods for real-time adaptation to these dynamic changes, enabling more effective and accurate segmentation in ever-evolving scenarios.
In addition, enhancing the gating mechanism used in the spatiotemporal clue-gated encoder presents a promising direction. While the current design employs a simple sigmoid activation over a linear projection to maintain computational efficiency, more dynamic strategies—such as time-aware modulation or residual-based gating—could better capture evolving spatiotemporal patterns and adapt to varying feature contexts. Furthermore, although the current model has shown stable performance under moderate noise levels in historical frames, future work could investigate the sensitivity of the gating unit to more severe sensor noise or outlier contamination, potentially guiding the development of more robust and adaptive feature fusion mechanisms.

5. Conclusions

This paper proposes a novel 3D semantic segmentation method based on spatiotemporal feature aggregation specifically designed to address the challenges posed by blurred terrain boundaries and sparse point cloud data in outdoor mining environments. Both quantitative and qualitative experimental results on the TZMD_NUC outdoor mining dataset demonstrate that the proposed method significantly enhances recognition capabilities, particularly in achieving more accurate boundary feature segmentation. Additionally, experiments on the SemanticKITTI public dataset further validate the method’s effectiveness for semantic segmentation in diverse outdoor environments, confirming its generalizability. Moreover, the spatiotemporal feature select state-space module enhances both computational efficiency and long-term feature modeling. Overall, the proposed method improves segmentation accuracy and reduces computational costs, making it a promising solution for real-time, resource-efficient 3D semantic segmentation in outdoor mining and other complex environments.

Author Contributions

W.Y. and S.W. conceived the algorithms. R.G. provided the theoretical analysis. Y.W., H.Y. and T.W. were responsible for data collection and preprocessing. L.K. and X.H. conceived and supervised the project. All the authors contributed to the analysis of results and manuscript writing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 62272426), the Shanxi Province Science and Technology Major Special Project (Grant No. 202201150401021), and the Fundamental Research Program of Shanxi Province (Grant Nos. TZLH20230818005, 202303021212189, 202303021211153, 202203021212138, 202303021212206, 202303021212372, and 202403021212166).

Data Availability Statement

The data supporting the results of this work are available from the corresponding author upon reasonable request. Code for this article is available from the corresponding author upon reasonable request.

Conflicts of Interest

Authors Yongpeng Wang, Haifeng Yue, and Tao Wei were employed by the company Shanxi TZCO Intelligent Mining Equipment Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Chen, L.; Li, Y.; Silamu, W.; Li, Q.; Ge, S.; Wang, F.-Y. Smart mining with autonomous driving in industry 5.0: Architectures, platforms, operating systems, foundation models, and applications. IEEE Trans. Intell. Veh. 2024, 9, 4383–4393. [Google Scholar] [CrossRef]
  2. Zheng, C.; Liu, L.; Meng, Y.; Wang, M.; Jiang, X. Passable area segmentation for open-pit mine road from vehicle perspective. Eng. Appl. Artif. Intell. 2024, 129, 107610. [Google Scholar] [CrossRef]
  3. Ge, S.; Wang, F.-Y.; Yang, J.; Ding, Z.; Wang, X.; Li, Y.; Teng, S.; Liu, Z.; Ai, Y.; Chen, L. Making standards for smart mining operations: Intelligent vehicles for autonomous mining transportation. IEEE Trans. Intell. Veh. 2022, 7, 413–416. [Google Scholar] [CrossRef]
  4. Zhao, R.; Han, X.; Guo, X.; Kuang, L.; Yang, X.; Sun, F. Exploring the point feature relation on point cloud for multi-view stereo. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6747–6763. [Google Scholar] [CrossRef]
  5. Guo, Y.; Sohel, F.; Bennamoun, M.; Lu, M.; Wan, J. Rotational projection statistics for 3D local surface description and object recognition. Int. J. Comput. Vis. 2013, 105, 63–86. [Google Scholar] [CrossRef]
  6. Wang, J.; Li, D.; Long, Q.; Zhao, Z.; Gao, X.; Chen, J.; Yang, K. Real-time semantic segmentation for underground mine tunnel. Eng. Appl. Artif. Intell. 2024, 133, 108269. [Google Scholar] [CrossRef]
  7. Sarker, S.; Sarker, P.; Stone, G.; Gorman, R.; Tavakkoli, A.; Bebis, G.; Sattarvand, J. A comprehensive overview of deep learning techniques for 3D point cloud classification and semantic segmentation. Mach. Vis. Appl. 2024, 35, 67. [Google Scholar] [CrossRef]
  8. Zhu, X.; Zhou, H.; Wang, T.; Hong, F.; Li, W.; Ma, Y.; Li, H.; Yang, R.; Lin, D. Cylindrical and asymmetrical 3d convolution networks for lidar-based perception. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6807–6822. [Google Scholar] [CrossRef]
  9. Mohapatra, S.; Hodaei, M.; Yogamani, S.; Milz, S.; Gotzig, H.; Simon, M.; Rashed, H.; Maeder, P. LiMoSeg: Real-time bird’s eye view based LiDAR motion segmentation. arXiv 2021, arXiv:2111.04875. [Google Scholar]
  10. Chen, X.; Li, S.; Mersch, B.; Wiesmann, L.; Gall, J.; Behley, J.; Stachniss, C. Moving object segmentation in 3D LiDAR data: A learning-based approach exploiting sequential data. IEEE Robot. Autom. Lett. 2021, 6, 6529–6536. [Google Scholar] [CrossRef]
  11. Qian, G.; Li, Y.; Peng, H.; Mai, J.; Hammoud, H.; Elhoseiny, M.; Ghanem, B. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. Adv. Neural Inf. Process. Syst. 2022, 35, 23192–23204. [Google Scholar]
  12. Xiang, P.; Wen, X.; Liu, Y.-S.; Zhang, H.; Fang, Y.; Han, Z. Retro-fpn: Retrospective feature pyramid network for point cloud semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 17826–17838. [Google Scholar]
  13. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  14. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  15. Lai, X.; Chen, Y.; Lu, F.; Liu, J.; Jia, J. Spherical transformer for lidar-based 3d recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17545–17555. [Google Scholar]
  16. Singh, D.P.; Yadav, M. Deep learning-based semantic segmentation of three-dimensional point cloud: A comprehensive review. Int. J. Remote Sens. 2024, 45, 532–586. [Google Scholar] [CrossRef]
  17. Wu, X.; Hou, Y.; Huang, X.; Lin, B.; He, T.; Zhu, X.; Ma, Y.; Wu, B.; Liu, H.; Cai, D. TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 15311–15320. [Google Scholar]
  18. Ouadi, B.; Khatir, A.; Magagnini, E.; Mokadem, M.; Abualigah, L.; Smerat, A. Optimizing silt density index prediction in water treatment systems using pressure-based gradient boosting hybridized with Salp Swarm Algorithm. J. Water Process Eng. 2024, 68, 106479. [Google Scholar] [CrossRef]
  19. Khatir, A.; Capozucca, R.; Khatir, S.; Magagnini, E.; Cuong-Le, T. Enhancing Damage Detection Using Reptile Search Algorithm-Optimized Neural Network and Frequency Response Function. J. Vib. Eng. Technol. 2025, 13, 88. [Google Scholar] [CrossRef]
  20. Li, J.; Dai, H.; Han, H.; Ding, Y. Mseg3d: Multi-modal 3D semantic segmentation for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21694–21704. [Google Scholar]
  21. Sun, J.; Dai, Y.; Zhang, X.; Xu, J.; Ai, R.; Gu, W.; Chen, X. Efficient spatial-temporal information fusion for lidar-based 3d moving object segmentation. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 11456–11463. [Google Scholar]
  22. Duerr, F.; Pfaller, M.; Weigel, H.; Beyerer, J. Lidar-based recurrent 3d semantic segmentation with temporal memory alignment. In Proceedings of the 2020 International Conference on 3D Vision (3DV), Fukuoka, Japan, 25–28 November 2020; pp. 781–790. [Google Scholar]
  23. Liu, Y.; Kong, L.; Cen, J.; Chen, R.; Zhang, W.; Pan, L.; Chen, K.; Liu, Z. Segment any point cloud sequences by distilling vision foundation models. Adv. Neural Inf. Process. Syst. 2024, 36, 37193–37229. [Google Scholar]
  24. Xia, Z.; Liu, Y.; Li, X.; Zhu, X.; Ma, Y.; Li, Y.; Hou, Y.; Qiao, Y. Scpnet: Semantic scene completion on point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17642–17651. [Google Scholar]
  25. Sun, J.; Xie, Y.; Zhang, S.; Chen, L.; Zhang, G.; Bao, H.; Zhou, X. You don’t only look once: Constructing spatial-temporal memory for integrated 3D object detection and tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3185–3194. [Google Scholar]
  26. Ma, J.; Zhang, J.; Xu, J.; Ai, R.; Gu, W.; Chen, X. Overlaptransformer: An efficient and yaw-angle-invariant transformer network for lidar-based place recognition. IEEE Robot. Autom. Lett. 2022, 7, 6958–6965. [Google Scholar] [CrossRef]
  27. Li, Q.; Zhuang, Y. An efficient image-guided-based 3D point cloud moving object segmentation with transformer-attention in autonomous driving. Int. J. Appl. Earth Obs. Geoinf. 2023, 123, 103488. [Google Scholar] [CrossRef]
  28. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  29. Zhu, Q.; Cai, Y.; Fang, Y.; Yang, Y.; Chen, C.; Fan, L.; Nguyen, A. Samba: Semantic segmentation of remotely sensed images with state space model. Heliyon 2024, 10, e38495. [Google Scholar] [CrossRef]
  30. Xing, Z.; Ye, T.; Yang, Y.; Liu, G.; Zhu, L. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco, 6–10 October 2024; pp. 578–588. [Google Scholar]
  31. Zeng, K.; Shi, H.; Lin, J.; Li, S.; Cheng, J.; Wang, K.; Li, Z.; Yang, K. MambaMOS: LiDAR-based 3D Moving Object Segmentation with Motion-aware State Space Model. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 1505–1513. [Google Scholar]
  32. Liao, D.; Wang, Q.; Lai, T.; Huang, H. Joint Classification of Hyperspectral and LiDAR Data Base on Mamba. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5530915. [Google Scholar] [CrossRef]
  33. Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L. Rethinking attention with performers. arXiv 2020, arXiv:2009.14794. [Google Scholar]
  34. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar]
  35. Wang, N.; Shi, C.; Guo, R.; Lu, H.; Zheng, Z.; Chen, X. Insmos: Instance-aware moving object segmentation in lidar data. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 7598–7605. [Google Scholar]
  36. Cheng, J.; Zeng, K.; Huang, Z.; Tang, X.; Wu, J.; Zhang, C.; Chen, X.; Fan, R. Mf-mos: A motion-focused model for moving object segmentation. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 12499–12505. [Google Scholar]
Figure 1. The overall architecture of the spatiotemporal aggregation network.
Figure 2. Architecture of the spatiotemporal clue-gated encoder.
Figure 3. Spatiotemporal feature select state-space module.
Figure 4. Mining operation environments.
Figure 5. Ablation study on the impact of input frame counts.
Figure 6. Qualitative visualization results on the TZMD_NUC dataset.
Table 1. Comparison with different methods on our self-constructed TZMD_NUC mining dataset.

Method              mIoU   Mining Truck@IoU   Ground@IoU   Ore Pile@IoU
PointNet++ [14]     79.4   79.6               79.4         79.3
Cylinder3D [8]      79.5   79.8               79.7         79.0
PointNeXt [11]      80.3   80.5               79.8         80.5
Retro-FPN [12]      82.4   82.4               82.6         82.2
SphereFormer [15]   82.6   83.2               82.4         82.3
Mseg3d [20]         81.8   82.1               81.5         81.8
Baseline            81.3   81.4               81.1         81.4
Ours                83.4   83.6               83.7         82.8
Table 2. Comparison with different methods at different distances on the SemanticKITTI val set.

                     Close (<20 m)               Medium (≥20 m, <50 m)       Far (>50 m)
Method               mIoU   Recall  Precision    mIoU   Recall  Precision    mIoU   Recall  Precision
LiMoSeg [9]          70.35  77.64   87.74        45.78  54.28   70.41         8.74   8.92   91.27
MotionSeg3D [21]     71.66  79.97   87.35        52.21  59.27   81.40         4.99   4.99  100.00
InsMOS [35]          75.29  88.78   83.21        57.67  66.81   80.84        10.88  10.89   98.63
SphereFormer [15]    81.08  85.30   92.73        68.70  70.43   79.47        47.25  48.71   95.78
MF-MOS [36]          79.31  84.98   92.23        54.67  64.10   78.81        47.97  50.08   91.94
Baseline             78.96  84.03   90.92        68.71  68.83   77.43        47.30  48.67   93.75
Ours                 81.12  86.35   93.62        70.48  72.18   85.59        49.17  52.85   95.92
Table 3. Ablation results of each module.

Spatiotemporal Clue-Gated Encoder   Spatiotemporal Feature Select State-Space Module   mIoU   Δ mIoU
×                                   ×                                                  81.5   -
✓                                   ×                                                  82.1   +0.6
✓                                   ✓                                                  83.4   +1.3
Table 4. Model efficiency evaluation results.

Method   Parameters   Latency     mIoU
BSTA     32.3 M       132.38 ms   82.1
Ours     25.2 M       98.93 ms    83.4
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, W.; Kuang, L.; Wang, S.; Han, X.; Guo, R.; Wang, Y.; Yue, H.; Wei, T. Spatiotemporal Contextual 3D Semantic Segmentation for Intelligent Outdoor Mining. Algorithms 2025, 18, 383. https://doi.org/10.3390/a18070383

AMA Style

Yang W, Kuang L, Wang S, Han X, Guo R, Wang Y, Yue H, Wei T. Spatiotemporal Contextual 3D Semantic Segmentation for Intelligent Outdoor Mining. Algorithms. 2025; 18(7):383. https://doi.org/10.3390/a18070383

Chicago/Turabian Style

Yang, Wenhao, Liqun Kuang, Song Wang, Xie Han, Rong Guo, Yongpeng Wang, Haifeng Yue, and Tao Wei. 2025. "Spatiotemporal Contextual 3D Semantic Segmentation for Intelligent Outdoor Mining" Algorithms 18, no. 7: 383. https://doi.org/10.3390/a18070383

APA Style

Yang, W., Kuang, L., Wang, S., Han, X., Guo, R., Wang, Y., Yue, H., & Wei, T. (2025). Spatiotemporal Contextual 3D Semantic Segmentation for Intelligent Outdoor Mining. Algorithms, 18(7), 383. https://doi.org/10.3390/a18070383
