You are currently viewing a new version of our website. To view the old version click .
Sensors
  • Article
  • Open Access

5 July 2024

Memory-Augmented 3D Point Cloud Semantic Segmentation Network for Intelligent Mining Shovels

,
,
,
,
,
and
1
School of Mechatronics Engineering, Henan University of Science and Technology, Luoyang 471023, China
2
School of Control Science and Engineering, Dalian University of Technology, Dalian 116024, China
3
Henan Intelligent Manufacturing Equipment Engineering Technology Research Center, Luoyang 471003, China
4
Henan Engineering Laboratory of Intelligent Numerical Control Equipment, Luoyang 471003, China
This article belongs to the Section Sensing and Imaging

Abstract

The semantic segmentation of the 3D operating environment represents the key to intelligent mining shovels’ autonomous digging and loading operation. However, the complexity of the operating environment of intelligent mining shovels presents challenges, including the variety of scene targets and the uneven number of samples. This results in low accuracy of 3D semantic segmentation and reduces the autonomous operation accuracy of the intelligent mine shovels. To solve these issues, this paper proposes a 3D point cloud semantic segmentation network based on memory enhancement and lightweight attention mechanisms. This model addresses the challenges of an uneven number of sampled scene targets, insufficient extraction of key features to reduce the semantic segmentation accuracy, and an insufficient lightweight level of the model to reduce deployment capability. Firstly, we investigate the memory enhancement learning mechanism, establishing a memory module for key semantic features of the targets. Furthermore, we address the issue of forgetting non-dominant target point cloud features caused by the unbalanced number of samples and enhance the semantic segmentation accuracy. Subsequently, the channel attention mechanism is studied. An attention module based on the statistical characteristics of the channel is established. The adequacy of the expression of the key features is improved by adjusting the weights of the features. This is done in order to improve the accuracy of semantic segmentation further. Finally, the lightweight mechanism is studied by adopting the deep separable convolution instead of conventional convolution to reduce the number of model parameters. Experiments demonstrate that the proposed method can improve the accuracy of semantic segmentation in the 3D scene and reduce the model’s complexity. Semantic segmentation accuracy is improved by 7.15% on average compared with the experimental control methods, which contributes to the improvement of autonomous operation accuracy and safety of intelligent mining shovels.

1. Introduction

Mining shovels represent the fundamental mechanical apparatus employed in the extraction of mineral resources from open pits. They integrate the two primary functions of ‘digging’ and ‘loading’, as illustrated in Figure 1 []. The overall performance of mining shovels significantly impacts the efficiency and economic viability of the entire mining operation, which is of paramount importance to national resource extraction and energy security [].
Figure 1. Operation of the mining shovel in the open pit mine: (a) mining shovels for digging operations; (b) loading operations with mining shovels.
Traditional mining shovels rely on manual operation, and there are many problems in the operation process. These include a low total bucket rate, high energy consumption, high failure rate, poor operational safety, and fatigue-related injury of operators. Consequently, research into the intelligence of mining shovels has become an important topic of industry development. Intelligent mining shovels can operate autonomously, utilizing advanced technologies such as artificial intelligence, robotics, automatic control, and others, to integrate functions related to sensing the operating environment, detecting the shovel’s position, planning the shovel’s chassis path, and planning the trajectory of the shovel’s digging motion.
Semantic segmentation of a 3D operation scene refers to the division of the scene into several target regions, each of which has the same or similar attributes. Each area is then assigned semantic information according to the attributes of the region and the target features. This enables the segmentation and recognition of different kinds of targets in the operation scene. The semantic segmentation of a 3D operating scene represents the foundation of the perception of an operating environment. It is also a pivotal aspect of the autonomous operation of intelligent mining shovels. The low accuracy of semantic segmentation in the 3D operation scene will render the intelligent mining shovels incapable of accurately segmenting and identifying the principal operation targets. These targets include the material pile and the mining cart. This will result in a high degree of deviation in the autonomous excavation and loading operation, and a concomitant reduction in the accuracy of autonomous operation. Concurrently, accurately segmenting and identifying objects—such as people and vehicles—within the operational environment is challenging. This can potentially lead to safety incidents.
The targets to be sensed in the operational environment of intelligent mining shovels include many objects including stockpiles, mining trucks, shovels, the ground, people, vehicles, auxiliary equipment, and other miscellaneous objects. These are depicted in Figure 2. Due to the extensive range of targets to be sensed and the diversity of shape structures and physical attributes, the key features of the objects are obscured by redundant features during the deep learning training process. This results in lower accuracy of semantic segmentation. Concurrently, the number of point cloud samples of different objects obtained from the actual sampling exhibits a considerable degree of unevenness, which is attributable to the influence of the characteristics of the operating conditions. It is typical for objects such as stockpiles, mining trucks, buckets, and the ground to have a more significant number of point cloud samples; whereas objects such as vehicles and people tend to have a smaller number of samples. In the case of non-dominant targets with a limited number of samples, the key features of these objects are easily overlooked during the deep learning training process, which subsequently leads to a reduction in the accuracy of semantic segmentation. Furthermore, the intricate nature of the component systems of intelligent mining shovels renders the computational task particularly onerous. The operating environment sensing system, digging trajectory planning system, and chassis path planning system all make high demands on computational resources; this necessitates the lightweight level and deployment capability of the semantic segmentation method for 3D operating scenes. Consequently, accurate and lightweight implementation of semantic segmentation of 3D operation scenes under unbalanced samples is a pivotal technical challenge. This challenge must be addressed urgently to facilitate the sensing of intelligent mining shovels within 3D operational environments.
Figure 2. The main segmentation objects for the operating environment of the intelligent mining shovel.

3. Semantic Segmentation of 3D Operational Scenes Based on Memory Enhancement and Lightweight Attention Mechanisms

This paper presents a semantic segmentation network model for intelligent mining shovels with 3D operation scenes constructed using PointConv as the basic framework. The main reason for using PoincConv is that it can efficiently perform convolution operations on non-uniformly sampled 3D point cloud data, with both translation invariance and point sequential permutation invariance. The overall network structure, along with the memory module, channel attention module, and network lightweight mechanism are described in this section.

3.1. Overall Network Structure

In order to address the challenges of sample imbalance, insufficient key feature extraction, and model deployment encountered in the semantic segmentation of smart mining shovels, this paper presents a network model for semantic segmentation based on the memory module and the lightweight attention mechanism, MALMConv (PointConv + Memory Module + Attention Module + Lightweight Module, MALMConv). This model offers an accurate and lightweight implementation of intelligent mining shovels’ semantic segmentation under unbalanced samples, as illustrated in Figure 3.
Figure 3. Overall architecture of MALMConv.
The MALMConv employs PointConv as its fundamental framework, adding memory modules between the encoder and decoder of the original PointConv framework. Furthermore, corresponding memory modules are incorporated into the hop-connected branched network. Consequently, the enhancement of all the target point clouds’ key semantic features extracted by the encoder is achieved prior to their entry into the decoder. The introduction of the memory module has resulted in a significant improvement in the accuracy of semantic segmentation. This improvement is particularly notable in intelligent mining shovels operating in 3D environments. This is due to the fundamental mitigation of forgetting the training of non-dominant target features caused by the unbalanced number of target point cloud samples. Furthermore, in order to enhance the efficacy of PointConv’s key feature extraction in the encoding stage, a channel attention module has been incorporated to dynamically adjust the weights of distinct feature channels, thereby optimizing the expression of pivotal key point cloud features pertinent to the current task. This, in turn, serves to reinforce the precision of the semantic segmentation of the 3D point clouds. Concurrently, in order to optimize the utilization of limited computational resources and enhance the deployment capabilities of the model, depth separable convolution is employed in lieu of conventional convolution during the coding stage. This approach results in a reduction in the number of model parameters and an improvement in the model’s lightweight nature, as well as its actual engineering deployment capability. Typically, depth-separable convolution can be employed in conjunction with the attention module to extract point cloud features. This involves applying depth-separable convolution initially, followed by adjusting the weights of feature channels based on the channel attention module.

3.2. Memory Module

As illustrated in Figure 4, the memory module is composed of two primary components: a memory pool and an an addresser. The memory pool serves to record all the key semantic features of the target to be segmented, while the addresser is employed to access the most relevant key semantic features stored in the memory pool. The memory pool comprises a number of memory slots, each of which is designed to record the stored features of a target to be segmented. Upon receipt of an input feature, the addresser will calculate the degree of similarity between the input feature and each stored feature within the memory pool. This value will then be used as a weighted average of all stored features within the memory pool, with the aim of ultimately obtaining the memory feature corresponding to the input feature. The ReLU activation function employed in the calculation of weights serves to suppress smaller weights, thereby enabling the retention of only those semantic features exhibiting greater similarity to the input features. This process is essential for the successful implementation of semantic segmentation in relation to the category labels.
Figure 4. Memory module. (The different colors are to distinguish the feature vectors).
During training, the memory content in the memory pool is dynamically updated along with the encoders and decoders throughout the segmentation network, driven by the goal of the 3D point cloud semantic segmentation task. Due to the sparse addressing strategy, the memory module is motivated to efficiently use the memory slots in the memory pool to store the key semantic features of the target point cloud acquired during the batch sample learning process. During training, the memory module dynamically learns and remembers all the key semantic features of the targets to be segmented. In the test phase, the learned memories are fixed. Based on the input features, the memory module retrieves the most relevant key semantic features stored in the memory pool and feeds the memory-enhanced point cloud semantic features to the decoder for 3D point cloud semantic segmentation. Therefore, under the action of the memory module, the point cloud semantic features of the non-dominant targets can be enhanced by retrieving the key semantic features stored in the memory pool, thus alleviating the problem of forgetting the features of the non-dominant targets and improving the accuracy of semantic segmentation of intelligent mining shovels in 3D operational scenes.
The memory pool is designed as a matrix M R N × D , where N is the number of memory slots and D is the dimension of the memory feature. Typically, N is the number of the semantic categories, and D is the dimensional size of the input point cloud features. f. m i R 1 × D is the ith row vector of the memory pool M, representing the memory contents in the ith memory slot, each corresponding to a semantic category. The memory pool can be viewed as a dictionary that continuously records and updates the key semantic features of the target point cloud to be segmented during the training process.
The addresser is mainly used to obtain the addressing vector and retrieve the key semantic feature f ^ most relevant to the input feature f from the memory pool based on this addressing vector to achieve memory enhancement of the input feature. First, the addresser obtains the address weight wi by calculating the similarity between the input feature f and the key semantic feature mi stored in the memory pool, where wi is the ith element of the address vector w. The address weight wi is calculated as follows:
w i = exp ( τ ( f , m i ) ) i = 1 N exp ( τ ( f , m i ) ) ,
where τ ( f , m i ) is a metric function used to compute the feature similarity between the input point cloud features f and the key semantic features mi stored in the memory pool. τ ( f , m i ) is computed as follows:
τ ( f , m i ) = j = 1 D ( f j f ¯ ) ( m i j m ¯ i ) j = 1 D ( f j f ¯ ) 2 j = 1 D ( m i j m ¯ i ) 2 ,
where the means f ¯ and m ¯ i are calculated as follows:
f ¯ = 1 D j = 1 D f j
m ¯ i = 1 D j = 1 D m i j
To further suppress the influence of irrelevant memory content on the semantic features of the memory-enhanced point cloud, and to make the memory module more efficient at storing the key features, the addressing weight wi is sparsified using the ReLU function. The sparsified address weights w ˜ i are computed as follows:
w ˜ i = max ( w i β , 0 ) w i | w i β | + ε
where max ( , 0 ) is the ReLU function, β is the shrinkage threshold, and ε is the moderating factor. After the sparsification process, the sparsified addressing weight w ˜ i is normalized. The normalized weight w ^ i is calculated as follows:
w ^ i = w ˜ i i = 1 N | w ˜ i |
Finally, the key semantic features f ^ of the point cloud retrieved from the memory module are computed as follows:
f ^ = i = 1 N w ^ i m i ,
By constructing the memory module, the storage of the key semantic features of the point cloud of the target to be segmented is achieved, which fundamentally alleviates the feature forgetting problem of the non-dominant target during the training process of the semantic segmentation network, and effectively improves the semantic segmentation accuracy rate of intelligent mining shovels in the 3D operation scene.

3.3. Channel-Wise Attention Module

The framework of the feature channel attention module is shown in Figure 5. It consists of three key components: feature compression, excitation, and weight adjustment.
Figure 5. Channel-wise attention module. (Different colors indicate different feature vectors after weight adjustment).

3.3.1. Feature Compression

On each feature channel, the input point feature P = { P k | 1 k c } is compressed into a real number by feature compression, and the real number is used as the channel descriptor, where k denotes the channel sequence number. M ¯ = { μ k | 1 k c } denotes the channel descriptor based on the mean value over all feature channels, where the channel descriptor μ k corresponding to feature channel k is calculated as follows:
μ k = 1 h × w i = 1 h j = 1 w U k ( i , j ) ,
where h × w is the spatial dimension of the input features. D ¯ = { σ k | 1 k c } is the standard deviation-based channel descriptor over all feature channels, where the channel descriptor σ k corresponding to feature channel k is calculated as follows:
σ k = 1 h × w i = 1 h j = 1 w ( U k ( i , j ) μ k ) 2 ,
R = { r k | 1 k c } denotes the two-paradigm number based on two statistically significant features—mean and standard deviation—over all feature channels, where the channel descriptor r k corresponding to feature channel k is computed as follows:
r k = μ k 2 + σ k 2 ,
The feature compression operation compresses and aggregates multiple spatial dimensional features on each feature channel to generate a channel descriptor. The channel descriptor has a global information-sensing field that can effectively reflect the average value and range of variation of all features on the feature channel.

3.3.2. Activation

The channel descriptor R is obtained by the first fully connected layer operation to obtain the weight coefficient W 1 R 1 × 1 × c / r , which is activated using the ReLU function to obtain the weight coefficient W ˜ 1 R 1 × 1 × c / r , where r denotes the scaling factor of the gating mechanism. Adopting ReLU as the activation function helps to alleviate the vanishing gradient problem and improve the convergence speed. Then, the weight coefficients W 2 R 1 × 1 × c are obtained by the second fully connected layer operation, and the weight coefficients S R 1 × 1 × c are obtained after activation using the sigmoid function. The sigmoid is used as the activation function, which is mainly used to obtain normalized weight coefficients with values ranging from 0 to 1. The weight coefficient S R 1 × 1 × c is used to measure the importance of each feature channel for the semantic segmentation task and is calculated as follows:
S = β ( W 2 δ ( W 1 R ) ) ,
where δ is the ReLU activation function and β is the Sigmoid activation function.

3.3.3. Weighting

The input features are weighted using the weight coefficients S R 1 × 1 × c . After fitting, the feature T = { T k | 1 k c } is obtained, and the weight calibration is calculated as follows:
T k = s k P k ,
By constructing the channel attention module to adjust the weights of the features, the weights can be calibrated according to the magnitude of the role of different feature channels. This can stimulate the useful information that is more critical to the current task, and suppress the irrelevant information. Then, it can improve the exploitation of the key features within the limited computational resources, and contribute to the improvement of the semantic segmentation accuracy.

3.4. Depthwise Separable Convolution

In the 3D point cloud semantic segmentation network proposed in this paper, depthwise separable convolution is employed instead of regular convolution. The objective is to reduce the number of parameters of the model. This enhances the lightweight nature of the model, thereby facilitating its deployment in practical applications. As illustrated in Figure 6a, conventional convolution employs a convolution kernel with an identical number of channels to those of the input data, thus enabling the execution of both cross-channel operations and spatial operations. Figure 6b illustrates that depthwise separable convolution decomposes the regular convolution operation into depth-by-depth convolution and point-by-point convolution. Firstly, depth-by-depth convolution performs separate independent convolution operations on each feature channel. Secondly, point-by-point convolution employs the 1 × 1 convolution operation to combine the depth-by-depth convolution outputs on different channels in order to achieve cross-channel operations. The decomposition of cross-channel and spatial operations enables the reduction of the number of parameters in semantic segmentation network models. This is achieved through the use of depthwise separable convolution.
Figure 6. Regular convolution and depthwise separable convolution.

4. Experimental Results and Analysis

This section presents the establishment of a semantic segmentation dataset of intelligent mining shovels for 3D operation scenes. Subsequently, a comparison is conducted between MALMConv and other representative semantic segmentation networks. This demonstrates the segmentation effect of the semantic segmentation network proposed in this paper. Finally, a series of ablation experiments are conducted to validate the memory module, the channel attention module, and the lightening mechanism, among other key technologies.

4.1. Dataset Creation

Relying on the State Key Laboratory of Mining Equipment and Intelligent Manufacturing, this thesis discusses the demand for semantic segmentation of intelligent mining shovels for 3D operation scenes, and reasonably arranges the experimental scene with reference to the real operating environment of mining shovels in the mine site. By modifying the spatial arrangement of objects—such as shovels, people, mining trucks, boxes, and so forth—as well as the composition and topography of the pile of materials and ground surface, and by varying the sampling position and angle, it is possible to expand the range of experimental scenarios and enhance the generalizability of the dataset. Concurrently, to enhance the dataset’s practicality, the frequency of different targets in the laboratory scenario is reasonably adjusted through field research of the real operating conditions in the mine site. This is done to ensure that the dataset can effectively reflect the imbalance of the number of point cloud samples. It highlights the differences in sample numbers for different types of targets in the mine operating scenario.
We conducted point cloud data acquisition at the State Key Laboratory of Mining Equipment and Intelligent Manufacturing in 2022. The equipment used for data collection is our self-developed 3D operating environment sensing system, which is shown in Figure 7.
Figure 7. The 3D operating environment sensing system.
These data are then used to establish a semantic segmentation dataset of the 3D operating scene of intelligent mining shovels through manual annotation. Figure 8 illustrates the typical 3D point cloud data following manual annotation. The dataset comprises semantic targets to be segmented, including eight types of target objects. These include 1:7 scaled intelligent mining shovels prototypes, mining trucks, piles, floors, people, walls, ladders, boxes, and other items. The dataset contains a wealth of operational objects and targets to be segmented. Among the aforementioned targets, five types are of particular importance: mining shovels, mining trucks, stockpiles, ground, and people. These are the most crucial semantic targets to be segmented in the real operating scenarios of mining shovels at the mine site. Consequently, the model training and experimental validation based on this dataset are of great significance. They are crucial for subsequent research on the perception method of intelligent mining shovels in the mine site. Furthermore, in order to ensure the robustness of the algorithm, three types of laboratory scenario targets—such as walls, ladders, and boxes—are also semantically labeled in the dataset as targets to be segmented, in addition to considering the existence of some clutter in the real operating scenarios of mines.
Figure 8. 3D point cloud semantic segmentation data set for the intelligent mining shovels.
We collected 2200 point clouds, of which 2000 point clouds are used as the training set and 200 point clouds are used as the test set. The number of point clouds corresponding to different targets in the practice set is shown in Table 1, in which the number of samples corresponding to targets such as piles, floors, walls, ladders, electric shovels, and mining trucks is relatively high and is the dominant target. The number of samples corresponding to people and boxes is relatively small and are non-dominant targets. The dataset covers the main targets to be sensed in the actual operating scenarios of intelligent mining shovels in the mining site; and the physical attributes are diverse, and the sample imbalance problem is significant, so it can effectively respond to the problems existing in the actual operating scenarios of intelligent mining shovels. The study of semantic segmentation algorithms for 3D operational scenes of intelligent mining shovels based on this dataset is of great significance. It is crucial for the next stage of conducting experimental research in the mining field.
Table 1. Statistics of point cloud samples for different semantic segmentation objects.

4.2. Experimental Setup and Performance Evaluation Index

The programming language used for the experiments in this paper is Python 3.6, the deep learning framework is TensorFlow 1.12, and the GPU is Tesla P100. The training optimizer used for MALMConv is ADAM; with an initial learning rate of 0.001, a momentum of 0.9, a batch size of 8, a decay step of 200,000, and a decay rate of 0.7.
The performance evaluation metrics used in the semantic segmentation experiments in this paper include:
(1)
Single class accuracy (SAC): Denotes the semantic segmentation accuracy for each semantic category. For any semantic category c , SAC is calculated as follows:
SAC = 1 n c i = 1 n c N i c t N i c ,
The notation n c represents the number of point clouds in the dataset that belong to semantic category c . The symbol N i c denotes the ideal value of the number of points corresponding to semantic category c in the ith sub-point cloud. The symbol N i c t represents the number of points in the ith sub-point cloud correctly predicted for semantic category c .
(2)
Average accuracy (AAC): Denotes the mean of semantic segmentation accuracy for all semantic categories. For the eight semantic categories in the dataset of this paper, the AAC is calculated as follows:
AAC = j = 1 8 SAC j 8 ,
where S A C j is the semantic segmentation accuracy for the jth semantic category.
(3)
Number of parameters (Np): this indicates the number of weight parameters in the semantic segmentation network model. It is an important evaluation index to measure the level of model lightness and affect the difficulty of model deployment.

4.3. Semantic Segmentation Accuracy Comparison Experiments and Visualizations

In order to verify the performance of MALMConv, the 3D point cloud semantic segmentation network proposed in this paper, PointNet, PointNet++, DGCNN and PointConv are used as comparison methods. The comparison experiments are conducted to validate the semantic segmentation of intelligent mining shovels. These experiments are based on the semantic segmentation dataset of the 3D operation scene established in this paper. Table 2 shows the single-category semantic segmentation accuracy for eight categories of targets such as piles, floors, mining trucks, people, mining shovels, walls, ladders, and boxes.
Table 2. Semantic segmentation accuracy based on different segmentation methods.
As illustrated in Table 2, the proposed MALMConv exhibits the most optimal overall performance; achieving the highest semantic segmentation accuracy on almost all objectives. The semantic segmentation accuracy of MALMConv for two non-dominant targets—namely, the human and the box—exhibits a marked improvement compared to other methods. The primary rationale for this approach is that the memory module can effectively address the issue of non-dominant target features being forgotten during the training process, thereby enhancing the accuracy of semantic segmentation of 3D point clouds. Concurrently, the attention module—which is based on the statistical characteristics of the feature channels—can adaptively learn the weights of different feature channels. This enables the limited computational resources to be used to filter the feature information that is more critical to the current task. It also improves the adequacy of the feature expression. This, in turn, further improves the accuracy rate of semantic segmentation of 3D point clouds.
PointNet exhibits the lowest level of semantic segmentation accuracy among the four methods under comparison. This is primarily attributable to PointNet not considering the interrelationships between neighboring points. Additionally, it does not extract and utilize the local neighborhood features of the point cloud. This ultimately affects the accuracy of the semantic segmentation of the point cloud in a complex scene. The semantic segmentation accuracy of PointConv is higher than that of PointNet, PointNet++, and DGCNN. This is primarily due to the effective implementation of convolutional operations on 3D point clouds by PointConv, which enables the extraction and exploitation of local neighborhood features of the point cloud, thus improving the semantic segmentation accuracy of the point cloud. However, compared with the MALMConv proposed in this paper, PointConv is more susceptible to the imbalanced number of samples. Furthermore, it is relatively straightforward to overlook the point cloud features of the non-dominant targets during the training process, which subsequently reduces the semantic segmentation accuracy rate of the non-dominant targets. Additionally, PointConv is unable to model the relationship between the different feature channels explicitly, and it is also unable to differentiate the importance of the different feature channels, which ultimately leads to a reduction in the semantic segmentation accuracy.
Figure 9 compares the semantic segmentation accuracies of two non-dominant targets based on the five segmentation methods. It can be observed that the MALMConv network proposed in this paper is efficacious in improving the accuracy of semantic segmentation of non-dominant targets. This is in comparison to PointNet, PointNet++, DGCNN, and PointConv. For the non-dominant target of people, MALMConv improved semantic segmentation accuracy by 36.5% compared to PointNet, 27.9% compared to PointNet++, 26.1% compared to DGCNN, and 14.1% compared to PointConv. For the non-dominant goal of boxes, MALMConv improved semantic segmentation accuracy by 37.2% compared to PointNet, 27.7% compared to PointNet++, 23.1% compared to DGCNN, and 7.3% compared to PointConv. In conclusion, it can be demonstrated that MALMConv is an effective method for alleviating the forgetting problem of non-dominant target point cloud features during the training process. This results in an improvement in the accuracy of 3D point cloud semantic segmentation.
Figure 9. Comparison of semantic segmentation accuracy of non-dominant objects.
Figure 10 presents a comparison of the average accuracy of semantic segmentation across the five segmentation methods on all the point cloud samples in the test set. It can be observed that the MALMConv proposed in this paper exhibits a notable improvement in terms of average accuracy of semantic segmentation. This is in comparison to PointNet, PointNet++, DGCNN, and PointConv. Among the evaluated methods, MALMConv demonstrated superior performance in semantic segmentation, with an average accuracy improvement of 11.6% compared to PointNet, 7.6% compared to PointNet++, 6.6% compared to DGCNN, and 2.8% compared to PointConv. In summary, it can be proved that MALMConv can achieve high semantic segmentation accuracy for different sample numbers and different kinds of 3D point cloud targets. This can effectively improve intelligent mining shovels’ perception and recognition ability for complex operation scenes.
Figure 10. Comparison of average accuracy for semantic segmentation.
To more intuitively demonstrate the semantic segmentation effect of the MALMConv proposed in this paper, two typical 3D point cloud data of intelligent mining shovels operation scenes are selected. These are from the semantic segmentation test set established based on the State Key Laboratory of Mining Equipment and Intelligent Manufacturing for the visual demonstration. These data are shown in Figure 10. Among them, Figure 11a shows the true value map, and Figure 11b shows the semantic segmentation map based on MALMConv. The figure shows that MALMConv can accurately segment the semantic targets in the 3D operation scene of intelligent mining shovels, and the 3D point cloud semantic segmentation effect is very close to the actual value graph. Where the black box shows the segmentation inaccuracies.
Figure 11. Visual display of the semantic segmentation results with MALMConv.

4.4. Ablation Study

This section presents ablation studies conducted to validate the roles of memory, channel attention, and lightweight modules in the semantic segmentation task of intelligent mining shovels for 3D operational scenes. To achieve this objective, PointConv is employed as the fundamental network framework, adding a memory module for enhancement, and resulting in the generation of MMConv (PointConv + Memory Module, MMConv). Conversely, incorporating a memory module and attention module leads to MAMConv (PointConv + Memory Module + Attention Module, MAMConv). In contrast, the combination of a memory module, attention module, and lightweight module gives rise to MALMConv.
Table 3 demonstrates the single-category semantic segmentation accuracies and average accuracies of PointConv, MMConv, MAMConv, and MALMConv on eight categories of objects such as piles, floors, mining trucks, people, mining shovels, walls, ladders, and boxes. Table 4 shows the number of parameters for PointConv, MMConv, MAMConv, and MALMConv.
Table 3. Ablation studies 1.
Table 4. Ablation studies 2.
Table 3 shows that, after introducing the memory module alone, the semantic segmentation accuracy of MMConv is improved to varying degrees in all target categories relative to PointConv. Especially for non-dominant targets, MMConv’s improvement in semantic segmentation accuracy is significant. For the non-dominant target of people, the semantic segmentation accuracy of MMConv is improved by 15.7% compared to PointConv; for the non-dominant target of boxes, the semantic segmentation accuracy of MMConv is improved by 7.2% compared to PointConv. It can be proved through experiments that the memory module can effectively alleviate the forgetting problem of non-dominant target point cloud features in the training process, and improve the accuracy of 3D point cloud semantic segmentation of non-dominant targets in intelligent mining shovels operation scenarios.
With the introduction of both the memory module and the channel attention module, the semantic segmentation accuracy of MAMConv is further improved to varying degrees in terms of overall object categories relative to MMConv. Among them, the most obvious enhancement effect is the target object of boxes, where the semantic segmentation accuracy of MAMConv is improved by 12.7% compared with PointConv, and the semantic segmentation accuracy of MAMConv is improved by 5.5% compared with MMConv. It can be demonstrated through experiments that the channel attention module can learn the weights of different feature channels adaptively. It improves the adequacy of feature expression, further enhancing the accuracy of 3D point cloud semantic segmentation for each sample target in the intelligent mining shovels operation scene.
A comprehensive look at Table 3 and Table 4 shows that with the simultaneous introduction of the memory module, the channel attention module, and the lightweight mechanism, the comprehensive performance of MALMConv achieves an optimal balance in terms of the semantic segmentation accuracy, the number of model parameters, and the running time. MALMConv gets a significant reduction in both the number of parameters and the running time compared with other networks. The experiment proves that using a lightweight mechanism can effectively improve the lightweight level of the model and its ability to be deployed in practical applications. Regarding semantic segmentation accuracy, MALMConv is significantly improved compared to PointConv and similar to MMConv.
The three key technologies—the memory module, the channel attention module, and the lightweight mechanism—can be used individually or in a free combination. In the practical application process, appropriate choices can be made according to the needs of actual working conditions. In this case, the highest semantic segmentation accuracy of the model is achieved when both memory and channel attention modules are used. Still, the number of model parameters and computations increased. This combination of modules can be preferred when the computational resources of the on-board industrial controller allow it. This can maximize the accuracy of 3D point cloud semantic segmentation. With the simultaneous use of a memory module, a channel attention module, and a lightweight mechanism, the model achieves an optimal balance between semantic segmentation accuracy and lightweight. In the limited computational resources of on-board industrial controllers, this combination of modules can be preferred to improve the underlying network model to achieve a higher semantic segmentation accuracy under limited computational resources. In addition, these key techniques have a wide range of applicability. They can be used to improve a wide range of representative 3D point cloud semantic segmentation network frameworks such as PointNet, PointNet++, DGCNN, PointConv, and so on.

5. Conclusions

In this paper, we study the semantic segmentation method of 3D operation scenes of intelligent mining shovels. This is based on memory enhancement and a lightweight attention mechanism. The main contributions are summarized as follows:
(1)
Aiming at the problem of low semantic segmentation accuracy of small sample targets, we propose a memory enhancement learning mechanism to fundamentally alleviate the problem of forgetting the point cloud features of non-dominant targets during training;
(2)
Aiming at the problem of low accuracy of semantic segmentation due to the interference of key features by redundant features, we construct an attention module based on channel statistical feature analysis, which adaptively adjusts feature weights during training to improve the adequacy of key feature expression;
(3)
To improve the model lightweight level and engineering deployment ability, we use depth-separable convolution instead of traditional convolution, reducing the number of parameters of 3D point cloud semantic segmentation network model;
(4)
A semantic segmentation network for intelligent mining shovels in 3D operation scenes has been constructed based on the memory module, channel attention module, and lightweight mechanism. The model has been trained and tested, and the results demonstrate that it can effectively improve the semantic segmentation accuracy and engineering deployment capability of intelligent mining shovels in 3D operation scenes. The average improvement in semantic segmentation accuracy is 7.15%, contributing to enhanced autonomous operations and the safety of intelligent mining shovels.

Author Contributions

Methodology, Y.C.; writing—original draft, Y.C., Z.Z. (Zhihui Zhang) and Z.Z. (Zhidan Zhong); conceptualization and software, Y.A., F.Y., J.W. and K.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grants 62173055, in part by the Major Science and Technology Project of Henan Province under Grant 231111222900, in part by the Joint Fund of Science and Technology Research and Development Plan of Henan Province under Grants 232103810038, and 222102220044, in part by The Tribology Science Fund of State Key Laboratory of Tribology in Advanced Equipment (SKLTKF22B12), in part by the Key Research Projects of Higher Education Institutions of Henan Province under Grant 24A460009 and 20A460012, in part by the Key Technology Research on Heavy Duty Mobile Robot (AGV) for Intelligent Mineral Processing Line 222102220044, in part by the Natural Science Foundation Program of Liaoning Province under Grant 2023-MS-093, and in part by the Science and Technology Major Project of Shanxi Province under Grant 20191101014.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, Y. Intelligent development status and trend of large mining shovels at home and abroad. Mech. Manag. Dev. 2021, 36, 283–285. [Google Scholar] [CrossRef]
  2. Feng, Y.; Wu, J.; Lin, B.; Guo, C. Excavating Trajectory Planning of a Mining Rope Shovel Based on Material Surface Perception. Sensors 2023, 23, 6653. [Google Scholar] [CrossRef] [PubMed]
  3. Agrawal, A.; Nakazawa, A.; Takemura, H. MMM-classification of 3D Range Data. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation, Kobe, Japan, 12–17 May 2009; IEEE: New York, NY, USA, 2009; pp. 2003–2008. [Google Scholar]
  4. Zhu, J.; Pierskalla, W.P., Jr. Applying a weighted random forests method to extract karst sinkholes from LiDAR data. J. Hydrol. 2016, 533, 343–352. [Google Scholar] [CrossRef]
  5. Lai, X.; Yuan, Y.; Li, Y.; Wang, M. Full-waveform LiDAR point clouds classification based on wavelet support vector machine and ensemble learning. Sensors 2019, 19, 3191. [Google Scholar] [CrossRef]
  6. Karsli, F.; Dihkan, M.; Acar, H.; Ozturk, A. Automatic building extraction from very high-resolution image and LiDAR data with SVM algorithm. Arab. J. Geosci. 2016, 9, 635. [Google Scholar] [CrossRef]
  7. Niemeyer, J.; Rottensteiner, F.; Soergel, U. Contextual classification of lidar data and building object detection in urban areas. ISPRS J. Photogramm. Remote Sens. 2014, 87, 152–165. [Google Scholar] [CrossRef]
  8. Golovinskiy, A.; Kim, V.G.; Funkhouser, T. Shape-based recognition of 3D point clouds in urban environments. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; IEEE: New York, NY, USA, 2009; pp. 2154–2161. [Google Scholar]
  9. Yao, W.; Wei, Y. Detection of 3-D individual trees in urban areas by combining airborne LiDAR data and imagery. IEEE Geosci. Remote Sens. Lett. 2013, 10, 1355–1359. [Google Scholar] [CrossRef]
  10. Zhao, H.; Liu, Y.; Zhu, X.; Zhao, Y.; Zha, H. Scene understanding in a large dynamic environment through a laser-based sensing. In Proceedings of the 2010 IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–7 May 2010; IEEE: New York, NY, USA, 2010; pp. 127–133. [Google Scholar]
  11. Wang, H.; Wang, C.; Luo, H.; Li, P.; Cheng, M.; Wen, C.; Li, J. Object detection in terrestrial laser scanning point clouds based on Hough forest. IEEE Geosci. Remote Sens. Lett. 2014, 11, 1807–1811. [Google Scholar] [CrossRef]
  12. Wang, Z.; Zhang, L.; Fang, T.; Mathiopoulos, P.T.; Tong, X.; Qu, H.; Xiao, Z.; Li, F.; Chen, D. A multiscale and hierarchical feature extraction method for terrestrial laser scanning point cloud classification. IEEE Trans. Geosci. Remote Sens. 2014, 53, 2409–2425. [Google Scholar] [CrossRef]
  13. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  14. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  15. Jiang, M.; Wu, Y.; Zhao, T.; Zhao, Z.; Lu, C. Pointsift: A sift-like network module for 3d point cloud semantic segmentation. arXiv 2018, arXiv:1807.00652. [Google Scholar]
  16. Zhao, H.; Jiang, L.; Fu, C.W.; Jia, J. Pointweb: Enhancing local neighborhood features for point cloud processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5565–5573. [Google Scholar]
  17. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. 2019, 38, 1–12. [Google Scholar] [CrossRef]
  18. Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. Pointcnn: Convolution on x-transformed points. In Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, Montréal, Canada, QC, 3–8 December 2018; Volume 31. [Google Scholar]
  19. Wu, W.; Qi, Z.; Fuxin, L. Pointconv: Deep convolutional networks on 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LO, USA, 18–24 June 2022; pp. 9621–9630. [Google Scholar]
  20. Moreno-Barea, F.J.; Jerez, J.M.; Franco, L. Improving classification accuracy using data augmentation on small data sets. Expert Syst. Appl. 2020, 161, 113696. [Google Scholar] [CrossRef]
  21. Leng, B.; Yu, K.; Jingyan, Q. Data augmentation for unbalanced face recognition training sets. Neurocomputing 2017, 235, 10–14. [Google Scholar] [CrossRef]
  22. Ma, X.; Deng, X.; Qi, L.; Jiang, Y.; Li, H.; Wang, Y.; Xing, X. Fully convolutional network for rice seedling and weed image segmentation at the seedling stage in paddy fields. PLoS ONE 2019, 14, e0215676. [Google Scholar] [CrossRef] [PubMed]
  23. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  24. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  25. Jang, Y.; Jeong, I.; Younesi Heravi, M.; Sarkar, S.; Shin, H.; Ahn, Y. Multi-Camera-Based Human Activity Recognition for Human–Robot Collaboration in Construction. Sensors 2023, 23, 6997. [Google Scholar] [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Article Metrics

Citations

Article Access Statistics

Multiple requests from the same IP address are counted as one view.