VB-Net: Voxel-Based Broad Learning Network for 3D Object Classification

Abstract: Point clouds have been widely used in three-dimensional (3D) object classification tasks, e.g., people recognition in unmanned ground vehicles. However, the irregular data format of point clouds and the large number of parameters in deep learning networks affect the performance of object classification. This paper develops a 3D object classification system, called VB-Net, that uses a broad learning system (BLS) with a feature extractor. First, raw point clouds are voxelized. Through this step, irregular point clouds are converted into regular voxels, which are easily processed by the feature extractor. Then, a pretrained VoxNet is employed as a feature extractor to extract features from the voxels. Finally, those features are used for object classification by the BLS. The proposed system was tested on the ModelNet40 and ModelNet10 datasets. The average recognition accuracy was 83.99% and 90.08%, respectively. Compared to deep learning networks, the time consumption of the proposed system is significantly decreased.


Introduction
In recent years, point clouds have been researched in various fields, such as autonomous robot systems [1], three-dimensional (3D) face recognition [2], intelligent surveillance [3], and 3D modeling [4]. Unlike images captured by cameras, point clouds are not likely to be influenced by lighting and color, and they maintain a high resolution. However, point clouds are hard to process directly with most existing neural network structures because of their irregularity.
To solve this problem, researchers developed various methods to extract features from raw point clouds for existing deep learning structures, such as spin image (SI) [5], clustered viewpoint feature histogram [6], and view feature histogram [7]. As deep learning is gaining success in many fields, researchers also proposed or employed several deep learning structures for different tasks [8][9][10], e.g., 3D model retrieval. However, most deep learning structures contain a large number of parameters, which increases the time consumption of the whole system. It is also inconvenient to add new categories into a trained deep learning network or change the structure of a deep learning network, because the whole network then has to be retrained. These problems make it difficult to apply deep learning methods in fields requiring real-time and massive data processing.
Recently, the broad learning system (BLS) was developed and applied for image classification [11]. Compared to deep learning structures, the time consumption of BLS is significantly lower, and there is no need to retrain the whole network when the network structure changes. However, it is difficult for BLS to process irregular point cloud data directly. Thus, point cloud data must be preprocessed before BLS can be employed in 3D object classification tasks. This paper proposes a 3D object classification system using BLS with a pretrained feature extractor for point cloud data. The pretrained feature extractor is intended to convert irregular point cloud data into regular features that BLS can process. Another goal of the feature extractor is to find more suitable descriptions of the original objects, so as to improve classification accuracy. The proposed system was tested on the ModelNet40 dataset and achieved a recognition accuracy of 83.99%, while the time consumption for training and testing was about 30 s in total, which is far shorter than that of deep learning networks. Experimental results imply that the proposed method is applicable to real-time processing of massive data, e.g., environment perception.
The paper is organized as follows: Section 2 surveys related work on 3D object classification; Section 3 introduces the proposed system; Section 4 describes the experiments and discusses the results; Section 5 gives the conclusions and suggestions for future research.

Related Works
Researchers employ many methods in 3D object recognition, e.g., machine learning-based methods, graph-based methods, and feature-based methods. Feature-based methods are usually employed when using point cloud data for classification tasks.
Generally, there were two kinds of methods mainly used for feature-based object recognition: the global feature-based method and the local feature-based method. When adopting global feature-based methods, a segmentation step is required before classification. Drost et al. [12] developed a global descriptor which consisted of point pair features. By grouping similar point pairs, a voting scheme could work on the local space for the classification job. Their methods performed well under a noise situation. Chen et al. developed the global Fourier histogram (GFH) descriptor [13] that used cylindrical angular coordinates. They defined the cylindrical support region in [13] for calculating the GFH by counting the points in each region. With the help of a global reference frame [14] and Fourier transform, GFH also became rotation-invariant. Although global features performed well in many classification tasks, they failed to distinguish objects with similar shapes [13,15]. Thus, researchers introduced some local features to address this and other problems with global features. Guo et al. proposed Tri-Spin-Image (TriSI) [16] and overcame the weak point of the traditional SI descriptor. Employing principal component analysis and a local reference frame, TriSI was robust to noise and mesh resolution. Yang et al. introduced a descriptor named local feature statistics histogram (LFSH) [17], which exploited histograms by counting local features around the chosen points using one-dimensional (1D) arrays. LFSH worked faster than other local feature descriptors, e.g., SI, yet employing 1D histograms led to some information loss.
Numerous deep learning networks were employed for point cloud object classification in recent years. Although some architectures consume raw point cloud data directly [18,19], many deep learning networks still require various preprocessing of point clouds. The Multiview 3D network proposed by Chen et al. [20] converted 3D point clouds into 3D bounding boxes for multiview feature fusion. By combining these multiview features, their network enabled fast classification of objects. Klokov et al. proposed Kd-network [21], which did not require rasterization of point clouds before inputting them into the network. Exploiting the indexing structure called Kd-tree, Kd-network reduced memory usage and time consumption in the training and testing process. Esteves et al. introduced the spherical convolutional neural network [22], which converted 3D models into spherical representations and took these spherical representations as the inputs of the network. The spherical representations reduced the input size and made the model rotation-invariant at the same time.
As researchers developed voxel representation for point clouds, they employed convolutional neural networks (CNNs) in 3D object classification tasks. Zhou et al. developed VoxelNet [23] that converted a point cloud into even voxels to encode features. The application of VoxelNet made it possible to employ a region proposal network for object classification in a point cloud. Li et al. proposed field-probing neural networks (FPNN) [24] that employed field-probing filters to extract features from voxels. The field-probing filters were adaptively distributed in 3D space, and their shape was changed to perceive 3D space more effectively. Compared to traditional 3D CNNs, FPNN performed well in 3D object classification tasks with low time consumption. Wang et al. introduced NormalNet [25], which combined normal vectors of object surfaces and voxels of the 3D object for classification, because normal vectors of object surfaces had stronger capability in classification than voxels. To minimize parameters and extract features for 3D vision tasks, they also employed reflection-convolution-concatenation for convolutional layers. However, a limitation of deep learning networks was that the computational cost increased as the network became deeper. Introducing time-consuming 3D convolution made it worse. Thus, many researchers applied graphics processing units (GPUs) to accelerate deep learning architectures.
In recent years, Chen et al. introduced a new neural network structure called the broad learning system [11]. Unlike deep learning networks, BLS did incremental training without retraining the whole network, which meant that changing the structure of BLS was easier than deep learning networks. Introducing the pseudo inverse matrix of the weight matrix, the updating process of BLS was also faster than that of deep learning networks. Thus, BLS significantly reduced the time consumption in both training and testing processes, and it was more flexible than most deep learning networks. This is the reason why this paper applied BLS for object classification.

VB-Net
This section provides a detailed description of the proposed VB-Net, including the BLS training algorithm, the pretraining of our feature extractor, and the calculation of the final output.

System Overview
The structure of the proposed VB-Net is shown in Figure 1. In this system, raw point clouds are voxelized into d × d × d voxels, which are then input into the pretrained feature extractor. The extracted features are mapped to the nodes of the feature layer (Z_1, Z_2, …, Z_n) of the BLS using mapping matrices W_i, i = 1, 2, …, n, and the nodes of the enhancement layer (H_1, H_2, …, H_m) are calculated from the feature layer. The nodes in the output layer (o_1, o_2, …, o_k) are calculated from Z_1, …, Z_n and H_1, …, H_m. This paper denotes the transformation matrices between the feature layer and the enhancement layer as W_j, j = 1, 2, …, m, and the transformation matrix that connects the feature layer and enhancement layer to the output layer as W*. This paper also defines the feature layer Z as Z = [Z_1, Z_2, …, Z_n] and, analogously, the enhancement layer H as H = [H_1, H_2, …, H_m]. Variables n, m, and k represent the number of nodes in the feature layer, enhancement layer, and output layer, respectively.

VoxNet Feature Extractor
The first step of the proposed system is converting raw point clouds into d × d × d voxels. Because the BLS cannot process voxel data directly, we employ a feature extractor that is similar to VoxNet [26], as shown in Figure 1. Conv3d (f, e, s) represents a 3D convolution with f filters, kernel size e, and stride s. Pooling (p) denotes a max pooling layer with pool size p. The activation function for Conv3d is ReLU. The outputs of the flatten layer are the features used for the object classification task.
The feature extractor also needs a pretraining process to extract more suitable features from input voxel data. In order to obtain better training results, we only keep the last fully connected layer in the original VoxNet [26], as shown in Figure 2. Variable c is equal to the number of categories in the dataset. Consequently, the feature extractor is trained as a deep learning network.
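The voxelization step can be illustrated with a minimal sketch. It assumes a binary occupancy grid and a uniform scaling of the object's bounding box into the grid; the paper does not specify these details, and the function name `voxelize` is hypothetical.

```python
import numpy as np


def voxelize(points: np.ndarray, d: int = 32) -> np.ndarray:
    """Convert an (N, 3) point cloud into a d x d x d binary occupancy grid.

    Assumption: a voxel is set to 1 if at least one point falls inside it.
    """
    points = np.asarray(points, dtype=np.float64)
    mins = points.min(axis=0)
    # Use a single scale for all axes so the object's aspect ratio is preserved.
    extent = float((points.max(axis=0) - mins).max())
    if extent == 0.0:
        extent = 1.0  # degenerate cloud: all points coincide
    # Map coordinates to [0, d]; points on the upper boundary are clipped
    # into the last cell.
    idx = np.floor((points - mins) / extent * d).astype(np.int64)
    idx = np.clip(idx, 0, d - 1)
    grid = np.zeros((d, d, d), dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid


if __name__ == "__main__":
    cloud = np.random.default_rng(0).uniform(-1.0, 1.0, size=(2048, 3))
    grid = voxelize(cloud, d=20)
    print(grid.shape, int(grid.sum()), "occupied voxels")
```

Each resulting grid would be passed to the feature extractor as a single-channel 3D volume.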

Training of BLS
The features obtained by extractor F in Section 3.1 are mapped to the feature layer Z using Equation (1), where W_i and β_i consist of random numbers. To generate the nodes in the enhancement layer, Equation (2) is employed, where W_j is a predefined matrix filled with random numbers following the standard normal distribution and β_j is a randomly generated bias vector.
The nodes in the output layer, o_i ∈ O, i ∈ [1, k], are calculated using the following equation from [11], where matrix W* is the target of the training process:

This paper defines a new matrix M as M = [Z | H]; the output matrix O is then calculated as MW*. Therefore, the goal of training is to narrow the gap between MW* and the values in the output matrix O. The ridge regression approximation algorithm is applied to adjust the values in matrix W*, and ℓ2-norm regularization is adopted to improve the generalization performance of the BLS, as shown in Equation (4).
The optimal solution W*_0 is calculated using W*_0 = M^+ O, where M^+ is the pseudo-inverse of M. Considering that it is hard to compute M^+ directly, an approximation method is adopted here, as shown in Equation (5), where matrix I is an identity matrix and λ ∈ R.
As the output matrix O was already computed in Equation (3), only the matrix M^+ needs to be calculated. As mentioned in [11], the approximate solution is close to the original pseudo-inverse matrix as λ tends to 0. Thus, Equation (6) is employed to calculate matrix M^+.
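The numbered equations referred to in this subsection are not reproduced in the text. As a reading aid only, the standard BLS formulation in [11] suggests forms along the following lines. The symbol X (the matrix of extracted features) and the activation functions φ and ξ are assumptions introduced here, and the assignment of equation numbers is a best guess rather than the authors' exact layout.

```latex
\begin{align}
Z_i &= \phi\left(X W_i + \beta_i\right), & i &= 1, \dots, n \tag{1} \\
H_j &= \xi\left(Z W_j + \beta_j\right), & j &= 1, \dots, m \tag{2} \\
O &= [Z \mid H]\, W^{*} = M W^{*} \tag{3} \\
W^{*}_{0} &= \arg\min_{W^{*}} \left\lVert M W^{*} - O \right\rVert_2^2
             + \lambda \left\lVert W^{*} \right\rVert_2^2 \tag{4} \\
W^{*}_{0} &= \left(\lambda I + M^{\top} M\right)^{-1} M^{\top} O \tag{5} \\
M^{+} &= \lim_{\lambda \to 0} \left(\lambda I + M^{\top} M\right)^{-1} M^{\top} \tag{6}
\end{align}
```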
This completes the training of the proposed BLS with the feature extractor. If the classification accuracy on the test set does not reach the required level, the number of nodes in the enhancement layer and/or feature layer can be increased or decreased to improve the classification accuracy of the model.
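The training procedure described in this subsection can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: it assumes a linear feature mapping, a tanh enhancement activation, and a single group of feature nodes, and the names `train_bls` and `predict_bls` and all default values are hypothetical.

```python
import numpy as np


def train_bls(X, Y, n_feature=20, m_enhance=40, lam=1e-3, seed=0):
    """Fit a simplified BLS on features X (N, D) and one-hot labels Y (N, k)."""
    rng = np.random.default_rng(seed)
    # Feature layer: random mapping of the extracted features (linear map assumed).
    W_f = rng.standard_normal((X.shape[1], n_feature))
    b_f = rng.standard_normal(n_feature)
    Z = X @ W_f + b_f
    # Enhancement layer: random weights from the standard normal distribution,
    # followed by a tanh nonlinearity (the activation is an assumption).
    W_e = rng.standard_normal((n_feature, m_enhance))
    b_e = rng.standard_normal(m_enhance)
    H = np.tanh(Z @ W_e + b_e)
    # M = [Z | H]; only the output weights are learned.
    M = np.hstack([Z, H])
    # Ridge-regression solution: W* = (lam * I + M^T M)^{-1} M^T Y.
    A = lam * np.eye(M.shape[1]) + M.T @ M
    W_out = np.linalg.solve(A, M.T @ Y)
    return {"W_f": W_f, "b_f": b_f, "W_e": W_e, "b_e": b_e, "W_out": W_out}


def predict_bls(model, X):
    """Return the output-layer scores, one row per sample."""
    Z = X @ model["W_f"] + model["b_f"]
    H = np.tanh(Z @ model["W_e"] + model["b_e"])
    return np.hstack([Z, H]) @ model["W_out"]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy stand-in for extracted features: two well-separated classes.
    X = np.vstack([rng.normal(-3, 1, (100, 5)), rng.normal(3, 1, (100, 5))])
    labels = np.array([0] * 100 + [1] * 100)
    model = train_bls(X, np.eye(2)[labels])
    accuracy = (predict_bls(model, X).argmax(axis=1) == labels).mean()
    print("training accuracy:", accuracy)
```

Because only the output weights are solved for, changing `n_feature` or `m_enhance` requires no gradient-based retraining, which is the property the tuning step above relies on.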

Experiments and Analysis
The proposed VB-Net was tested on the ModelNet40 dataset [27], which contains 12,311 objects from 40 categories, and its orientation-aligned subset ModelNet10 [27]. Samples of point cloud objects in the dataset and their voxel representations are presented in Figure 3. Experiments on ModelNet40 were run on a computer using the Windows 10 operating system, with an Intel® Xeon® (Intel, Santa Clara, CA, USA) Silver 4110 central processing unit (CPU) @ 2.10 GHz and 16 GB of random-access memory (RAM). Experiments on ModelNet10 were run on a computer using the Windows 10 operating system, with an Intel® Core™ (Intel, Santa Clara, CA, USA) i7-4720HQ CPU @ 2.59 GHz and 8 GB of RAM.

Performance of VB-Net
The performance of the BLS is affected by the number of enhancement nodes and the number of feature nodes [11]. Therefore, we designed two sets of experiments to test our system. First, we fixed the number of feature nodes at 900 and increased the number of enhancement nodes. We added 26 × 26 × 26 and 38 × 38 × 38 voxel resolutions to observe the performance of VB-Net under different resolutions. Experimental results on the ModelNet40 dataset are shown in Table 1. As shown in Table 1, the average accuracy initially increased with the number of enhancement nodes. The accuracy reached its maximum with 2500 enhancement nodes under 20 × 20 × 20 resolution (82.61%) and 38 × 38 × 38 resolution (80.75%). Under 26 × 26 × 26 resolution, the accuracy reached its maximum with 1500 enhancement nodes (80.75%), and under 32 × 32 × 32 resolution, with 3000 enhancement nodes (82.41%). A further increase in the number of enhancement nodes did not improve accuracy; on the contrary, the classification accuracy then decreased as more enhancement nodes were added.
Another set of experiments involved increasing feature nodes in VB-Net. According to Table 1, VB-Net showed good performance with 3000 enhancement nodes in the enhancement layer at both resolutions. Hence, we set the number of enhancement nodes to 3000 and kept increasing the feature nodes in VB-Net. The results of these experiments on the ModelNet40 dataset are shown in Table 2.
The highest accuracy reached 82.41% when 800 feature nodes were used in the proposed network under 20 × 20 × 20 resolution or 80.59% with 400 feature nodes under 26 × 26 × 26 resolution. When 100 feature nodes were used in VB-Net, the highest accuracy reached 83.99% under 32 × 32 × 32 resolution. For 38 × 38 × 38 resolution, the highest accuracy was 80.99% with 500 feature nodes in the system. Unlike the first set of experiments, classification accuracy under 32 × 32 × 32 resolution decreased directly. From the data in Tables 1 and 2, it is evident that the performance of VB-Net improved as more feature nodes or enhancement nodes were utilized in the calculation of final results. When the number of nodes exceeded the peak amount at which performance was best, the performance of VB-Net became worse due to the interference of too much data. The appropriate number varied with voxel resolutions and strategies.
The time consumption of the training and test processes for the two strategies described above was also measured under 20 × 20 × 20 voxel resolution. Although the time consumption increased with the number of nodes in both cases, it increased more quickly when adding feature nodes than when adding enhancement nodes, as shown in Figure 4a,b. Notably, the feature-node strategy used fewer nodes in total, so the total number of nodes alone does not determine the time consumption. We tested a network with 5900 nodes (900 feature nodes and 5000 enhancement nodes), for which the total time consumption was 38.86 s, far shorter than that of the strategy with increasing feature nodes. Increasing the number of feature nodes in the BLS increases the time spent on feature mapping, enhancement node generation, and calculation of the output, whereas increasing only the number of enhancement nodes increases the time spent on the last two steps alone. This behavior is illustrated in Figure 4.
To evaluate the performance of the proposed system in object classification tasks, we compared our classification results with those of several existing deep learning networks on the same datasets, as shown in Table 3. Although VoxNet [26] performed better than our method on the ModelNet10 dataset, our method ran faster. We measured the time consumption of VB-Net and VoxNet [26] on ModelNet10, with 30 training epochs for both VoxNet and the feature extractor. Our method took only 39.13 s, while VoxNet took 3573.78 s, i.e., roughly 91 times longer.

Table 3. Classification accuracy of existing methods on the ModelNet40 and ModelNet10 datasets.
| Method | Input | ModelNet40 | ModelNet10 |
|---|---|---|---|
| 3D ShapeNets [27] | Voxel | 77% | 83.5% |
| VoxNet [26] | Voxel | 83% | 92% |
| DeepPano [28] | Image | 77.63% | 85.45% |
| Geometry Image [29] | Image | 83.9% | 88.4% |
| Soltani et al. [30] | Image | 82.10% | - |
| ECC [31] | Graph | 83.2% | 90.0% |
| PointNet [32] | Point | - | 77.6% |

Performance of Feature Extractor
To evaluate the effect of our feature extractor, we conducted an ablation experiment on the ModelNet10 dataset with 32 × 32 × 32 voxel resolution, as shown in Table 4. We set 450 feature nodes and 1100 enhancement nodes in BLS and VB-Net. The feature extractor was trained for 30 epochs. FE refers to the VoxNet feature extractor and OFE refers to the original VoxNet [26]. The "time" column in Table 4 records the total time consumption (training time plus test time) of one model. Comparing models A, C, and D, the feature extractor had a great influence on total time consumption. BLS with a feature extractor ran much faster than the original BLS, and the classification accuracy also improved by about 3%. In terms of average accuracy and total time consumption, BLS performed better than the deep learning network. Although model E reached higher classification accuracy than model B, model C still achieved higher accuracy than model D. This indicates that our feature extractor is more efficient than VoxNet [26] when combined with BLS.
We conducted another ablation experiment to examine whether our feature extractor could be made more efficient. Experiments were done on the ModelNet10 dataset under 20 × 20 × 20 voxel resolution using feature extractors trained for 120 epochs. VB-Net was configured with 420 feature nodes and 900 enhancement nodes. Results are presented in Table 5.
Model F gained higher accuracy than our method (model I) because more features were input into BLS in model F, but our method ran faster than F. When comparing models F, G, and H, the max pooling layer had an influence on the performance of the feature extractor. Without the max pooling layer, the classification accuracy of VB-Net dropped about 2%. Performance on models G and F proved that the first 3D convolutional layer not only reduced the input features of BLS, but also improved the accuracy. The second convolutional layer helped VB-Net run more efficiently by reducing the input features of BLS.

Appl. Sci. 2020, 10

Noise Resistance Test
To examine the noise resistance of our model, we added Gaussian noise to the original point clouds and then generated the voxels. Figure 5 shows examples from the ModelNet40 and ModelNet10 datasets under Gaussian noise at 20 × 20 × 20 and 32 × 32 × 32 voxel resolution, together with the original voxels. We employed our method, VoxNet [26], and model D mentioned in Section 4.2 in the experiments. Experimental results are shown in Table 6. VoxNet [26], the feature extractor in model D, and VB-Net were trained for 120 epochs on the training set (accuracies of VoxNet [26] on the ModelNet40 and ModelNet10 datasets under 32 × 32 × 32 voxel resolution were cited from [26] directly).
VoxNet [26] was the most affected by Gaussian noise: its accuracy dropped by half on ModelNet10, with even worse performance on ModelNet40. The accuracies of the methods with BLS decreased far less than that of VoxNet [26], and some even increased slightly. Compared with the original BLS and the BLS employing VoxNet [26] as the feature extractor, our method also performed better. We carried out more experiments with different scales of Gaussian noise to test the robustness of our system, as shown in Figure 6.
The accuracy of VoxNet [26] decreased as the scale of noise increased, while the accuracies of the methods with BLS remained stable at about 90%. We also noticed that, in some experiments, the accuracy on data with noise was higher than that on data without noise. This happened in all methods with BLS. We think this phenomenon was mainly due to the settings of the BLS. According to Section 4.1, the performance of BLS is influenced by the number of nodes in the feature layer and enhancement layer. The likely reason for the improvement in accuracy is that the experiments did not use the best combination of enhancement nodes and feature nodes. Considering that there is no theory regarding the relationship among accuracy, the number of nodes in BLS, and voxel resolution, and that the number of possible combinations is large, we consider this phenomenon acceptable.
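The perturbation used in this test can be sketched as follows. This is a minimal illustration assuming zero-mean, isotropic Gaussian noise added independently to each coordinate; the paper does not state the noise scale, so `sigma` is a hypothetical parameter, as is the function name `add_gaussian_noise`.

```python
import numpy as np


def add_gaussian_noise(points, sigma=0.01, seed=None):
    """Return a copy of an (N, 3) point cloud with Gaussian noise added.

    Assumption: the noise is zero-mean with standard deviation `sigma`
    on every coordinate, in the same units as the point coordinates.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=np.float64)
    return points + rng.normal(0.0, sigma, size=points.shape)


if __name__ == "__main__":
    cloud = np.random.default_rng(0).uniform(-1.0, 1.0, size=(1024, 3))
    for sigma in (0.01, 0.05, 0.1):
        noisy = add_gaussian_noise(cloud, sigma=sigma, seed=0)
        print(sigma, float(np.abs(noisy - cloud).mean()))
```

The noisy point cloud would then be voxelized in the same way as the clean data before being passed to the classifier.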

Conclusions
This paper developed a 3D object classification system using VB-Net for point cloud data. The proposed system used voxels to represent point clouds, a pretrained VoxNet as the feature extractor, and the BLS for classification. The proposed system was tested on the ModelNet40 and ModelNet10 datasets. The average accuracies were as follows: 83.99% under a voxel resolution of 32 × 32 × 32 and 82.41% under a voxel resolution of 20 × 20 × 20 on ModelNet40; 90.08% under a voxel resolution of 32 × 32 × 32 and 89.64% under a voxel resolution of 20 × 20 × 20 on ModelNet10. An appropriate increase in voxel resolution improved the performance of the proposed system. Equally, an appropriate addition of enhancement nodes and/or feature nodes improved the classification accuracy of VB-Net. The proposed VB-Net also showed resistance to Gaussian noise in point cloud data. In future work, we will try to find combinations of enhancement nodes and feature nodes in VB-Net that achieve higher classification accuracy.
