In this section, we will introduce the network structure and provide details of the feature learning process of the entire network.
3.2. Feature Learning
As shown in
Figure 1, the proposed PEMCNet has three main hierarchies in the network to realize the multi-scale feature learning for the point cloud. Each hierarchy of the PEMCNet in the encoding part consists of two feature learning branches. Each branch consists of a proposed PEG unit, ARSE unit, and a commonly used Shared MLP [
36] module.
The PEG strategy adopted in the network aims at finding neighboring points around each centroid at a given expansion rate
e. This strategy conducts simultaneous K-NN searching at different expansion rates, which yields an efficient learning process and a lightweight storage footprint. The two PEG units in each hierarchy are assigned different expansion rates, which realizes multi-scale feature learning. In each PEG unit, every input point is treated as a center point, and sparse sampling at an expanded step is conducted to find the neighboring points used to summarize the local information. Specifically, an expansion rate parameter
e is needed to realize the sparse sampling. The principle behind the PEG unit is illustrated in
Figure 2. Suppose, for example, that K = 3 neighbors are required at the smallest expansion rate; then the three nearest neighboring point features of the
nth center point
q are picked. Likewise, at the other scale with a larger expansion rate, a sparser set of points covering a wider area is picked instead. In this way, the proposed PEG strategy enlarges the receptive field in dense point feature learning and realizes multi-scale feature learning by varying the expansion rate. It should be highlighted that this multi-scale searching process incurs no additional computation, which is what makes the proposed network efficient.
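As a concrete sketch of this grouping rule, the following minimal NumPy example picks every e-th entry of each point's sorted neighbor list, so that one distance sort serves all expansion rates. The function name and the exact offset convention (taking neighbors e−1, 2e−1, …, Ke−1) are assumptions; the paper only specifies sparse sampling at an expanded step.

```python
import numpy as np

def peg_group(points, k, e):
    """Point Expansion Grouping (sketch): for every point, pick K
    neighbors by taking every e-th entry of its sorted neighbor list.
    A larger expansion rate e covers a wider receptive field, and all
    rates reuse the same sorted ordering, so multi-scale grouping adds
    no extra search cost."""
    # pairwise squared distances, shape (N, N)
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    # one nearest-neighbor ordering per point (column 0 is the point itself)
    order = np.argsort(dist2, axis=1)[:, 1:]
    # sparse sampling at expanded step e: neighbors e-1, 2e-1, ..., Ke-1
    idx = order[:, e - 1::e][:, :k]
    return idx  # (N, K) neighbor indices at expansion rate e
```

With e = 1 this reduces to plain K-NN grouping; a second call with e = 2 on the same points selects every second neighbor, doubling the spatial extent for the same K.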
The input point cloud set for classification consists of N points, each described by a d-dimensional feature vector; the original point feature includes the x-y-z coordinates, the pulse intensity, and the return number of each point, and can also include RGB information, depending on the specific dataset. Through a PEG unit,
K neighboring points are sampled for each input point at a specific expansion rate to acquire the local point features at one scale. In each hierarchy, such as the first hierarchy, as illustrated in
Figure 1, two sets of point feature vectors, one per expansion rate, are obtained as a two-scale set of point features.
The ARSE strategy deployed in the classification network helps to better depict the spatial relationship between each centroid and its neighboring points, enriching the feature of each point with more abundant spatial information. Embedding only the x-y-z coordinates (absolute positions) of the neighboring points is argued to be insufficiently informative for point cloud feature learning, since the relative positions between points are also significant. Therefore, after the neighboring points are extracted through the PEG units, the ARSE unit unites the absolute position information with the relative positions between points. As shown in
Figure 1, the ARSE unit is deployed following each PEG unit to encode the absolute and relative positions of the
nth center point
q and its
kth neighboring point as follows:
f(q, p_k) = x_q ⊕ x_{p_k} ⊕ (x_q − x_{p_k}) ⊕ D(x_q, x_{p_k}),
where f is the relationship evaluation function of the proposed ARSE unit, x_q denotes the x-y-z position of the nth center point q, x_{p_k} denotes that of its kth neighboring point p_k, D calculates the Euclidean distance, and ⊕ is the feature concatenation operation, meaning that the dimension of the original point feature vector is expanded by splicing together the absolute and relative position information of each point. After this unit, the channel number of the point feature vector in each hierarchy increases by
d dimensions, so the data flow of each branch in the first hierarchy expands accordingly. This in fact benefits the entire network in learning local spatial structures.
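Under these definitions, the per-neighbor encoding can be sketched as follows. This is a minimal NumPy illustration; the ordering of the concatenated terms is an assumption based on the description above, not the authors' exact implementation.

```python
import numpy as np

def arse_encode(center, neighbors):
    """Absolute and Relative Spatial Encoding (sketch).
    center: (3,) x-y-z position of the centroid.
    neighbors: (K, 3) x-y-z positions of its K neighboring points.
    Returns a (K, 10) array that concatenates, per neighbor: the
    centroid's absolute position, the neighbor's absolute position,
    their relative offset, and the Euclidean distance between them."""
    k = neighbors.shape[0]
    rel = center[None, :] - neighbors                   # relative position, (K, 3)
    dist = np.linalg.norm(rel, axis=1, keepdims=True)   # Euclidean distance, (K, 1)
    absolute = np.repeat(center[None, :], k, axis=0)    # centroid position, (K, 3)
    return np.concatenate([absolute, neighbors, rel, dist], axis=1)
```

Each 3-dimensional neighbor position thus grows to 10 channels, illustrating how splicing absolute and relative information expands the point feature dimension before the Shared MLP.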
After local information learning through the PEG and ARSE units, a three-layer Shared MLP follows to realize further feature mapping, and the extracted local point information is summarized with a max-pooling operation. The numbers of 1D convolution kernels in the three layers of the Shared MLP in the first hierarchy are 32, 32, and 64, respectively, so each input point feature is mapped to a 64-dimensional vector (the same holds for the other branch). Then, the point features at the two scales are spliced together, and the Farthest Point Sampling (FPS) [25] algorithm is conducted at the same time to implement down-sampling, yielding a new point feature map over the sampled points. The point feature learning process of Hierarchy (2) and Hierarchy (3) is the same as that of Hierarchy (1); the three-layer Shared MLPs used in those two hierarchies follow the same structure with their own channel numbers, and a down-sampling operation is performed between each pair of hierarchies. Throughout the encoding process, the point feature map obtained by each hierarchy therefore contains fewer points, each with a higher feature dimension: the point number is reduced by down-sampling, while the per-point feature dimension is increased in each hierarchy to retain more information. For the decoding part, the encoded features go through a series of FP modules. In each FP module, inverse-distance weights are first calculated over the nearest neighbor points of each centroid, so that the point feature set obtained from the previous layer can be up-sampled using nearest-neighbor interpolation [25]. The feature maps obtained via up-sampling are then concatenated with the intermediate feature maps from the corresponding encoding hierarchy (shown in Figure 1) for further learning with a Shared MLP. Finally, the fused point features are processed through the FC layers, whose output is a probability vector with length equal to the number of categories of the specified task, to realize point cloud classification.
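To make the encoder's down-sampling and the FP module's up-sampling concrete, the following minimal NumPy sketch implements farthest point sampling and inverse-distance-weighted interpolation. The function names, the FPS start index, and the default of k = 3 interpolation neighbors are assumptions rather than the authors' exact settings.

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Iteratively pick m points, each farthest from those already
    chosen, to obtain an evenly spread subset for down-sampling."""
    n = points.shape[0]
    chosen = [0]                      # start from the first point (a common convention)
    d = np.full(n, np.inf)            # distance of each point to the chosen set
    for _ in range(m - 1):
        d = np.minimum(d, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(np.argmax(d)))
    return np.array(chosen)

def fp_interpolate(query_xyz, known_xyz, known_feat, k=3, eps=1e-8):
    """Up-sample features to the query points as an inverse-distance
    weighted average over each query's k nearest known points."""
    out = np.zeros((query_xyz.shape[0], known_feat.shape[1]))
    for i, q in enumerate(query_xyz):
        dist = np.linalg.norm(known_xyz - q, axis=1)
        idx = np.argsort(dist)[:k]
        w = 1.0 / (dist[idx] + eps)   # inverse-distance weights
        w = w / w.sum()               # normalize into a weighted average
        out[i] = w @ known_feat[idx]
    return out
```

In a full decoder, `fp_interpolate` would be applied once per FP module, with its output concatenated to the corresponding encoder feature map before the Shared MLP, mirroring the skip connections in Figure 1.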