This section presents the design of MeshCaps in detail. First, the overall network architecture is introduced. To apply convolution directly to mesh data while keeping the parametric representation simple, we design the convolution template in the form of a parametric equation. The input data are reorganized by extracting features with a polynomial convolution kernel, and the corresponding weights are computed from the relative positions of the vertices in the local space, capturing fine geometric changes in local regions of the mesh. An improved multilayer capsule network then classifies the fused shape and pose features.
3.1. MeshCaps Framework
The MeshCaps network structure is shown in Figure 3; training is divided into two stages. (1) Convolutional feature-mapping stage: the polynomial template is used as a convolution kernel to extract features over the entire model, producing the convolutional feature map for this stage. (2) Capsule-network training stage: the capsule network consists of a capsule composition layer, a primary capsule layer, and a Mesh capsule layer, and the final output is used for classification. Compared with an ordinary capsule network, MeshCaps adds a capsule composition layer that maps the polynomial parameter features to the primary capsule layer so that more representative features are extracted; at the same time, weight sharing is used to train the pose transformation matrix between capsule layers, so the network no longer depends on the size of the input model.
In Figure 3, the feature-extraction box shows the polynomial fitting of each surface element by the generalized least squares method (GLS); the polynomial parameters are used as the surface features. For ease of comparison, the polynomial function F(X, Y, Z) is visualized in the feature-extraction box of Figure 3. N denotes the number of window surfaces, K the number of neighborhood points, and N × 10 the N surface elements, each represented by a 10-dimensional vector of high-order equation parameters; d is the primary capsule dimension, and c is the number of categories.
3.2. Local Shape Feature Extraction
We aim to apply a concise and lightweight feature-extraction method in the network model. Given a three-dimensional deformable target mesh model M, a local surface window is defined as follows: taking a vertex of the mesh model as the window center, breadth-first search collects the first K − 1 neighboring vertices; the selected vertices and the edges between them form a connected local mesh surface, i.e., a local surface window, where K is the size of the convolution template window. Multiple sets of experiments were carried out to select K, and the best value is K = 152; that is, the size of the convolution window is 152.
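The window construction described above can be sketched as a breadth-first traversal over the mesh's vertex adjacency. The names `adjacency` and `local_window` are illustrative, not from the paper's implementation:

```python
# Sketch of local-window extraction via breadth-first search (BFS).
# `adjacency` maps each vertex index to its neighboring vertex indices.
from collections import deque

def local_window(adjacency, center, k=152):
    """Collect the center vertex plus its K-1 nearest BFS neighbors."""
    visited = [center]
    seen = {center}
    queue = deque([center])
    while queue and len(visited) < k:
        v = queue.popleft()
        for nb in adjacency[v]:
            if nb not in seen:
                seen.add(nb)
                visited.append(nb)
                queue.append(nb)
                if len(visited) == k:
                    break
    return visited
```

The returned vertices, together with the mesh edges among them, form the connected local surface window.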
To avoid the influence of rigid and nonrigid transformations, a local coordinate system is established in each window, and the absolute coordinates of the vertices in the window are converted to coordinates in this local system. Since the local surface within a window is relatively simple, its shape is described in the local coordinate system by a high-order polynomial equation, as in Formula (3):
where F is a continuous function describing the shape of the local mesh window and serves as the parametric representation of the mesh, and x, y, and z are the coordinates of the window vertices in the local coordinate system.
During the experiments it was found that when the local window is very small, the mesh shape is essentially the same everywhere, while as K increases, the mesh inside the window becomes more complicated. The local coordinates x, y, and z of the vertices alone are then not enough to describe the mesh shape. We therefore introduce the geodesic distance to enrich the polynomial representation. However, computing geodesic distances is time-consuming and would affect the performance of the entire network, so the block distance d is used as an approximation of the geodesic distance. Here, the block distance of a point in the convolution window is the block distance between that point and the center point of the surface element.
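The block (city-block, L1) distance to the window center is cheap to compute; a minimal sketch, with `points` holding the window vertices in local coordinates:

```python
import numpy as np

def block_distance(points, center):
    """City-block (L1) distance of each window vertex to the window
    center, used as a cheap stand-in for the geodesic distance."""
    return np.abs(points - center).sum(axis=1)
```

Unlike the true geodesic distance, this ignores the mesh connectivity, which is the trade-off that makes it fast.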
For a mesh window, let the relative coordinates of its vertices be given in the local coordinate system; the window fitting function is then given by Formula (5). The window fitting function F is a continuous function describing the shape of the local window: it encodes the local triangle-set information, describes the local shape of the patch, and captures the shape transformation of the mesh window, serving as the parametric representation of the mesh. Here x, y, and z are the local coordinates of a point on the mesh and d is its block distance; the z-value predicted by the fitted function is compared with the true z-axis coordinate to measure the fitting error. The fitted function F as a whole can be used as an approximate representation of the local mesh.
The surface fitting results are shown in Figure 4. The blue scatter plot shows the distribution of vertices in the mesh window, and the red surface shows the result of fitting with the polynomial function. The fitting error is the mean error over all vertices of the surface.
During fitting, to avoid the influence of the surface's position and pose on the feature layer, the mesh is first canonically aligned: the center point is moved to the origin of the three-dimensional coordinate system, and the normal vector is aligned with its z-axis. The equation parameters are then solved by the generalized least squares method (GLS).
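A minimal least-squares version of the window fit can be sketched as follows, assuming the 10 parameters come from a cubic bivariate polynomial z = F(x, y) in the aligned local frame (the paper uses GLS; ordinary least squares is substituted here for brevity):

```python
import numpy as np

def fit_window(xyz):
    """Fit z = F(x, y) over a local window with a cubic bivariate
    polynomial (10 monomials -> the 10-dimensional parameter vector).
    Ordinary least-squares sketch; the paper solves this with GLS."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    # Design matrix of monomials: 1, x, y, x^2, xy, y^2, x^3, x^2 y, x y^2, y^3
    A = np.stack([np.ones_like(x), x, y,
                  x**2, x*y, y**2,
                  x**3, x**2*y, x*y**2, y**3], axis=1)
    params, *_ = np.linalg.lstsq(A, z, rcond=None)
    fit_err = np.mean(np.abs(A @ params - z))  # mean error over all vertices
    return params, fit_err
```

The returned parameter vector plays the role of the per-window shape feature, and `fit_err` corresponds to the mean fitting error visualized in Figure 4.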
As can be seen from Figure 5, after sliding the convolution window over a model, each model can be represented by n parameter equations, and the parameters can be used as the shape feature descriptor of the mesh fragment under each window. To introduce surface pose information, after extracting the shape features of a mesh surface, the coordinates of its center point and its normal vector are appended, so that the network can also learn the orientation of the surface.
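Assembling the per-window descriptor is then a simple concatenation; assuming the 10 polynomial parameters plus a 3-D center and 3-D normal, each surface element yields a 16-dimensional feature (the function name is illustrative):

```python
import numpy as np

def surface_descriptor(params, center, normal):
    """Per-window descriptor: the 10 polynomial parameters (shape)
    concatenated with the window center and normal vector (pose)."""
    return np.concatenate([params, center, normal])
```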
3.3. Mesh Capsule Networks
A traditional capsule network first uses convolutional layers for feature extraction and then gradually integrates the features into deeper representations through capsule layers for classification. Because the features extracted above are shallow — they retain some spatial information but carry little semantic information — a capsule composition layer is added after the feature-extraction module to map the equation-parameter feature vectors to the primary capsule layer. In the convolutional feature layer, each patch is represented by a 10-dimensional vector of polynomial parameters. Three one-dimensional convolutions progressively increase the number of channels to extract higher-dimensional features, and each convolution is followed by a normalization layer to speed up training and convergence.
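The composition layer can be sketched with per-surface 1×1 convolutions (each 10-dim parameter vector mapped independently). The channel progression 10 → 32 → 64 → 128 and the normalization details are illustrative assumptions, not values from the paper:

```python
import numpy as np

def conv1x1(feats, w, b):
    """1x1 one-dimensional convolution over the N surfaces, followed by
    a simple normalization layer and ReLU."""
    out = feats @ w + b                       # (N, C_in) @ (C_in, C_out)
    mean = out.mean(axis=0, keepdims=True)
    std = out.std(axis=0, keepdims=True) + 1e-5
    return np.maximum((out - mean) / std, 0.0)

def composition_layer(feats, rng):
    """Illustrative capsule composition layer: three stacked 1x1 convs
    raising the channel count 10 -> 32 -> 64 -> 128. Weights are random
    here; in the network they are learned."""
    chans = [feats.shape[1], 32, 64, 128]
    for cin, cout in zip(chans[:-1], chans[1:]):
        w = rng.standard_normal((cin, cout)) * 0.1
        feats = conv1x1(feats, w, np.zeros(cout))
    return feats
```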
As shown in Figure 3, the primary capsule layer has N capsules, each of dimension d. The capsule composition layer maps the feature vectors to the primary capsule layer, and each primary capsule is expressed as:
To measure the saliency of a vector feature, the capsule network applies a compression (squashing) function that maps the capsule length into the [0, 1] range, so that the length of the capsule vector can represent the probability of the feature while preserving the value of each dimension of the vector: v = (‖s‖² / (1 + ‖s‖²)) · (s / ‖s‖), where v is the output vector of the capsule and s is its input vector. This activation function is applied to compute the output of each primary capsule.
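The squashing activation described above, in a minimal NumPy form:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Capsule squashing: v = (|s|^2 / (1 + |s|^2)) * s / |s|.
    Maps the capsule length into [0, 1) while keeping its direction."""
    sq_norm = np.sum(s**2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)
```

Long vectors are shrunk toward unit length and short vectors toward zero, so the output length can be read as a probability.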
Because, unlike two-dimensional images, the models in a three-dimensional mesh dataset differ in size and input dimension, the capsule network is modified: the pose matrices between the lower-level and higher-level capsules are replaced by a pose matrix with shared weights. This reduces the number of trainable parameters and, because the pose matrix shrinks from M × N to M × 1, allows the network to accept three-dimensional models of different sizes. All input vectors are mapped through the same pose matrix, and the clustering results are output. Its expression is as follows: each prediction vector is obtained by applying the shared pose matrix to the output vector of a primary capsule, and the pose matrix is trained by the backpropagation algorithm.
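With weight sharing, every primary capsule is transformed by the same matrix, so the parameter count no longer depends on the number N of input capsules. A minimal sketch (names are illustrative):

```python
import numpy as np

def predictions_shared(u, W):
    """Prediction vectors with a single shared pose matrix W
    (d_out x d_in): every primary capsule u_i gets the same W,
    so models with different N are handled uniformly."""
    return u @ W.T    # (N, d_in) @ (d_in, d_out) -> (N, d_out)
```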
The input to the Mesh capsule layer is the weighted sum of all capsule prediction vectors from the primary capsule layer, where the weights are the coupling coefficients determined by the dynamic routing algorithm from the agreement between each prediction vector and the output capsule. The initial logits are the log prior probabilities that capsule i should be coupled with capsule j. The dynamic routing algorithm of MeshCaps is the same as the routing algorithm in the original formulation.
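Routing-by-agreement as in the original capsule network can be sketched as follows, with `u_hat` of shape (N, C, d) holding the prediction vectors from N primary capsules to C Mesh capsules:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    n2 = np.sum(s**2, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, n_iter=3):
    """Dynamic routing: iteratively refine coupling coefficients c_ij
    from the agreement between predictions u_hat and outputs v_j."""
    N, C, d = u_hat.shape
    b = np.zeros((N, C))                                      # initial logits b_ij
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over j
        s = (c[:, :, None] * u_hat).sum(axis=0)               # s_j = sum_i c_ij u_hat
        v = squash(s)                                         # (C, d) outputs
        b = b + (u_hat * v[None]).sum(axis=-1)                # agreement update
    return v
```

Three iterations, as in the original capsule network, are usually enough for the coefficients to settle.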
MeshCaps is applied only to three-dimensional mesh classification, so the reconstruction module and reconstruction loss of the traditional capsule network are discarded during training and prediction, which reduces model complexity and helps improve training efficiency. The margin loss for class c is L_c = T_c · max(0, m⁺ − ‖v_c‖)² + λ (1 − T_c) · max(0, ‖v_c‖ − m⁻)², where T_c is the indicator function of the classification: T_c equals 1 if class c is correctly predicted and 0 otherwise. m⁺ is the upper bound, penalizing the case where class c is predicted to exist but does not (a false detection); m⁻ is the lower bound, penalizing the case where class c is predicted not to exist but does (a missed detection); λ is a proportional coefficient that adjusts the proportion of the two terms. The total margin loss is calculated by summing the individual margin losses over all C classes.
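The margin loss above, summed over all C classes, can be written directly; the threshold values `m_pos=0.9`, `m_neg=0.1`, `lam=0.5` are the common capsule-network defaults and are assumptions here, since the paper's settings are not shown:

```python
import numpy as np

def margin_loss(v_norms, target, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss over all C classes. v_norms: lengths of the C
    Mesh-capsule output vectors; target: index of the true class."""
    C = v_norms.shape[0]
    T = np.zeros(C)
    T[target] = 1.0                                    # indicator T_c
    pos = T * np.maximum(0.0, m_pos - v_norms) ** 2    # false-negative term
    neg = lam * (1 - T) * np.maximum(0.0, v_norms - m_neg) ** 2  # false-positive term
    return np.sum(pos + neg)
```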