AMS-Net: An Attention-Based Multi-Scale Network for Classification of 3D Terracotta Warrior Fragments

As an essential step in the restoration of the Terracotta Warriors, the result of fragment classification directly affects the performance of fragment matching and splicing. However, most existing methods are based on traditional technology and have low classification accuracy, so a practical and effective classification method for fragments is urgently needed. To this end, an attention-based multi-scale neural network named AMS-Net is proposed to extract significant geometric and semantic features. AMS-Net is a hierarchical structure consisting of a multi-scale set abstraction block (MS-BLOCK) and a fully connected (FC) layer. MS-BLOCK consists of a local-global layer (LGLayer) and an improved multi-layer perceptron (IMLP). With a multi-scale strategy, LGLayer can extract local and global features from different scales in parallel. IMLP can concatenate high-level and low-level features for the classification task. Extensive experiments are conducted on the public data sets ModelNet40/10 and on a real-world Terracotta Warrior fragment data set. With normals as input, accuracy reaches 93.52% and 96.22% on ModelNet40 and ModelNet10, respectively; on the real-world data set, the accuracy is the best among existing methods. The robustness and effectiveness of the network on the task of 3D point cloud classification are also investigated. These results show that the proposed end-to-end learning network is effective and well suited to the classification of Terracotta Warrior fragments.


Introduction
As one of the critical channels for spreading Chinese culture, the Terracotta Warriors have undergone three large-scale excavations since they were first discovered in 1974. Most of the excavated Terracotta Warriors have been found in fragments owing to the natural environment and human factors (Figure 1). It is therefore of great significance to protect and restore these cultural relics. Because the Terracotta Warriors were found broken and randomly mixed, traditional manual restoration demands much time and tedious work. Remote sensing enables rapid multi-source data analysis and dynamic monitoring of cultural relics and their surrounding environments [1]. With its development, remote sensing archaeology has become an increasingly common way for researchers to investigate cultural sites. Light detection and ranging (LiDAR) data are easy to obtain and include the height and structure information of objects [2]. A point cloud can efficiently reduce secondary damage to cultural relics and is of great significance for their protection and restoration, as it can restore the structural relationship of the fragments.
Some studies have proposed various traditional approaches to classify fragments of cultural relics. Most of the previous methods are mainly based on color features [4], texture features [5], color and texture features [6][7][8], texture and shape features [9], multiple-feature fusion [10], and other features. These traditional methods usually require experts to design accurate feature description operators and are time-consuming: experts manually classify and calibrate fragments with non-salient or fused features based on experience. These are the reasons why the traditional methods have relatively low classification accuracy.
With the development of deep learning, convolutional neural networks (CNNs) have shown significant success in image recognition [11,12], video analysis [13,14], speech emotion recognition (SER) [15][16][17][18], and other domains. Building on this work, Wang [19] proposes an improved CNN specialized for the classification of Terracotta Warrior fragments for the first time. Compared with traditional methods, the proposed method reduces the time complexity of the algorithm and improves the efficiency of fragment classification. However, the accuracy of image-based deep learning classification of Terracotta Warrior fragments is still relatively low. In recent years, the excellent results of deep neural networks in 2D image processing have motivated data-driven approaches to learning features on 3D models. Unlike 2D images, common 3D data representations include volumetric grids, depth images, and point clouds. According to the input data type, 3D shape classification methods can be divided into volumetric-based [20,21], multi-view-based [22,23], and point-based methods. Compared with the former two data types, the point cloud is one of the most straightforward 3D shape representations and has been widely used. However, a key challenge is that the raw point cloud is irregular and unordered. As a pioneering work, PointNet [24] directly takes a point cloud as its input and achieves permutation invariance with a symmetric function. Inspired by PointNet, Gao et al. [25] present an automatic method combined with template guidance to classify 3D fragments of the Terracotta Warriors. In [26], the proposed method directly consumes the point cloud and texture image of a fragment and outputs its category. Experimental results demonstrate that these two methods perform better than traditional methods. However, the baseline model of both is PointNet, which fails to capture local features adequately.
To capture local structures better, subsequent works have been proposed (e.g., PointNet++ [27], PointCNN [28], and DGCNN [29]).
Although existing deep learning models have shown suitable performance in point cloud classification, there are still some shortcomings. During our classification experiments, we found that most existing deep learning models have the following problems: (1) The receptive fields are of fixed size, so the models cannot learn complex features by extracting features from different scales in parallel. The characteristics of Terracotta Warrior fragments differ in size and location: some fragments have a smooth surface and few salient features, while fragments from the body carry detailed features of the plackets. For the classification of 3D Terracotta Warrior fragments, the selection and extraction of salient and representative features therefore remain challenging. (2) Capturing long-range dependencies is crucial in deep neural networks. Most existing methods obtain long-range dependencies through the large receptive fields formed by deep stacks of convolutions. However, blindly increasing the depth of the network can reduce its performance and, worse, makes the network more complex. (3) In most existing deep learning methods for point cloud understanding, features are abstracted into higher dimensions through MLP layers and then aggregated by a max/avg-pooling operation. However, pooling-based feature aggregation can hardly encode the correlation between feature vectors. How to aggregate the learned local region features and their spatial relationships is still a challenging task.
In order to solve the problems mentioned above, an end-to-end attention-based multi-scale neural network, named AMS-Net, is introduced to specialize in the classification of 3D Terracotta Warrior fragments. A multi-scale set abstraction block (MS-BLOCK) is designed to extract local and global features from different scales and to capture long-range dependencies from the input data. In addition, high-level features contain more semantic information but less spatial information, while low-level features carry more spatial coordinate information but insufficient semantic information. The improved multi-layer perceptron (IMLP) can retain both the high-level and low-level features well. The aggregated features, enriched with abundant information through a skip-connection strategy, are then fed to a fully connected (FC) layer for further processing. Finally, a softmax classifier is used for the classification. Extensive experiments demonstrate that the proposed network achieves improved performance on the classification of 3D Terracotta Warrior fragments. The main contributions of this work are summarized as follows: • A novel hierarchical network called AMS-Net is proposed to enhance the capability of extracting the features of the 3D Terracotta Warrior fragments. To decrease the computational cost, AMS-Net extracts contextual features in a multi-scale way instead of stacking many layers to increase the receptive field directly.
The self-attention model is adopted to integrate the semantic and spatial relationships between features. To the best of our knowledge, this is the first work to apply a multi-scale structure and a self-attention strategy to the classification of 3D cultural relic fragments; • A local-global module is proposed, which can effectively achieve local region feature aggregation and capture long-range dependencies. Its two main components are the local features aggregated cell (LFA-Cell) and the global features aggregated cell (GFA-Cell). LFA-Cell is proposed to preserve complex local structures, which are explicitly encoded with the spatial locations from the original 3D space, while GFA-Cell obtains global geometric features based on self-attention. As one of the important components of LFA-Cell, a self-attention feature aggregation method named the attentive aggregation sub-unit (AAS) is proposed. Compared with traditional max-pooling-based feature aggregation networks, AAS can explicitly learn not only the geometric features of local regions but also the spatial relationships among them; • As the performance of the feature extractor is strongly affected by the dimension of the max-pooling layer, a feature fusion method named IMLP is proposed, tailored to our multi-scale structure, which can aggregate both low-level and high-level features with rich local information; • Our AMS-Net can explicitly learn both the geometric features of local regions and the spatial relationships among them. The proposed method is better suited to the characteristics of the Terracotta Warrior fragments and achieves suitable classification results.
The remainder of this paper is organized as follows: the related work is introduced in Section 2. Then, the detailed overview of the proposed system and its sub-modules are described in detail in Section 3. In Section 4, the data preprocessing method of the Terracotta Warrior fragments and the experimental results are provided. Finally, the conclusions and the limitations of this study and future works are illustrated in Section 5.

Traditional Classification Methods of Terracotta Warrior Fragments
As the most critical step, 3D shape classification of the Terracotta Warrior fragments plays a vital role in the protection and restoration of cultural relics. Many studies have focused on archaeological problems that can be addressed with images or 3D models, and some researchers have proposed methods for classifying fragments. Kampel et al. [4] focus on the classification of two-dimensional fragments based on color properties, whereas Qi et al. [5] address the problem on the basis of surface texture properties. Some researchers have proposed classifying cultural relic fragments with two or more feature description operators. Rasheed et al. [6][7][8] present algorithms that rely on the intersection of the RGB colors of archaeological fragments and on texture features extracted from the fragments with the gray-level co-occurrence matrix (GLCM). Wei et al. [9] extract texture features and shape features with the scale-invariant feature transform (SIFT) algorithm and Hu invariant moments; combining these features, they propose a new method based on a support vector machine (SVM) for the classification of Terracotta Warrior fragments. Zhao et al. [10] extract the fragments' significant region features based on region and shape features: the earth mover's distance (EMD) is used to match the region features of the fragments for coarse classification, and the shape features are extracted by the Hu invariant moment. Kang et al. [30] obtain the salient features on the surface of Terracotta Warrior fragments by clustering local surface descriptions. Lu et al. [31] present a local descriptor that extracts the fragments' rotational projection features and salient local features, together with a corresponding similarity-measure matching method in which the weight of each type of feature is adaptively calculated according to its measurement results. Du et al. [32] propose a modified point feature histogram (PFH) descriptor to match fragments with templates. Karasik and Smilansky [33] propose a method that relies on the computerized morphological classification of ceramics.

Deep Learning on Point Clouds
According to the network architecture used for per-point feature learning, methods can be divided into point-wise MLP [24,27,34], convolution-based [28,35,36], graph-based [29,37], and other methods. PointNet++ [27] improves performance by introducing a hierarchical approach to feature extraction, which captures local structures better. Due to the irregular format of the point cloud, convolutional kernels for 3D point clouds are challenging to design. PointCNN [28] generalizes CNNs to leverage spatially local correlations in data represented as point clouds. Relation-shape convolution [35], a learn-from-relation convolution operator, can explicitly encode the geometric topology constraint among points; based on this convolution, a hierarchical architecture, RS-CNN (relation-shape convolutional neural network), is presented. A self-organizing map (SOM) [36] is built to model the spatial distribution of the input point cloud, enabling hierarchical feature extraction on both individual points and SOM nodes; this extends regular grid CNNs to irregular configurations for contextual shape-aware learning of point clouds. Wang et al. [37] present a spectral graph convolution on a local graph, combined with recursive cluster pooling, to make full use of the relative structure and features of neighboring points; the method requires no pre-computation of the graph Laplacian matrix or graph coarsening hierarchy. As large-scale data sets of partial views of real objects are lacking, Par3DNet [38] is proposed to fill the gap between synthetic and real data; it takes a partial 3D view of an object as input and is able to classify it accurately. Hou et al. [39] propose a novel method for detecting gold foil damage on stone carving relics by making use of multi-temporal 3D LiDAR point clouds.

Multi-Scale Structure
Feature extraction is a crucial part of the pipeline, and its performance strongly influences the quality of the classification results. As a useful technology, research on multi-scale structures is gradually increasing. Zhao et al. [40] present a novel transfer learning framework based on a deep multi-scale convolutional neural network (MSCNN); applied to the intelligent fault diagnosis of rolling bearings, MSCNN shows excellent performance. Huang et al. [41] present a workflow for LiDAR point cloud classification that combines multi-scale feature extraction with manifold-learning-based dimensionality reduction. Another elegant mechanism to significantly increase the receptive field size is the dilated convolution network (DCN). Mustaqeem et al. [42] propose a one-dimensional dilated convolutional neural network (DCNN) architecture for the SER system; the proposed framework uses dilated convolution layers (DCLs) to enhance the usage of the features and to improve on the current baseline methods.

Attention Mechanism
In recent years, the self-attention mechanism has made remarkable achievements in the field of computer vision. It has become an essential component for capturing long-term dependencies: it down-weights irrelevant features through a score function and focuses on crucial ones. Mustaqeem et al. [43] propose a self-attention module (SAM) for the SER system, the first use of an attention mechanism in the SER domain; experiments on speech emotion databases prove the effectiveness of the system. Vaswani et al. [44] propose a model architecture that relies entirely on an attention mechanism to draw global dependencies between input and output. Wang et al. [45] present non-local operations to capture long-range dependencies in video sequences and show that self-attention can be viewed as a form of the non-local mean. Inspired by self-attention, the two critical components of PointASNL [46], the adaptive sampling (AS) module and the local-nonlocal (L-NL) module, are proposed. PointASNL can deal with noisy point clouds effectively and achieves suitable performance by combining local neighbors with global context interaction.

Methods
Firstly, the multi-scale framework for 3D point cloud classification with a hierarchical architecture is presented (Section 3.1). Secondly, the local-global module, which can effectively extract local and global geometric information and can be plugged into existing deep neural networks, is described (Section 3.2). Thirdly, the local-global layer (LGLayer), which is composed of M (M = 3) independent local-global modules and generates multi-scale features, is introduced (Section 3.3). Finally, the improved method called IMLP, which retains both low-level and high-level features, is explained (Section 3.4). In the following subsections, each cell in the pipeline is introduced in detail. Many symbols and notations appear in each cell; to aid understanding, a dedicated table defining all of them is provided in the Supplementary Materials. The notations and definitions are shown in Table A1.

Our Proposed AMS-Net
Motivated by the multi-scale structure, a novel attention-based multi-scale neural network named AMS-Net is proposed, as illustrated in Figure 2. The input of our network is a raw point set χ = {x_i ∈ R^(3+c_in), i = 1, 2, ..., N}, where N is the size of the point cloud χ. Each point is composed of a 3D coordinate (x, y, z) and other features (e.g., RGB, normal, etc.). The main components of our hierarchical AMS-Net are the MS-BLOCK and the FC layer. On each level, the MS-BLOCK module has two components: LGLayer and IMLP. Firstly, N_FPS points χ_FPS = {x_1, ..., x_i, ..., x_N_FPS} are selected by farthest point sampling (FPS) to define the centroids of the local regions. After that, LGLayer is used to capture abundant local geometric information and to share geometric features with distant points at each scale. As shown in Figure 2b, LGLayer consists of M independent local-global modules that generate a multi-scale feature with M × c_out channels; the output point cloud f_mlg is the concatenation of the point clouds f_slg extracted by each local-global module. The structure of the local-global module is shown in Figure 2c. Then, the point clouds g1 and g2 are obtained by the proposed IMLP module, which retains both low-level and high-level features. Finally, the learned global feature G is obtained by connecting the former two levels and can be applied to shape classification. In summary, the proposed framework exhibits impressive performance in point cloud classification through hierarchical multi-layer learning.
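As context for the centroid-selection step, a minimal NumPy sketch of farthest point sampling is given below. The function name and the choice of starting point are our own; practical implementations typically run this on GPU over batches.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Iteratively pick the point farthest from the already-chosen set.

    points: (N, 3) array of xyz coordinates; n_samples: number of centroids.
    Returns the indices of the sampled centroids.
    """
    n = points.shape[0]
    chosen = np.zeros(n_samples, dtype=np.int64)
    # Distance from every point to the nearest chosen centroid so far.
    dist = np.full(n, np.inf)
    farthest = 0  # start from an arbitrary point
    for i in range(n_samples):
        chosen[i] = farthest
        d = np.sum((points - points[farthest]) ** 2, axis=1)
        dist = np.minimum(dist, d)
        farthest = int(np.argmax(dist))
    return chosen
```

Unlike random sampling, FPS spreads the N_FPS centroids evenly over the fragment surface, which is why it is commonly preferred for defining local regions.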
Local-Global Module

LFA-Cell is composed of two parts, local geometric relation and features encode (LGRFE) and the attentive aggregation sub-unit (AAS), and can effectively learn complex local structures. GFA-Cell can capture long-range dependencies. Together, the local-global module can well extract the structural and semantic features in both the local and global sections. More details are introduced in Sections 3.2.1 and 3.2.2.

Local Features Aggregated Cell (LFA-Cell)

LFA-Cell (L) and GFA-Cell (G) both operate on the sampled point cloud χ_FPS, where N_FPS is the size of the current sampled point cloud.
As illustrated in Figure 4, the LFA-Cell has two key components: local geometric relation and features encode (LGRFE) and the attentive aggregation sub-unit (AAS). The overall procedure of LFA-Cell is as follows. Firstly, the k neighboring points of the point x_i are obtained by the KNN method, denoted as Neig(x_i) = {x_i^j | j = 1, 2, ..., k}, where x_i^j is the jth of the k neighboring points of x_i (namely, x_i ∈ R^(3+c_in) → Neig(x_i) ∈ R^(k×(3+c_in))). Secondly, the local geometric features are re-encoded from the 3D coordinates of the k neighboring points, and the dimension of the geometric features is changed to c_1 (namely, Neig(x_i) ∈ R^(k×(3+c_in)) → Neig(x_i) ∈ R^(k×(c_1+c_in))). A mid-dimensional feature vector can be denoted as f_con_i ∈ R^(c_1+c_in), which combines the encoded local geometric positions with the features of the k nearest neighbors.
Thirdly, a high-dimensional feature is aggregated from the obtained k mid-dimensional features by using the AAS. Finally, the output point x_i^out is obtained via a skip connection, giving a final feature of size 1 × c_out. The two sub-units are described in the following:

1. Local Geometric Relation and Features Encode (LGRFE)
To effectively learn complex local structures, how to represent the point cloud should be the primary consideration. For a point x_i, its absolute and relative positions alone are incomplete as local neighborhood information. It should also be represented by all the points within the k-nearest neighbors of x_i and by the Euclidean distances between x_i and its neighbors. Combining these four components, more comprehensive local geometric features can be obtained. The k-nearest neighbors of x_i can be denoted as {x_i^1, ..., x_i^j, ..., x_i^k}, with the corresponding coordinate feature vectors. The encoded local geometric feature is defined as Equation (1):

f_geo_i^j = M(C(x_i, x_i^j, x_i − x_i^j, ||x_i − x_i^j||)),   (1)

where C denotes the concatenation operation and M indicates the function conducted by the MLP. This process contributes to learning more comprehensive local features and obtaining suitable performance. Therefore, the encoded local geometric feature vector of point x_i can be denoted as F_geo_i. For each neighboring point x_i^j, a synthesized feature vector f_con_i^j is obtained by connecting the encoded local geometric feature f_geo_i^j with its corresponding additional feature f_fea_i^j. Finally, a new encoded neighboring feature vector F_con_i = {f_con_i^1, ..., f_con_i^j, ..., f_con_i^k} is obtained.

2. Attentive Aggregation Sub-Unit (AAS)

The key idea of AAS is to aggregate the features of the k neighboring points. Given the set of feature vectors F_con_i = {f_con_i^1, ..., f_con_i^j, ..., f_con_i^k} extracted from LGRFE, a single fixed output x_i^out ∈ R^(1×c_out) is formed by AAS. The main steps of AAS are as follows. First, the set of feature vectors F_con_i is fed into a shared function T. For less computation, T takes the form of a linear transformation of the point features. A set of new feature vectors AC_i = {ac_i^1, ..., ac_i^j, ..., ac_i^k} is obtained by an FC layer without bias; that is, ac_i^j = T(f_con_i^j, W) + b, where W is a learnable weight and b = 0.
In the above formulation, f_con_i^j ∈ R^(1×(c_1+c_in)), W ∈ R^((c_1+c_in)×(c_1+c_in)), and ac_i^j ∈ R^(1×(c_1+c_in)). Then, the learned attention score vector SC_i = {sc_i^1, ..., sc_i^j, ..., sc_i^k} is normalized by a softmax operation; the jth element of SC_i is defined as:

sc_i^j = exp(ac_i^j) / Σ_{t=1}^{k} exp(ac_i^t).

Moreover, the feature vector F_att_i = {f_att_i^1, ..., f_att_i^j, ..., f_att_i^k}, with f_att_i^j = sc_i^j · ac_i^j, is summed over the k neighbors. Finally, to avoid losing the low-level features, a skip connection is used to combine the newly aggregated features with the raw features. The final output point x_i^out is obtained with the size of 1 × c_out.
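As a concrete illustration, the LGRFE encoding of Equation (1) and the AAS aggregation can be sketched in NumPy as follows. The function names, the omission of the shared MLP M (the raw 10-channel encoding is returned instead), and the per-channel softmax over the k neighbors are our assumptions, not the paper's exact implementation.

```python
import numpy as np

def encode_local_geometry(points, centroid_idx, k):
    """LGRFE sketch: gather the k nearest neighbors of a centroid and
    concatenate the four components of Equation (1): the centroid x_i,
    the neighbor x_i^j, their offset, and the Euclidean distance.
    Returns a (k, 10) array; a shared MLP M would lift it to c_1 channels."""
    xi = points[centroid_idx]
    d2 = np.sum((points - xi) ** 2, axis=1)
    neigh = points[np.argsort(d2)[:k]]                    # KNN grouping
    offset = neigh - xi                                   # relative positions
    dist = np.linalg.norm(offset, axis=1, keepdims=True)  # Euclidean distances
    xi_rep = np.broadcast_to(xi, neigh.shape)             # absolute position
    return np.concatenate([xi_rep, neigh, offset, dist], axis=1)

def attentive_aggregation(F_con, W):
    """AAS sketch: a shared bias-free linear transform produces per-neighbor
    scores, softmax normalizes them over the k neighbors (per channel), and
    the weighted features are summed into one vector of shape (c,)."""
    AC = F_con @ W                                  # ac_i^j = T(f_con_i^j, W)
    E = np.exp(AC - AC.max(axis=0, keepdims=True))  # numerically stable softmax
    SC = E / E.sum(axis=0, keepdims=True)           # attention scores
    return (SC * AC).sum(axis=0)                    # weighted sum over neighbors
```

In the full network, the skip connection with the raw centroid feature described above would follow this aggregation.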

Global Features Aggregated Cell (GFA-Cell) Based on Self-Attention
As mentioned in LFA-Cell, χ_FPS denotes a sampled point cloud, and the corresponding feature vector is F_FPS = {f_1, ..., f_i, ..., f_N_FPS}, f_i ∈ R^(3+c_in). The overall process of global feature aggregation based on general self-attention is shown in Figure 5. The sampled set χ_FPS with the size of N_FPS and the entire point cloud χ are regarded as query points and key points, respectively. To reduce the computation of the cell, bottleneck layers are added; in this work, each bottleneck layer's size is set to half of the output channels (c_mid = 1/2 c_out). We compute the dot products of the query points with the key points, scale them by c_mid, and apply a softmax function to obtain the weights on the values, which are then aggregated with the function A(·). Therefore, for a sampled point x_i, the global feature aggregation is computed from linear transformations g(·), h(·), and r(·) of the point features.

For simplicity, g is considered in the form of a linear function, that is, g(f_i) = W_g · f_i, where W_g is a learnable weight and "·" denotes element-wise multiplication. h(·) and r(·) are also linear functions. The updated global feature of the sampled point x_i is then obtained through a nonlinear function v(·). In the last step, to ensure the same dimension as the output of LFA-Cell, the global features are fused by v. A skip connection is also used to combine the generated global features with the raw features. The output vector of size N_FPS × c_out is obtained. Therefore, GFA-Cell can break the limitations of local regions and capture more long-range dependencies.
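Under the same notation, the query/key attention at the heart of GFA-Cell can be sketched as follows. This is a simplified NumPy version: the bottleneck projections g, h, and r are assumed to have been applied already, and the scores are scaled by c_mid as stated in the text (rather than by its square root, as in standard scaled dot-product attention).

```python
import numpy as np

def gfa_attention(queries, keys, values, c_mid):
    """GFA-Cell sketch: scaled dot-product attention between the N_FPS
    sampled (query) points and all N (key) points.

    queries: (N_fps, c_mid), keys: (N, c_mid), values: (N, c_out).
    Returns (N_fps, c_out) globally aggregated features.
    """
    scores = queries @ keys.T / c_mid            # similarity, scaled by c_mid
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)         # softmax over all N points
    return w @ values                            # aggregation A(.)
```

Because every sampled point attends to the whole cloud, the receptive field is the entire fragment rather than a fixed-size neighborhood.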


LGLayer
According to the explanation in Section 3.2, the output feature of the local-global module, f_slg, is obtained by fusing the outputs of LFA-Cell and GFA-Cell through a nonlinear activation function δ(·), and is of size N_FPS × c_out. In order to obtain sufficient structural information and stabilize the network, M independent local-global modules are concatenated to generate a multi-scale feature with M × c_out channels. M is the total number of scales, and we set M = 3 in this study. As shown in Figure 6, the output of LGLayer is a multi-scale feature that concatenates the structural and semantic features, both local and global. Finally, the multi-scale feature is defined as:

f_ms = C( f_slg1, ⋯, f_slgM )

where C denotes the concatenation operation and f_slgm is the concatenated feature of the m-th local-global module.
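The multi-scale concatenation can be sketched as follows; each random linear map W stands in for one independent local-global module, and the dimensions are assumed for illustration:

```python
import numpy as np

def lglayer(points_feat, weights):
    # Each W_m plays the role of one independent local-global module
    # (a stand-in for the real LFA/GFA cells); each produces N_fps x c_out.
    per_scale = [np.maximum(points_feat @ W, 0.0) for W in weights]  # ReLU
    return np.concatenate(per_scale, axis=1)  # N_fps x (M * c_out)

rng = np.random.default_rng(1)
f = rng.standard_normal((128, 16))
ws = [rng.standard_normal((16, 64)) for _ in range(3)]  # M = 3, c_out = 64
print(lglayer(f, ws).shape)  # (128, 192)
```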

IMLP
The input point cloud has a size of N_FPS × (3 + M·c_out), obtained from the multi-scale local-global modules and the xyz-coordinates. The output has a size of N_FPS × c′. The MLP can only extract the maximum value from the last layer, which is regarded as the global feature of the point cloud; however, it does not make full use of the information contained in the low-level and mid-level features. The low-level features include the rich geometric structure of the original point cloud. IMLP is proposed to solve this problem and can aggregate features effectively. The details of IMLP are shown in Figure 6. Different from MLP, we perform three different scales of convolutions on the features obtained from the previous layer and max-pool the outputs of the three-scale feature vectors. As shown in Figure 6, each point is encoded into the dimensions (c1, c2, c3). The feature vector of each layer is denoted as d_l, and the combined vector D is obtained by concatenating all the d_l, l = 1, 2, 3. The feature vector D has a size of (c1 + c2 + c3) and includes low-level, mid-level, and high-level features. Finally, the dimension of the feature is changed to c′ through a convolution operation.
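The IMLP feature flow can be sketched as follows; per-point linear maps stand in for the 1×1 convolutions, and the dimensions (64, 128, 256) and c′ = 512 are illustrative assumptions:

```python
import numpy as np

def imlp(points, convs, w_out):
    # points: (N, d). Each "conv" is a per-point linear map standing in for
    # one of the paper's three scales of convolutions (c1, c2, c3).
    pooled = []
    f = points
    for W in convs:
        f = np.maximum(f @ W, 0.0)      # per-point features at this level
        pooled.append(f.max(axis=0))    # max-pool over all points
    d = np.concatenate(pooled)          # combined vector D, size c1+c2+c3
    return d @ w_out                    # project to final dimension c'

rng = np.random.default_rng(2)
pts = rng.standard_normal((1024, 9))
convs = [rng.standard_normal(s) for s in [(9, 64), (64, 128), (128, 256)]]
w_out = rng.standard_normal((64 + 128 + 256, 512))
print(imlp(pts, convs, w_out).shape)  # (512,)
```

Pooling every level, not just the last, is what preserves the low-level geometric structure alongside the high-level semantics.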



Data Set and Implementation Detail
Data set. In this section, to demonstrate the proposed framework's effectiveness and efficiency, experiments are conducted on a benchmark of point cloud classification, the ModelNet40/10 data set of CAD models. ModelNet40 and ModelNet10 comprise 9843/3991 training objects and 2468/908 test objects in 40 and 10 classes, respectively.
For the real-world data set, the quantity of Terracotta Warrior fragments is large, the structure is complex, and the fragments vary considerably in shape. To prevent secondary damage to the Terracotta Warrior fragments during the restoration process, the point cloud models of the fragments are obtained using Creaform VIU 718 handheld 3D scanners from Canada. Figure 7 shows some point cloud models used in the experiments. However, the scanning data generally have holes and noise, and data preprocessing must be performed to ensure the accuracy of fragments classification. Data preprocessing usually includes three steps: noise removal, hole filling, and simplification. We first use Geomagic software (Geomagic Wrap, Shenzhen, China) to remove noise points and repair holes manually. For example, 006192-Arm-38 is chosen from the models of Terracotta Warrior fragments. The preprocessed model of 006192-Arm-38 is shown in Figure 8a, and the mesh model is then converted to a point cloud model (see Figure 8b). Its 26,923 points are redundant for processing, so the raw point model is reduced by random sampling; Figure 8c shows the sampled point cloud, which has been downsampled to 40%. In this experiment, 11,996 point cloud patches extracted from 40 whole Terracotta Warriors are used to train the network. The proposed network accepts only a fixed-size point cloud (e.g., 2048 points or 1024 points), so the tens of thousands of points in model 006192-Arm-38 cannot be used directly as the input point cloud. The sampled model (e.g., the model in Figure 8c) is therefore divided into patches, each containing a fixed number of 2048 points obtained by uniform sampling. Figure 9 presents several patches of different parts in model 006192-Arm-38. For the Terracotta Warrior fragments data set, there are four categories: Arm, Body, Head, and Leg.
Among them, 10,144 patches are used for training (Arm: 2656, Body: 2720, Head: 2272, Leg: 2496) and the remaining 1852 for testing (testArm: 476, testBody: 504, testHead: 428, testLeg: 444). All the training and testing data are the same preprocessed data set as in [24]. For training, we sample 1024 points and normalize them into a unit ball as input. The point cloud is augmented by random rotation, jittering each point's position with zero-mean Gaussian noise, and randomly dropping out 20% of the points.
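The augmentation pipeline described above can be sketched as follows; rotation about the vertical axis and the jitter scale `sigma` are assumptions, since the paper does not state these details:

```python
import numpy as np

def augment(points, rng, sigma=0.01, dropout=0.2):
    # points: (N, 3). Random rotation, Gaussian jitter, and random dropout
    # of 20% of the points, as described in the text. Rotation about the
    # vertical (y) axis and sigma = 0.01 are assumed values.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    pts = points @ rot.T + rng.normal(0.0, sigma, points.shape)
    keep = rng.random(len(pts)) >= dropout    # keep each point w.p. 0.8
    return pts[keep]

rng = np.random.default_rng(3)
cloud = rng.standard_normal((1024, 3))
aug = augment(cloud, rng)
print(aug.shape[1], len(aug) < 1024)
```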
Table 1. Architecture configurations. N_down denotes the number of sampled points by FPS; K is the number of group neighbors in LFA-Cell; mlp indicates a list for MLP construction in layers; c_out is the dimension of the output (in Figures 4 and 5), which determines the size of bottleneck and scale; mlp_adv denotes a list for IMLP.

Training. All experiments are implemented on the following hardware: a 3.2 GHz AMD Ryzen 7 2700 eight-core processor with 16 GB of Kingston Impact 2666 MHz CL10 DDR4 RAM on an Asus TUF GAMING B550M-PLUS motherboard. We trained our AMS-Net for 251 epochs on an NVIDIA GTX 1080Ti GPU with TensorFlow v1.13, using the Adam optimizer with an initial learning rate of 0.001, a decay rate of 0.1 every 500 K steps, a momentum of 0.9, and a batch size of 8. The decay rate for batch normalization starts at 0.5 and is gradually increased to 0.99. Batch normalization and ReLU activation are applied after each layer except the last fully connected layer.
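The training schedules can be sketched as follows; the staircase shape of the batch-normalization schedule is an assumption, since the paper only gives the 0.5 → 0.99 endpoints:

```python
def learning_rate(step, base=1e-3, decay=0.1, decay_steps=500_000):
    # Staircase exponential decay: lr = base * decay ** (step // decay_steps),
    # matching "initial learning rate 0.001, decay rate 0.1 every 500 K steps".
    return base * decay ** (step // decay_steps)

def bn_momentum(step, start=0.5, end=0.99, decay_steps=500_000):
    # Assumed schedule: halve the gap to 1.0 every decay interval, capped at
    # 0.99. The paper states only the 0.5 -> 0.99 endpoints.
    m = 1.0 - (1.0 - start) * 0.5 ** (step // decay_steps)
    return min(m, end)

print(learning_rate(0), learning_rate(500_000))  # 0.001 0.0001
```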

Comparing with Other Methods
We compare our AMS-Net with several 3D shape classification methods on ModelNet40/10. The representations of the input data are voxels or point clouds. As is well known, the computational cost increases exponentially when 3D data are rasterized into voxel representations, and most computations are redundant due to the sparsity of 3D data. Owing to their regular and scalable structure, octrees are suitable for deep learning techniques. Nevertheless, as illustrated in Table 2, our method outperforms the networks using octrees or voxel grids as input, e.g., O-CNN (90.6%) and VRN (91.33%), by 2.34% and 1.61%, respectively, in terms of instance accuracy on ModelNet40. Interestingly, our AMS-Net (92.94%) using 1024 points still outperforms previous models such as Kd-Net (32k points, 91.8%), DeepSets (5k points, 90.0%), and SO-Net (2k points, 90.9%). Compared with the xyz-input networks using 1024 points, our AMS-Net also performs well, outperforming PointCNN and PointASNL by 0.74% and 0.04%, respectively. When using the normal vector, our method outperforms all the methods shown in Table 2 except RS-CNN and Point Transformer. Our AMS-Net achieves competitive performance with 0.32% higher accuracy than PointASNL and surpasses methods that use more points (5k), such as PointNet++ and SO-Net.

Table 2. Classification results on ModelNet40/10. ("CA" stands for per-class accuracy; "OA" stands for overall accuracy; "pnt" stands for xyz-coordinates of points and "nor" stands for surface normal vectors; "-" stands for unknown.)


Robustness Test
To evaluate the robustness of our AMS-Net to point cloud density, we train the network with 1024 points and test it with sparser inputs of different sizes, obtained by randomly dropping input points down to 512, 256, and 128 points. As shown in Figure 10a, it is hard to identify the overall shape and obtain the geometric and positional relations of the point cloud when the points become sparse. Figure 10b indicates that even when the number of points is reduced by half, the model can still obtain good results. If there are too few points (e.g., fewer than 256), the accuracy drops sharply.
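The density test amounts to random subsampling of the test clouds, which can be sketched as:

```python
import numpy as np

def subsample(points, n, rng):
    # Randomly keep n of the input points (without replacement), mimicking
    # the sparser test sets of 512, 256, and 128 points.
    idx = rng.choice(len(points), size=n, replace=False)
    return points[idx]

rng = np.random.default_rng(4)
cloud = rng.standard_normal((1024, 3))
for n in (512, 256, 128):
    print(subsample(cloud, n, rng).shape)
```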
To further verify the ability to deal with noise, we conduct the same experiment as PointNet [24] and KC-Net [28] on random noise in the input point cloud. For fair comparison, the training and test sets are the same as those of KC-Net, in which a certain number of randomly selected points are replaced with random noise in the range (−1.0, 1.0) during testing. The comparisons with PointNet and KC-Net are shown in Figure 11. The accuracy of PointNet drops 58.6% when 10 points are replaced with random noise, and KC-Net drops 23.8%, while our AMS-Net only drops 4.08% (from 93.52% to 91.6%). As shown in Figure 11, our AMS-Net is relatively robust to noise. The decrease in accuracy becomes larger when AAS is replaced with max-pooling in the LFA-Cell.
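The corruption protocol used in this noise comparison can be sketched with a hypothetical helper:

```python
import numpy as np

def replace_with_noise(points, n_noise, rng):
    # Replace n_noise randomly selected points with uniform noise in
    # (-1.0, 1.0), following the KC-Net corruption protocol cited above.
    pts = points.copy()
    idx = rng.choice(len(pts), size=n_noise, replace=False)
    pts[idx] = rng.uniform(-1.0, 1.0, size=(n_noise, pts.shape[1]))
    return pts

rng = np.random.default_rng(5)
cloud = rng.standard_normal((1024, 3))
noisy = replace_with_noise(cloud, 10, rng)
print((noisy != cloud).any(axis=1).sum())  # 10
```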

Complexity Analysis
To demonstrate the effectiveness of our model, we compare its complexity with other methods in terms of model size and forward time. The forward time is recorded under the same conditions: batch size 8, a single GTX 1080 Ti, and 1024 points as the input. Table 3 shows that AMS-Net achieves a tradeoff between model complexity and computational complexity. Due to the multi-scale strategy, the forward time of our AMS-Net and PointNet++ (MSG) is longer than that of PointNet and PointNet++ (SSG), which are based on single-scale grouping. However, our AMS-Net achieves the best classification accuracy among the models listed in Table 3.

To further verify that our AMS-Net can obtain a state-of-the-art classification result on 3D Terracotta Warrior fragments, several methods are employed as baselines. As shown in Table 4, the highest mean accuracy of the existing traditional methods is 87.64%. Our AMS-Net without normals (95.68%) achieves 8.04% higher accuracy than the best traditional classification method, which shows its great potential for real applications. Compared with PointNet, our AMS-Net improves the accuracy by 6.75%. With the normal vector, the mean accuracy is up to 96.22%. In [26], a dual-modal network that incorporates geospatial and texture information of the fragments is proposed; however, its accuracy only reaches 91.41% despite the complex algorithm. Our AMS-Net is not only simple and effective but also improves the accuracy by 4.27%, which is attributed to the two strategies of multi-scale feature extraction and the self-attention mechanism. The results prove that our proposed method is better suited to the characteristics of the Terracotta Warrior fragments and can achieve a good classification result.

Table 4. Comparison with the methods proposed in the references.

Method           Input Data Type     Deep Model   OA (%)
Method in [9]    image               F            74.66
Method in [31]   image               F            84.34
Method in [10]   image               F            86.86
Method in [19]   image (cnn-based)   T            89.54
Method in [32]   pnt.                F            87.64
PointNet [24]    pnt.                T            88.93
Method in [25]   pnt.                T            90.94
Method in [26]   pnt.                T            91.41

Without normals, the classification accuracies of the four classes are 98.1% (Body), 98.0% (Head), 94.2% (Leg), and 92.4% (Arm). Figure 12 shows some representative fragments of the four classes. From the results, we can see that the accuracy of class Body is the highest, while the accuracy of class Arm is the lowest. The main reason is that most of the body parts wear armor, or the clothes have many folds, so the characteristics of class Body are more obvious in general (see Figure 12). As shown in Figure 13, there are many distinctive characteristics in the eyes, nose, and headwear, so the result for class Head is only slightly lower than that of class Body. The characteristics of class Arm are similar to those of class Leg. The two fragments in the upper row of Figure 14 are from class Arm, and the two fragments in the bottom row are from class Leg. The surfaces of class Leg are relatively smooth, so if a fragment from class Arm is smooth, it may be misclassified as Leg, as shown in Figure 14e.

Shape Classification with Noise
Even though the coordinate values of the Terracotta Warrior fragments are not normalized, the classification accuracy remains considerable. The above method of adding random noise in (−1.0, 1.0) to ModelNet40 does not apply to the 3D Terracotta Warrior fragments, so Gaussian noise with a mean of 0 and a variance of 1 is added to the input point cloud instead. Figure 15 shows an arm with 10 points replaced with random noise. The accuracy of our AMS-Net is 91.6% when 1 point is replaced with a noise point (N_noise = 1), where N_noise is the number of noise points. The remaining results are 90.82% (N_noise = 10), 79.3% (N_noise = 50), and 67.27% (N_noise = 100). The accuracy drops only 4.86% when 10 points are replaced with noise points (from 95.68% to 90.82%).


Ablation Study
This subsection conducts ablation experiments on the 3D Terracotta Warrior fragments to further evaluate the effectiveness of each cell in our framework. The input data are point clouds with normal vectors.

4.4.1. Experiments of Partial Detail Settings in LFA-Cell

1. Ablation Studies on LGRFE

As a critical cell for extracting local information, LGRFE concatenates much spatial information to obtain local relationships. Five different forms of encoding local geometric features are tested, and the representative symbols are the same as those of f_ij^geo defined in Equation (1). As summarized in Table 5, model E, which contains the full spatial information, has the best classification performance. The relative distance significantly influences the local information, so model D is the suboptimal model. On the contrary, the coordinates of a single point cannot represent the local spatial relationship well; hence model A has the lowest accuracy.
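A full-spatial-information encoding in the spirit of model E can be sketched as follows; the channel ordering is an assumption:

```python
import numpy as np

def local_geometric_feature(p_i, neighbors):
    # For a center point p_i (3,) and its K neighbors (K, 3), concatenate the
    # center point, the neighbor, the relative offset p_i - p_j, and the
    # Euclidean distance ||p_i - p_j||. Channel layout is illustrative.
    rel = p_i[None, :] - neighbors                      # (K, 3)
    dist = np.linalg.norm(rel, axis=1, keepdims=True)   # (K, 1)
    center = np.broadcast_to(p_i, neighbors.shape)      # (K, 3)
    return np.concatenate([center, neighbors, rel, dist], axis=1)  # (K, 10)

rng = np.random.default_rng(6)
feat = local_geometric_feature(rng.standard_normal(3),
                               rng.standard_normal((32, 3)))
print(feat.shape)  # (32, 10)
```

Dropping columns from this concatenation yields the cheaper variants (e.g., coordinates only), which is what the ablation in Table 5 compares.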


Ablation Studies on AAS
To verify the effect of the proposed AAS unit on feature aggregation, it is compared with two symmetric functions: max-pooling (Max) and average-pooling (Avg). As shown in Table 6, our AAS achieves the best performance. The reason may be that AAS uses the attention mechanism to combine all local point features, whereas the other methods often lose most of the information and have more difficulty aggregating neighborhood features. The experimental results also show the effectiveness of the self-attention mechanism.

The comparison of IMLP with other structures is reported in Table 7. MLP denotes the structure widely used in PointNet; CMLP is proposed in [51]. Compared with CMLP, our IMLP adds a feature fusion step to make better use of high-level and low-level information. As shown in Table 7, our IMLP obtains the best accuracy of 95.68%.

To verify the effectiveness of the multi-scale strategy, Table 8 shows the accuracy results of the single-scale and multi-scale models. ASS-Net denotes the single-scale model, which has a structure similar to our AMS-Net. The specific parameters of ASS-Net are defined as follows: in the first level, the number of sampled points is 512, the number of neighbor points k_1 is 32, and a set of MLPs (64, 64, 128) abstracts features into higher dimensions; in the second level, the number of sampled points is 128, k_2 = 64, and the MLPs are (128, 128, 256). It can be seen from Table 8 that our approach outperforms ASS-Net by 1.69%, owing to the fact that our network can extract multi-scale detail features effectively.

Table 8. Accuracy results (%) of ASS-Net vs. AMS-Net.
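The aggregation choices compared in the AAS ablation can be contrasted in a small sketch; `w_score` is a random placeholder for the learned attention weights, and the attentive branch is a simplification in the spirit of AAS, not its exact form:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(neigh_feats, mode, w_score=None):
    # neigh_feats: (K, c) features of one point's neighborhood.
    # "attn": per-neighbor scores -> softmax weights -> weighted sum, so no
    # neighbor's feature is discarded outright, unlike max/avg pooling.
    if mode == "max":
        return neigh_feats.max(axis=0)
    if mode == "avg":
        return neigh_feats.mean(axis=0)
    scores = neigh_feats @ w_score            # (K,)
    return softmax(scores, axis=0) @ neigh_feats

rng = np.random.default_rng(7)
f = rng.standard_normal((16, 32))
w = rng.standard_normal(32)
print(aggregate(f, "max").shape, aggregate(f, "attn", w).shape)
```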

Conclusions and Future Direction
As one of the great discoveries in the history of archaeology in the 20th century, the Terracotta Warriors have become an important channel for spreading Chinese culture. To avoid secondary damage caused by manual repair, virtual splicing and repair have become a research hotspot. However, most current methods are based on traditional techniques and have low accuracy. As a critical step in cultural relic restoration, the accuracy of computer-aided fragments classification directly affects the matching and splicing efficiency.
In this paper, combining the self-attention mechanism with a multi-scale structure, we propose a dynamic fusion framework that mainly focuses on improving classification accuracy by exploiting complex local structures and long-range dependencies. Firstly, an effective local feature aggregation method that can capture local geometric features is proposed. The proposed local operator combines four types of geometric features, e.g., the coordinates of a point, the relative distance, and the Euclidean distance, so LFA-Cell can contain rich local information. Furthermore, to capture more of the point cloud model's overall structure, GFA-Cell based on self-attention is presented. Then, the local-global module integrates the above two cells and can be plugged into existing deep neural networks.
LGLayer consists of M independent local-global modules, which can obtain multi-scale features. We evaluate our AMS-Net (with normals) on ModelNet40 and the real-world Terracotta Warrior fragments data set, achieving 93.52% and 96.22% accuracy, respectively. The experimental results show that the proposed model outperforms many previous methods and obtains state-of-the-art classification accuracy for the 3D Terracotta Warrior fragments. In summary, our AMS-Net achieves improved performance on point cloud classification. Meanwhile, this is the first attempt to apply our AMS-Net to the real-world Terracotta Warrior fragments data set, and the experiments have verified its suitability for real-world applications. We also hope this work can provide a new way for the classification of cultural relics.
However, there are still shortcomings in our method. Although the classification accuracy of our AMS-Net has been improved to a certain extent, the approach can only be trained and operated on a fixed-size point cloud, generally 2048 or 1024 points. When the number of points is larger than the fixed size, the cloud must be sampled down to a sparser point cloud, which loses important geometric information and is not conducive to learning local features. In addition, manually labeled data require a high cost of human labor and may limit the generalization ability of the learned models.

In the future, we plan to design an efficient and lightweight encoder that can directly extend to large-scale point clouds of any size without preprocessing steps such as sampling to a fixed size, while still effectively capturing local information. Unsupervised learning is an attractive direction for obtaining generic and robust representations of the 3D Terracotta Warrior fragments. Learning useful features from unlabeled data is a challenging problem for the virtual restoration of cultural relics and is also our next main work.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available because the Terracotta Warriors are subject to a cultural heritage confidentiality policy.

Acknowledgments:
We thank the Emperor Qinshihuang's Mausoleum Site Museum for providing the Terracotta Warriors data.

Conflicts of Interest:
The authors declare no conflict of interest.