ConvLSTM-Att: An Attention-Based Composite Deep Neural Network for Tool Wear Prediction

Abstract: In order to improve the accuracy of tool wear prediction, an attention-based composite neural network, referred to as the ConvLSTM-Att model (1DCNN-LSTM-Attention), is proposed. Firstly, local multidimensional feature vectors are extracted by a one-dimensional convolutional neural network (1D-CNN), which avoids the loss of wear features caused by manual feature extraction. Then, a long short-term memory (LSTM) network learns the temporal relationships between the multidimensional feature vectors, compensating for the CNN's limited ability to capture long- and short-range dependencies in the sequence. Finally, an attention mechanism is applied to strengthen the extraction of key information from the temporal features of tool wear. The proposed ConvLSTM-Att model is trained with measured tool wear data and then serves as a tool wear predictor. The model is compared with several state-of-the-art models on the PHM tool wear data sets. It significantly outperforms the other models in prediction accuracy while having similar computational complexity.


Introduction
As a result of China's vigorous promotion of the Manufacturing Industry 2025, the machinery manufacturing industry has increasingly higher requirements for intelligence [1,2]. As an important part of machinery production and processing, the degree of wear and tear on tools severely affects the accuracy of workpieces and the manufacturing costs of enterprises. Most traditional tool changing methods rely on experience to determine when to stop and change a tool. Changing a tool too early wastes the tool [3], whereas changing a tool too late reduces the quality of the workpiece and leads to scrapping. During the machining process, timely and accurate prediction of tool wear is beneficial to both improving the machining accuracy of products and reducing the manufacturing and labor costs of enterprises. Therefore, intelligent and accurate prediction of tool wear has become an important topic.
Direct and indirect measurements are the two main approaches to tool wear prediction [4]. The direct measurement approach requires off-line measurement of the tool between machining intervals, which greatly affects the coherence of machining and is difficult to apply in production practice. The indirect measurement method predicts tool wear by mining and analyzing the relationship between the data taken during machining and the tool wear data. However, the data acquired during machining is subject to noise in the industrial environment, which reduces the validity of the data [5].
Traditional machine learning approaches, such as artificial neural networks [6], fuzzy logic [7], the hidden Markov model and support vector machine, as well as metaheuristic approaches [8,9], can be implemented for tool wear prediction, but the prediction accuracy is generally low [10]. It is difficult for traditional machine learning approaches to predict the true data directly from the measured data [11]. Therefore, some preprocessing methods such as time-frequency domain analysis [12], principal component analysis (PCA) [13], and empirical mode decomposition (EMD) [14] are usually required for the data. However, manually extracted features tend to underrepresent the information in the data that reflects tool wear, and because tool wear data are essentially time series, the above approaches cannot capture the time-series features between different data samples.
In recent years, deep learning [6], with its excellent ability to automatically learn features, has provided a new idea for dealing with large-scale mechanical data [15]. It is widely used in the fields of computer vision [16], speech recognition [17], natural language processing [18], and mechanical fault diagnosis [19]. Tool wear prediction models based on deep learning have significantly improved the accuracy of tool wear prediction by virtue of their powerful data processing and feature extraction capabilities [19].
Convolutional neural networks (CNN) perform well in processing data, text, and image recognition. They can analyze the original data and extract high-dimensional hidden features, effectively circumventing the problems that arise from handcrafted feature extraction. Kothuru et al. [20] established a CNN-based tool condition prediction model for tool wear prediction by analyzing the spectral characteristics of the acoustic signals during machining. Xu et al. [21] used CNN to extract features from the vibration data collected during machining and designed an extended convolutional residual block and fully connected layer to achieve effective prediction of tool wear. The long short-term memory (LSTM) network is a type of recurrent neural network that effectively avoids the vanishing gradient problem in the recurrent neural network and better mines the temporal features between different temporal data sets [22]. In order to capture the long-term dependence of tool wear data, Zhao et al. [23] presented a deep LSTM model to predict tool wear by regression, confirming that LSTM has certain advantages over conventional recurrent neural networks in processing temporal data. Cai et al. [24] proposed a hybrid model based on LSTM that extracts the temporal features from the raw sequence data through a designed stacked LSTM and then feeds them to a nonlinear regression model to obtain the predicted tool wear. Chan et al. [25] combined CNN and LSTM and proposed an LSTM-CNN model. It uses CNN to extract tool wear features and then mines the temporal features of tool wear by LSTM to achieve effective prediction of tool wear. Schwendemann et al. [26] proposed a deep learning method based on transfer learning. This method applies windowed envelope, de-noising, and normalization processing to low-frequency sensor data to form an intermediate domain image and uses CNN and LSTM to estimate the remaining useful life, realizing effective transfer learning between different bearing types. Qiao et al. [27] collected vibration and current signals as well as tool wear to construct a training data set and fed the features extracted by a multi-scale convolutional LSTM model to a bidirectional LSTM model to predict tool wear, meeting the requirements of high accuracy and low latency. Attention is a weight assignment mechanism that improves learning accuracy by continuously updating the attention weights corresponding to different features and ignoring unimportant signals [28]. Huang et al. [29] combined CNN with attention and proposed a multiscale CNN based on attention fusion, which improves the accuracy of tool wear prediction by extracting tool wear data through multi-layer convolution and passing the result to a multilayer attention mechanism.
The above methods consider, to some extent, the feature extraction of tool wear data in the spatial and temporal dimensions and the contribution of different data to tool wear in the spatial dimension, respectively. However, they do not consider how differently the data features fused across the spatial and temporal dimensions contribute to tool wear prediction. In order to enhance the extraction of key information while fusing different dimensional features of the measured data, we propose a composite neural network model based on an attention mechanism. The model uses a 1D-CNN to extract multidimensional feature data. Then, in order to fuse the features in the temporal dimension, an LSTM layer is introduced to extract sequential features by learning the temporal relationship between the multidimensional features. Finally, the ability of the model to obtain key wear data features is improved with the help of the attention mechanism, resulting in an efficient and accurate prediction of tool wear.

Composition of the Prediction Model

One-Dimensional Convolutional Neural Network (1D-CNN)
CNNs are feedforward neural networks with special structures that excel at image and speech recognition. Among them, the one-dimensional convolutional neural network (1D-CNN) has excellent performance in processing text data [30]. The 1D-CNN structure, shown in Figure 1, is composed of three major layers, namely, the convolutional layer, the pooling layer, and the fully connected layer [31]. The convolutional layer slides over the elements in its receptive field at a prespecified stride and applies a nonlinear mapping through the activation function to extract the feature vectors. It is the core layer of the whole network. The pooling layer reduces the dimensions of the feature vectors output by the convolutional layer and keeps the locally most significant ones to reduce computation. The fully connected layer is a fully connected neural network that applies a nonlinear transformation to the feature vectors to generate the specified dimensions and passes them on to generate the classification result [32].
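The convolution, activation, and pooling stages described above can be illustrated with a minimal pure-Python sketch. The function names (`conv1d`, `relu`, `max_pool1d`), the toy signal, and the kernel are illustrative assumptions, not the layers or weights used in the paper.

```python
def conv1d(signal, kernel, stride=1):
    """Slide the kernel over the signal (no padding) and return the dot products."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def relu(values):
    """Nonlinear activation: negative responses are set to zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Toy single-channel signal and a hypothetical difference kernel.
signal = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, -2.0, 0.0]
kernel = [-1.0, 0.0, 1.0]
features = max_pool1d(relu(conv1d(signal, kernel)), size=2)
print(features)
```

Pooling halves the length of the activated feature map here while keeping the largest local response, which is the dimension reduction the text refers to.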

Long Short-Term Memory (LSTM) Network
The LSTM is a recurrent neural network with long-term and short-term memories between feature vectors, and the memories can be dynamically adjusted with the input, which largely resolves problems such as memory degradation caused by overly long sequences [33]. At the same time, LSTM can also alleviate the problem of vanishing gradients that exists in conventional recurrent neural networks and can better exploit the temporal relationship between tool and feature vectors. The core idea of LSTM is to introduce input gates, forget gates, and output gates for each memory unit [34]. The structure of LSTM is shown in Figure 2.

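The forget gate decides what is deleted from the memory cell according to the previous hidden layer state $h_{t-1} \in \mathbb{R}^{n}$, where $n$ is the number of LSTM hidden layer neurons, and the current input $d_t \in \mathbb{R}^{k}$, where $k$ is the number of convolution kernels, by the sigmoid function. The input gate decides what to save from the previous hidden state $h_{t-1}$ and the current input $d_t$ according to the sigmoid function and obtains a candidate parameter $\tilde{c}_t \in \mathbb{R}^{n}$ according to the tanh function. Combining the forget gate with the input gate, the current state $c_t \in \mathbb{R}^{n}$ of the memory cell is updated. The output gate uses the sigmoid function to determine what is output from $h_{t-1}$ and $d_t$. The output of the output gate is combined with $c_t$, which is processed by the tanh function, to selectively output the hidden layer state $h_t$ at the current time. The LSTM model is given by

$$
\begin{aligned}
f_t &= \sigma_1\left(W_f\,[h_{t-1}, d_t] + b_f\right),\\
i_t &= \sigma_1\left(W_i\,[h_{t-1}, d_t] + b_i\right),\\
\tilde{c}_t &= \sigma_2\left(W_c\,[h_{t-1}, d_t] + b_c\right),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,\\
o_t &= \sigma_1\left(W_o\,[h_{t-1}, d_t] + b_o\right),\\
h_t &= o_t \odot \sigma_2(c_t),
\end{aligned}
$$

where $f_t \in \mathbb{R}^{n}$ is the forget gate output at time $t$, $i_t \in \mathbb{R}^{n}$ is the input gate output at time $t$, $c_t$ is the state of the memory cell at time $t$, $o_t \in \mathbb{R}^{n}$ is the output gate output at time $t$, $\sigma_1(\cdot)$ is the sigmoid function, $\sigma_2(\cdot)$ is usually selected as the tanh function, $h_t$ is the hidden layer state at time $t$, $d_t$ is the input at time $t$, $\odot$ denotes element-wise multiplication, and $W_f, W_i, W_c, W_o$ and $b_f, b_i, b_c, b_o$ are the weight matrices and bias vectors of the corresponding gates.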
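As a hedged illustration of the gate computations described above, the following pure-Python sketch performs one step of a standard LSTM cell. The parameter names (`W_f`, `b_f`, and so on), the helper `affine`, and the all-zero demonstration weights are assumptions made for this example; they are not the trained parameters of the paper's model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(d_t, h_prev, c_prev, params):
    """One LSTM step: returns the new hidden state h_t and cell state c_t."""
    z = h_prev + d_t  # concatenation [h_{t-1}, d_t]
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]    # input gate
    g = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], z)]  # candidate
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]    # output gate
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Demonstration with n = 2 hidden units and k = 3 input features.
n, k = 2, 3
params = {name: [[0.0] * (n + k) for _ in range(n)] for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: [0.0] * n for name in ("b_f", "b_i", "b_c", "b_o")})
h, c = lstm_step([0.1, 0.2, 0.3], [0.0, 0.0], [1.0, -2.0], params)
print(h, c)
```

With all-zero weights every gate outputs 0.5 and the candidate is 0, so the cell state is simply halved; this makes the role of the forget gate easy to check by hand.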

Attention Mechanism
The attention mechanism is a bionic mechanism in deep learning, whose essence is to give more weight to the parts that need to be focused on [35]. The essential idea of the attention mechanism is shown in Figure 3.
The calculation of the attention mechanism can be divided into two steps. First, the degree of correlation between the state output $h_t$ of the hidden layer of the LSTM layer and the query vector $q \in \mathbb{R}^{n}$ is calculated using the attention score function $s(\cdot)$, and the corresponding attention weight $\alpha_t$ is obtained using the softmax function. Second, the output $\hat{y}_t \in \mathbb{R}^{t}$ of the attention layer is obtained by weighted summation based on the attention weights. The steps are given by

$$
\alpha_t = \operatorname{softmax}\left(s(h_t, q)\right) = \frac{\exp\left(s(h_t, q)\right)}{\sum_{j=1}^{m} \exp\left(s(h_j, q)\right)}, \qquad
\hat{y}_t = \sum_{j=1}^{m} \alpha_j h_j,
$$

where $m$ is the number of time steps.
Then, $\hat{y}_t$ is input into the regression layer to get the predicted tool wear at the current time.
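The two attention steps (scoring followed by softmax, then weighted summation) can be sketched as follows. The dot product is used as the score function purely as an assumption, since the text does not specify the form of the score function; the hidden states and query are toy values.

```python
import math


def attention(hidden_states, query):
    """Return softmax attention weights and the weighted sum of hidden states."""
    # Step 1: score each hidden state against the query (dot product assumed).
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Step 2: weighted summation of the hidden states.
    context = [
        sum(w * h[j] for w, h in zip(weights, hidden_states))
        for j in range(len(query))
    ]
    return weights, context


# Three time steps (m = 3) with two hidden units each.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention(H, query=[1.0, 1.0])
print(weights, context)
```

The third hidden state aligns best with the query and therefore receives the largest weight, which is the "focus on key information" behavior the section describes.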


ConvLSTM-Att Model Construction
According to Zhou et al. [36], the most frequently used sensors in the study of milling process sensor configurations are force, vibration, and acoustic sensors. Therefore, in this paper, the cutting force and vibration along the x-, y-, and z-axes and the acoustic emission at time $t$, $x_t^i \in \mathbb{R}^{1\times 7}$, with $i \in \{1, 2, \ldots, m\}$ being the tap index, are selected as the inputs for training the ConvLSTM-Att model, $x_t = (x_t^1, \cdots, x_t^m)$, $x_t \in \mathbb{R}^{m\times 7}$. A one-dimensional CNN is first used to extract the high-dimensional features in the measured time-domain data through multiple convolutional layers. The weights of the CNN are updated by backpropagating the error calculated by the loss function. As the number of network layers increases, the problem of the vanishing gradient becomes increasingly obvious. To reduce the impact of vanishing gradients, each convolutional layer is followed by a connection layer, which includes batch normalization, ReLU, and max-pooling layers. The structure is shown in Figure 4.

As shown in Figure 5, the ConvLSTM-Att model first automatically mines the features in the tool wear data $x_t$ through the 1D-CNN layer and uses them as the input of the LSTM layer. Then, through the LSTM layer, it learns the temporal relationship between the multidimensional vectors and gets the corresponding hidden layer state vector $h_t$. Finally, through the attention layer, the attention weight $\alpha_t$ of each input is calculated, and the predicted tool wear $\hat{y}_t \in \mathbb{R}^{t}$ is obtained through the nonlinear regression layer.
The model is trained by minimizing the mean square error (MSE) between the predicted and measured wear,

$$
\mathrm{Loss} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2,
$$

where $\hat{y}_i$ is the predicted wear, $y_i \in \mathbb{R}^{t}$ is the measured wear, and $N$ is the number of samples. The accuracy indices are the mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination $R^2$, given by

$$
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2},
$$

where $\bar{y}$ is the average of all measured wear.
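For readers who want to reproduce the accuracy indices, a minimal sketch of the MSE, MAE, RMSE, and R2 computations using their standard definitions is given below. The function name `regression_metrics` and the three wear values are illustrative assumptions, not data from the PHM data set.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, RMSE and R2 for measured and predicted wear values."""
    n = len(y_true)
    errors = [p - t for p, t in zip(y_pred, y_true)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "MSE": ss_res / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(ss_res / n),
        "R2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical measured and predicted wear values in micrometres.
metrics = regression_metrics([100.0, 110.0, 120.0], [102.0, 108.0, 120.0])
print(metrics)
```

Smaller MAE and RMSE and an R2 closer to 1 indicate a better fit between predicted and measured wear.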

Experimental Setting
The proposed method was validated using the tool wear data set published by the PHM Association in 2010 [37]. The experimental parameters are shown in Table 1. The tool used for the experiments was a three-tooth end mill, and the material machined was HRC52 stainless steel. The experiments were conducted at room temperature with dry cutting, that is, no cutting fluid was used. The force and vibration data in the x-, y-, and z-axes during each machining process and the acoustic emission data during the tool move process were collected using a Kistler force gauge, an acceleration sensor, and an acoustic emission sensor, respectively. During each machining process, the tool cut in the x-direction had a length of 108 mm. After each tool move, the wear of the rear face of each tool cutter flute was measured with a LEICA MZ12 microscope to get the wear of each tool move. The sensor setup is shown in Figure 6.
A total of three tools (C1, C4, and C6) were used to collect data for the experiments, each acquiring 315 samples for a total of 945 samples. Each sample contains seven components as input, namely the cutting forces $F_x$, $F_y$, $F_z$, the vibrations $Z_x$, $Z_y$, $Z_z$, and the acoustic emission $\beta$, and the wear of the three cutter flutes as output. According to the recommendations of ISO 8688-2 (1989), the average of the wears of the three cutter flutes was taken as the measured wear of that sample, which was used as the output in the data set. Take the tool C1 as an example, as shown in Figure 7. According to the trend of tool wear, 0~50 tool moves (wear < 90 µm) were classified as the initial wear stage, 51~190 tool moves (wear < 120 µm) as the normal wear stage, and 191~315 tool moves (wear < 173 µm) as the severe wear stage.
The experiments were conducted at a sampling frequency of 50 kHz, resulting in up to 200,000 data points per sample. In order to exclude the interference of the incoming and outgoing cutters and simplify the computation, three segments of 5000 data points were selected, as shown in Figure 8. It can be seen from Figure 8 that the maximum and minimum values of the data in each segment are similar. In order to retain the data characteristics of each segment, the average value of the three segments was taken as the model input for training, so that each wear value corresponded to tool wear data with a size of 5000 × 7.
In order to obtain a reliable and stable model, cross validation was used, and the data sets were divided as follows. Among the C1, C4, and C6 data sets, two of the data sets are divided into a training set and a validation set for model training and model parameter adjustment. The ratio of the training set to the validation set is 8:2. The third data set was used as a test set for model evaluation.
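The preprocessing described above (averaging three segments, max-min normalization, and an 8:2 split) can be sketched in pure Python. The helper names, the tiny 2 × 2 segments, and the choice to split without shuffling are assumptions made for illustration; the real inputs are 5000 × 7 arrays.

```python
def average_segments(segments):
    """Element-wise mean of equally shaped segments (each a list of rows)."""
    count = len(segments)
    rows, cols = len(segments[0]), len(segments[0][0])
    return [
        [sum(seg[r][c] for seg in segments) / count for c in range(cols)]
        for r in range(rows)
    ]


def min_max_normalize(rows):
    """Scale each column (sensor channel) to [0, 1]; constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


def split_train_val(samples, train_parts=8, total_parts=10):
    """Split samples into training and validation parts (8:2 by default)."""
    cut = len(samples) * train_parts // total_parts
    return samples[:cut], samples[cut:]


# Three toy segments with 2 time steps and 2 channels each.
segments = [[[1.0, 2.0], [3.0, 4.0]], [[3.0, 4.0], [5.0, 6.0]], [[5.0, 6.0], [7.0, 8.0]]]
averaged = average_segments(segments)
normalized = min_max_normalize(averaged)

# Two tools with 315 samples each form the training and validation pool.
train_ids, val_ids = split_train_val(list(range(630)))
print(averaged, normalized, len(train_ids), len(val_ids))
```

With two tools of 315 samples each, the 8:2 ratio yields 504 training and 126 validation samples, while the third tool's 315 samples are held out for testing.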

Experimental Design and Parameter Setting
In order to verify the accuracy and effectiveness of tool wear prediction based on the proposed ConvLSTM-Att model, experiments were conducted with the CNN, CNN-LSTM, and ConvLSTM-Att models on the same data set.
In the convolutional model, a small convolutional kernel can reduce the complexity of the model computation, and a small convolutional step size can improve the accuracy of the feature extraction results. Therefore, the size of the convolution kernel is set to (3, 7), the step length is set to 2, and the boundary fill is 1. To avoid losing too many features in the downscaling process, the pooling layer is selected as max-pooling, and the pooling size is set to (4, 4). The number of epochs is 500, the initial learning rate is 0.001, which decays as the epoch increases, and the optimizer is Adam. To improve the robustness and generalization ability of the model, a dropout layer is added to the regression layer, and the retention rate is set to 0.5. The number of LSTM hidden layer neurons is 128. The maximum error is 6.3.
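To make the effect of these settings concrete, the sketch below applies the usual output-length formula, floor((L + 2p - k) / s) + 1, along the time axis only. Treating the first kernel dimension (3) as the extent along the time axis, and assuming non-overlapping pooling of size 4, are assumptions of this example; the paper does not spell out these details.

```python
def conv_output_length(length, kernel, stride, padding):
    """Output length of a convolution along one axis."""
    return (length + 2 * padding - kernel) // stride + 1


def pool_output_length(length, size):
    """Output length of non-overlapping pooling (stride equal to the pool size)."""
    return length // size


# 5000 time steps per sample, kernel 3, stride 2, padding 1, then pooling by 4.
after_conv = conv_output_length(5000, kernel=3, stride=2, padding=1)
after_pool = pool_output_length(after_conv, size=4)
print(after_conv, after_pool)
```

Under these assumptions a single convolution-pooling stage shortens the 5000-step sequence to 2500 and then 625 steps, which illustrates how quickly the time axis is compressed before the LSTM layer.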


Comparison and Analysis of Experimental Results
The models are trained on the training set, the trained model parameters are adjusted on the validation set, and the models are evaluated on the test set. For each model, RMSE, MAE, and R2 are calculated on each test set separately. A large error means that the model is not properly fitted, and the parameters or the model structure need to be readjusted and the model retrained. The smaller the error, the higher the accuracy of the model. The prediction errors of tool wear for different models on different test sets are shown in Table 2. The calculation times of the 1D-CNN, CNN-LSTM, and ConvLSTM-Att models are shown in Table 3. The performance results of different models for tool wear prediction on different test sets are shown in Figures 9-11. It can be seen from Figure 9 that the 1D-CNN model has the worst overall performance. The model-fitting results are poor, and the predictions of this model at the initial and severe wear stages of the tool have a large gap with the measured wear, which leads to generally large model errors. As shown in Figure 10, the accuracy of the CNN-LSTM model is substantially improved compared with that of the 1D-CNN model, and the predicted wear of the model basically fits the measured wear. Because the measured time-domain data of tool wear are essentially sequential data, adding the learning of temporal features to the 1D-CNN can improve the accuracy of the model's prediction. However, there is still a large error in predicting the wear at the initial and severe wear stages of the tool, especially at the severe wear stage for tool C6. From Figure 11, the tool wear predicted using the ConvLSTM-Att model is closer to the measured one. The model improves the ability to extract the key features of tool wear and thus achieves a higher level of accuracy in predicting the normal and severe wear stages of the tool. Although the prediction of the initial tool wear stage still has some errors, it is improved compared to the CNN-LSTM model.
From Table 3, the ConvLSTM-Att model has a higher complexity than the 1D-CNN and CNN-LSTM models due to the extra structural components in the model. Its complexity is higher than that of 1D-CNN and CNN-LSTM by 91.6% and 14.8%, respectively. The RNN and 1D-CNN models have similar complexity, but the 1D-CNN model has better prediction accuracy. The CNN-LSTM and TDconvLSTM models have similar complexity and prediction accuracy. Compared with the CNN-LSTM model, the ConvLSTM-Att model has similar complexity but better prediction accuracy.
On the three different test sets, the ConvLSTM-Att model reduced the RMSE by 60.8%, 67.4%, and 63.7%, and the MAE by 61.7%, 71.3%, and 68.8%, respectively, compared with the 1D-CNN model. When compared with the CNN-LSTM model, the RMSE was reduced by 34.4%, 37.1%, and 46.3%, respectively, and the MAE decreased by 39.6%, 37.6%, and 47.7%, respectively. The results of the ConvLSTM-Att model are also compared with several state-of-the-art models using the same data sets, as shown in Table 2. Compared with these models, the ConvLSTM-Att model has the smallest RMSE and MAE in predicting tool wear, which further confirms the effectiveness and superiority of the ConvLSTM-Att model.
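The percentage reductions quoted above are relative changes with respect to the baseline model's error. A small sketch of that arithmetic follows; the error values used are hypothetical placeholders, not entries from Table 2.

```python
def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical RMSE values: a baseline of 10.0 reduced to 4.0.
reduction = relative_reduction(10.0, 4.0)
print(f"{reduction:.1f}%")
```

The same function applies to MAE, or to calculation time when comparing complexity, by passing the corresponding pair of values.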


Conclusions
In this paper, a tool wear prediction model based on an attentional composite neural network is proposed. Based on a large amount of tool wear data, the original time-domain data in the tool machining state is used as input, and the wear features are mined by 1D-CNN. Using LSTM and attention mechanisms to learn the important tool wear features, the measured wear is approximated by regression. By comparing the predicted tool wear and the measured tool wear, it is confirmed that the ConvLSTM-Att model can well reflect the trend of aggravating tool wear during machining. Under the same test set, the error rate in the tool wear prediction using the ConvLSTM-Att model has decreased significantly compared with other state-of-the-art models. For different test sets, the prediction

Figure 1. Network structure of the 1D-CNN. $n_c$ is the number of the convolution-pooling layers.

Figure 3. The attention mechanism.

Figure 4. A connection layer following each convolutional layer.

The ConvLSTM-Att Model for Predicting the Tool Wear Process
All data are first normalized and processed and then divided into different data sets. The ConvLSTM-Att model is trained using the training set data. The accuracy of the model is improved by adjusting the model parameters and structure according to the results on the validation set. The optimal ConvLSTM-Att model is then applied to the test set, and the predicted wear is output to evaluate the model's performance. The specific operations are given in Algorithm 1.

Algorithm 1. Training of the ConvLSTM-Att model
Input: Tool wear data set D, learning rate η, epoch T, LSTM neuron number n, maximum error ε.
1. Process all data by max-min normalization.
2. Divide D into the training set D_tr, validation set D_v, and test set D_te.
3. Perform training with D_tr:
   do
      Adjust η, T, n, ε.
      Train the ConvLSTM-Att model with D_tr.
      Validate the currently trained model using D_v.
   while (Loss(ŷ, y) has not converged && Loss > ε)
   return the current trained model.
4. Apply the optimal model to D_te.
Output: ŷ for D_te

The loss function of the model is the mean square error (MSE) function.
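The control flow of Algorithm 1 (train, validate, and stop once the loss has converged or fallen below the maximum error ε) can be sketched with a deliberately tiny stand-in model. Here a one-parameter linear model fitted by gradient descent replaces the ConvLSTM-Att network, and the names `train`, `tol`, and the toy data are assumptions of this example.

```python
def mse(pred, target):
    """Mean square error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def train(train_set, val_set, lr=0.1, epochs=500, eps=1e-6, tol=1e-12):
    """Fit y = w * x by gradient descent; return the weight with the best validation loss."""
    xs, ys = train_set
    w, best_w = 0.0, 0.0
    best_val, prev_loss = float("inf"), float("inf")
    for _ in range(epochs):
        # Gradient of the MSE loss with respect to w on the training set.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
        loss = mse([w * x for x in xs], ys)
        val_loss = mse([w * x for x in val_set[0]], val_set[1])
        if val_loss < best_val:  # keep the model that validates best
            best_w, best_val = w, val_loss
        # Stop when the loss is below eps or no longer changes (converged).
        if loss <= eps or abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return best_w


# Toy data generated from y = 2 x, already scaled to [0, 1] on the input side.
w = train(([0.0, 0.5, 1.0], [0.0, 1.0, 2.0]), ([0.25, 0.75], [0.5, 1.5]))
test_prediction = [w * x for x in [0.1, 0.9]]
print(w, test_prediction)
```

In the paper's setting the inner update would be an Adam step on the ConvLSTM-Att parameters, and the outer adjustment of η, T, n, and ε would be carried out between training runs based on the validation results.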

Figure 7. Division of tool wear stages.

The model is built with the PyTorch deep learning library. It is coded in Python and runs on Python 3.7, CUDA 11.4, and Windows with an Intel Core i7 CPU and an NVIDIA GeForce MX350 GPU.

Figure 8. A close-up of the measured data.

Figure 9. Predicted results of 1D-CNN model on the C1, C4 and C6 test sets.

Figure 10. Predicted results of CNN-LSTM model on the C1, C4 and C6 test sets.

Figure 11. Predicted results of ConvLSTM-Att model on the C1, C4 and C6 test sets.

Table 2. Error performance of different models on different test sets.

Table 3. The calculation time of the 1D-CNN, CNN-LSTM and ConvLSTM-Att models.