A Dual-Stage Attention Model for Tool Wear Prediction in Dry Milling Operation

Intelligent monitoring of tool wear status and wear prediction are important factors in the intelligent development of the modern machinery industry. Many scholars have applied deep learning methods to tool wear prediction with some success. However, because the signal data are unstable and variable, some neural network models suffer from gradient decay between layers. In addition, most methods focus on feature selection of the input data but ignore the degree to which different features influence tool wear. To solve these problems, this paper proposes a dual-stage attention model for tool wear prediction. A CNN-BiGRU-attention network model is designed, which introduces self-attention to extract deep features and emphasize the more important ones. The IndyLSTM is used to construct a stable network that addresses the gradient decay problem between layers. Moreover, an attention mechanism is added to the network to capture the important information in the output sequence, which can improve the accuracy of the prediction. An experimental study of tool wear prediction in a dry milling operation is carried out to demonstrate the viability of this method. Experimental comparison and analysis with regression evaluation indexes show that the proposed method can effectively characterize the degree of tool wear, reduce the prediction errors, and achieve good prediction results.


Introduction
With the continuous improvement of sensor technology, Internet of Things technology, and deep learning algorithms, industrial intelligent manufacturing systems are developing rapidly and increasingly integrating emerging technologies. In industrial intelligent manufacturing, most machining is cutting, which inevitably causes tool wear. Tool wear refers to the process in which metal material on the tool surface is continuously lost and the surface morphology continuously changes due to the mechanical, chemical, and thermal effects of cutting [1].
Tool wear has an important impact on the machining process. If a tool is worn to the scrap state but machining continues, it can damage the workpiece and even break down the machine tool, directly affecting processing efficiency, product quality, and production cost [2,3]. The traditional approach is to change tools at regular intervals, which wastes material. This approach has gradually been replaced by intelligent tool changing based on the prediction of tool wear. Replacing severely worn tools on the basis of online monitoring and real-time prediction can not only improve the tool utilization rate and processing quality, but also reduce safety accidents and the shutdown rate.

Related Works
With the development of deep learning in recent years, many deep learning methods based on predictive analysis have been widely adopted for tool condition monitoring and tool wear prediction [9][10][11]. Dai, Zhang, and Meng used a stacked sparse auto-encoder network to reduce the feature vectors and built a least squares support vector machine prediction model with parameters tuned by cuckoo optimization [12]. Cao, Sun, and Zhang developed an adaptive method that uses a deep network to replace manual feature extraction from signals and proposed an on-line tool wear monitoring model based on convolutional neural networks [13]. Wu, Jennings, and Terpenny introduced a method based on random forests for tool wear prediction [14]. To realize real-time and accurate monitoring of tool wear in the machining process, Kong, Dong, and Chen presented a model based on the integrated radial basis function with kernel principal component analysis (KPCA_IRBF) and the relevance vector machine (RVM) [15].
Because the input data are time series that change dynamically, the recurrent neural network (RNN), which introduces a cyclic structure, can model them better than other neural networks. Therefore, the RNN and its variants, such as LSTM and GRU, have been widely applied to tool wear prediction and related tasks. For example, a recurrent neural network based on a health indicator (RNN-HI) for RUL prediction of bearings was proposed by Guo, Li, and Jia [16]. Zhu, Xie, and Li established a tool wear monitoring model on the basis of long short-term memory neural networks [17]. A deep neural network structure named convolutional bi-directional long short-term memory (CBLSTM) has been designed to address raw sensory data [18]. A hybrid prediction scheme based on a newly developed deep heterogeneous GRU model, along with local feature extraction, was presented to solve long-term prediction problems [19]. Inspired by the success of deep learning methods that redefine representation learning from raw data, Zhao, Wang, and Yan [20] proposed a network named local feature-based gated recurrent unit (LFGRU). It is a hybrid approach that combines handcrafted feature design with automatic feature learning for machine health monitoring.
Both GRU and LSTM are special RNN structures that were proposed to solve the vanishing gradient problem in RNNs. Although these structures alleviate the gradient problems to some extent, their tanh and sigmoid activation functions cause gradient attenuation between layers. IndyLSTMs (independently recurrent long short-term memory cells) [21] were proposed on the basis of IndRNN [22] by adding the gate structure of LSTM. Compared with the traditional LSTM, the recurrent weight is no longer a full matrix but a diagonal matrix. In each layer of IndyLSTM, the number of recurrent parameters grows linearly with the number of nodes, whereas in the traditional LSTM it grows quadratically. This makes the model smaller and faster, and the accuracy of this model is better than that of the LSTM model in most cases. Therefore, the IndyLSTM is introduced to build a new network that addresses gradient attenuation between layers and yields a stable and accurate model for tool wear monitoring.
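The linear-versus-quadratic argument can be checked with a little arithmetic. The sketch below counts only the recurrent weights of a single layer, assuming four gates per cell (as in a standard LSTM); the function names are illustrative and do not come from the cited papers.

```python
def lstm_recurrent_params(hidden_size: int) -> int:
    # Four gates, each with a full hidden_size x hidden_size recurrent matrix.
    return 4 * hidden_size * hidden_size


def indylstm_recurrent_params(hidden_size: int) -> int:
    # Four gates, each with a diagonal recurrent matrix (one weight per unit).
    return 4 * hidden_size


for n in (32, 64, 128):
    print(n, lstm_recurrent_params(n), indylstm_recurrent_params(n))
```

Doubling the number of units doubles the IndyLSTM recurrent weights but quadruples those of the LSTM, which is where the size and speed advantage comes from.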
Moreover, most tool wear prediction models based on the recurrent neural network focus on the selection of input features and ignore the degree to which each input feature influences the tool wear. The attention mechanism is widely used in various types of deep learning tasks, such as natural language processing, image recognition, and speech recognition. As a resource allocation mechanism, it assigns different weights to the input features so that features containing important information do not fade as the number of time steps increases, which highlights the impact of the more important information. In this way, the network is used more fully, which helps keep the prediction quality stable over a longer period [23].
In summary, existing works have used deep networks instead of traditional methods, such as manual or machine learning methods, to extract features from signals to improve the prediction accuracy. However, existing models, such as IndyLSTM, ignore the difference in the degree of influence of selected input features on tool wear. In addition, existing methods do not make full use of prior knowledge to improve model performance.

Dual-Stage Attention Prediction Model
The framework of the tool wear prediction model based on dual-stage attention is shown in Figure 1. The whole model mainly includes three layers: the feature engineering layer, the deep feature extraction layer, and the model prediction layer. After the initial feature engineering of the raw signals, the CBGA network is used to extract deep features. Finally, the IndyLSTM-Attention model is trained to output the wear values, producing a stable model for real-time prediction of tool wear.

Feature Engineering
The feature engineering layer consists of data cleaning, feature extraction, and feature selection. In this part, data cleaning mainly includes zero-averaging, removing the trend term, and normalizing signals. Meanwhile, according to the wavelet packet decomposition theory, high-frequency noise will be filtered out. Using the common signal analysis methods on the cleaned data, the statistical features of signal data are extracted from three domains. Combined with the existing research, this paper integrates time domain, frequency domain, and time-frequency to analyze the sensor signals comprehensively. After feature fusion, a preliminary feature selection is conducted on the extracted features. The flow of feature engineering is shown in Figure 2.
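The cleaning steps named above can be sketched in a few lines. This is a minimal illustration and not the authors' pipeline: it assumes a linear trend term and z-score normalization (neither is specified in the text), and it omits the wavelet-packet denoising step. The function names are hypothetical.

```python
from statistics import mean, pstdev


def remove_linear_trend(signal):
    """Subtract the least-squares straight line fitted to the samples."""
    n = len(signal)
    t_mean = (n - 1) / 2
    s_mean = mean(signal)
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (s - s_mean) for t, s in enumerate(signal)) / denom
    return [s - (s_mean + slope * (t - t_mean)) for t, s in enumerate(signal)]


def clean(signal):
    """Zero-mean, detrend, then scale to unit standard deviation."""
    offset = mean(signal)
    centered = [s - offset for s in signal]
    detrended = remove_linear_trend(centered)
    scale = pstdev(detrended) or 1.0  # avoid dividing by zero on flat signals
    return [s / scale for s in detrended]


# Example: a drifting signal with an alternating component.
raw = [0.5 * t + (1.0 if t % 2 else -1.0) for t in range(100)]
cleaned = clean(raw)
```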

Feature Extraction: Signal Analysis
Time domain analysis uses time axis as the coordinate to express the relationship between dynamic signals. It can effectively improve the signal-to-noise ratio and find the similarity and correlation of signal waveform transformations at different times. These obtained features can reflect the operating status of mechanical equipment.
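As a rough illustration of time domain analysis, the sketch below computes a handful of common statistical features from one signal segment. The specific features shown here are assumptions chosen for illustration; the 13 time-domain features actually used are those listed in Table 1.

```python
import math


def time_domain_features(x):
    """Return a few common time-domain statistics of a signal segment."""
    n = len(x)
    mean = sum(x) / n
    rms = math.sqrt(sum(v * v for v in x) / n)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    peak = max(abs(v) for v in x)
    # Standardized third and fourth moments; defined as 0 for a flat signal.
    skew = sum((v - mean) ** 3 for v in x) / (n * std ** 3) if std else 0.0
    kurt = sum((v - mean) ** 4 for v in x) / (n * std ** 4) if std else 0.0
    return {
        "mean": mean,
        "rms": rms,
        "std": std,
        "peak": peak,
        "peak_to_peak": max(x) - min(x),
        "skewness": skew,
        "kurtosis": kurt,
        "crest_factor": peak / rms if rms else 0.0,
    }


features = time_domain_features([1.0, -1.0, 1.0, -1.0])
```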
Frequency domain analysis transforms the signals to the frequency axis. This method based on frequency characteristics makes up for the shortcomings of time domain analysis. It indirectly reveals the time domain performance of signals and easily displays the effect of system parameters on system performance. In this paper, spectrum analysis is used to analyze the signals after fast Fourier transform and to extract frequency domain features.
Wavelet analysis is a common time-frequency domain analysis method, which takes the signal information in both the time domain and the frequency domain into account. The level of wavelet decomposition is determined by analyzing the frequency spectrum of the sampled signal. The energy of each frequency band and the total energy entropy are taken as the time-frequency features after decomposition. In Formula (1), $F_s$ is the sampling frequency, $n$ is the number of layers, and $f_{min}$ is the minimum frequency band.
The sampling frequency is 50 kHz, so a five-layer wavelet packet decomposition is performed. The $2^5 = 32$ frequency band energies and the energy entropy are then taken as time-frequency domain features. Thus, the signal analysis selects 13 features in the time domain, 3 features in the frequency domain, and 33 features in the time-frequency domain, so that 49 different features are extracted for each sensor channel. The main extracted features are shown in Table 1.
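The 33 time-frequency features (32 band energies plus one energy entropy) can be sketched as follows. This is an illustration under stated assumptions: the Haar wavelet is used because it is easy to write without a library (the text does not name the wavelet basis), the entropy uses the natural logarithm, and the function names are hypothetical.

```python
import math


def haar_split(x):
    """One Haar analysis step: return (approximation, detail) halves."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    return approx, detail


def wavelet_packet_leaves(x, levels):
    """Split every node at every level, giving 2**levels leaf bands.

    Leaves come out in natural (filter-bank) order, not sorted by frequency.
    """
    nodes = [list(x)]
    for _ in range(levels):
        nodes = [half for node in nodes for half in haar_split(node)]
    return nodes


def band_energy_features(x, levels=5):
    """Return (band energies, energy entropy) for a non-zero signal."""
    leaves = wavelet_packet_leaves(x, levels)
    energies = [sum(c * c for c in leaf) for leaf in leaves]
    total = sum(energies)
    probs = [e / total for e in energies if e > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return energies, entropy


# The signal length should be a multiple of 2**levels.
signal = [math.sin(0.3 * i) for i in range(64)]
energies, entropy = band_energy_features(signal)
```

Because the Haar transform is orthonormal, the band energies sum to the energy of the original signal, which is a convenient sanity check on any implementation.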

Table 1. Main extracted features (columns: Domain, Feature Name, Formula).

Feature Selection: Based on MIC
Unnecessary features reduce the training speed and the generalization performance on the test set. In this paper, the features are selected and reduced by the maximum information coefficient (MIC). MIC can express various linear and non-linear relationships, and its value ranges between 0 and 1. The higher the value, the stronger the correlation, so it has been widely used to select features in machine learning [24]. MIC builds on the concept of mutual information, which measures the degree of interdependence between two random variables. The mutual information is given by the following equation:

$$I(x; y) = \iint p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy \qquad (2)$$

where $p(x, y)$ is the joint probability density function of $x$ and $y$, $p(x)$ is the marginal probability density function of $x$, and $p(y)$ is the marginal probability density function of $y$.
The calculation formula of MIC is shown in Formula (3):

$$\mathrm{MIC}(x; y) = \max_{a \cdot b < B} \frac{I(x; y)}{\log_2 \min(a, b)} \qquad (3)$$

In the above formula, $a$ and $b$ are the numbers of dividing grids in the $x$ and $y$ directions, $B$ is a variable, and the size of $B$ is generally set to the 0.5 or 0.6 power of the total amount of data. The maximum information coefficient between each statistical feature and the tool wear is calculated, the target number of features is chosen according to the correlation, and the feature vectors after feature selection are returned.
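A simplified version of this calculation can be sketched in pure Python. Note the loud assumption: the full MIC algorithm also optimizes where the grid lines are placed, whereas this sketch only tries equal-width grids of every size with a * b < B, so it is a rough approximation that can underestimate the true value. The function names are hypothetical.

```python
import math


def grid_mutual_information(xs, ys, a, b):
    """Mutual information (in bits) of xs and ys on an a-by-b equal-width grid."""
    def bin_index(v, lo, hi, k):
        if hi == lo:
            return 0
        return min(int((v - lo) / (hi - lo) * k), k - 1)

    n = len(xs)
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    joint = {}
    for x, y in zip(xs, ys):
        key = (bin_index(x, x_lo, x_hi, a), bin_index(y, y_lo, y_hi, b))
        joint[key] = joint.get(key, 0) + 1
    px, py = {}, {}
    for (i, j), c in joint.items():
        px[i] = px.get(i, 0) + c
        py[j] = py.get(j, 0) + c
    return sum(
        (c / n) * math.log2((c / n) / ((px[i] / n) * (py[j] / n)))
        for (i, j), c in joint.items()
    )


def approx_mic(xs, ys, exponent=0.6):
    """Maximize normalized grid MI over grid sizes with a * b < B = n**exponent."""
    limit = len(xs) ** exponent
    best = 0.0
    for a in range(2, int(limit) + 1):
        for b in range(2, int(limit) + 1):
            if a * b < limit:
                mi = grid_mutual_information(xs, ys, a, b)
                best = max(best, mi / math.log2(min(a, b)))
    return best


def select_top_k(features, target, k):
    """Rank feature columns (name -> values) by approx_mic and keep the top k names."""
    ranked = sorted(features, key=lambda name: approx_mic(features[name], target),
                    reverse=True)
    return ranked[:k]
```

For example, `select_top_k(features, wear, 40)` would keep the 40 feature columns that score highest against the wear values under this approximation.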

Deep Feature Extraction: CBGA Network
In this section, a CNN-BiGRU-attention (CBGA) network is proposed to encode and mine the deep features. It consists of CNN, Bi-GRU, and the attention mechanism, expanding the features in two dimensions of space and time.
CNN is a neural network with a deep structure that includes convolution calculations. It uses local connections and weight sharing to process the original data at a higher and more abstract level and can effectively extract local features of the data. However, CNN is mostly used for static outputs and has difficulty capturing dynamic characteristics, especially when the data fluctuate or are unstable.
Bi-GRU can capture long-term dependencies and describe the continuous state output in time. It is suitable for analyzing time series data because of its memory function. The bidirectional structure can make full use of historical information and learn the dynamics in both the forward and backward directions at the same time.
Simultaneously, self-attention as one of the attention mechanisms can discover the internal characteristics of sequence data and highlight the important features. Thus, the self-attention layer is added to obtain final deep features. Based on the above theory, the CBGA network is designed to further process the feature vectors. Figure 3 shows the whole structure of the CBGA. As shown in this figure, the CBGA network can be divided into four parts: the multi-channel convolution layer, the max-pooling layer, the bidirectional GRU layer, and the attention layer.
Assume that the number of initially extracted features is $n$. The input of this module is an $n$-dimensional feature vector $I = [i_0, i_1, \ldots, i_n]$. First, multi-channel convolution is performed and the output sequences are concatenated to obtain a $t$-dimensional vector $C = [c_0, c_1, \ldots, c_t]$, where $t = k \cdot f$, $k$ is the number of convolution layers, and $f$ is the number of filters of the convolutional neural network.
The max-pooling operation is then carried out. After that, full-sequence feature extraction is conducted on the input through the bidirectional GRU. The output of this part, $G = [g_0, g_1, \ldots, g_m]$, is the concatenation of the results of the forward GRU and the backward GRU, where $m = 2 \cdot h$, and $h$ is the number of hidden units of the Bi-GRU. Finally, the attention value of each GRU node is calculated in the attention layer. A weighted feature vector $Z = [z_0, z_1, \ldots, z_m]$ is obtained as the final deep feature encoding vector.
Expressed mathematically, the self-attention mechanism takes the input sequence $G = [g_0, g_1, \ldots, g_m]$ from the Bi-GRU layer and produces the output sequence $Z = [z_0, z_1, \ldots, z_m]$. Three sets of vector sequences are obtained through linear transformation:

$$Q = W_Q G, \quad K = W_K G, \quad V = W_V G$$

where $Q$, $K$, and $V$ are the query vector sequence, key vector sequence, and value vector sequence, respectively, and $W_Q$, $W_K$, and $W_V$ are learnable parameter matrices. In the definition of self-attention, $Q = K = V = G$, so the output vector $z_i$ is calculated as:

$$z_i = \sum_{j=1}^{m} \mathrm{softmax}\big(s(k_j, q_i)\big)\, v_j$$

where $i, j \in [1, m]$ are the positions of the output and input vector sequences, and $s(k_j, q_i)$ is a function that calculates the similarity between two vectors.
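The self-attention step can be illustrated with a small pure-Python sketch. It follows the Q = K = V = G case stated in the text (no learned projection matrices) and assumes a scaled dot product as the similarity function, which the text does not specify. Each element of G is treated as a vector, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(G):
    """G is a list of m vectors; returns m attention-weighted vectors Z."""
    d = len(G[0])
    Z = []
    for q in G:  # each position queries every position (Q = K = V = G)
        scores = [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d) for k in G]
        weights = softmax(scores)
        Z.append([sum(w * v[c] for w, v in zip(weights, G)) for c in range(d)])
    return Z


Z = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

Each output vector is a convex combination of the inputs, with the largest weight on the positions most similar to the query position.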

Model Prediction: IndyLSTM-Attention
The IndyLSTM-attention model is used in this paper to train and output the prediction. The model consists of the IndyLSTM network, the attention network, and the fully connected network, which together decode the feature sequences and output the required prediction. In a common RNN decoding unit, generally only the last step of the output sequence is taken as the prediction result. However, the other steps of the sequence are also meaningful, and combining them through the attention mechanism may help achieve a better fit. Therefore, the Bahdanau attention [25] is added after the IndyLSTM layer.
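The idea of weighting every output step rather than keeping only the last one can be sketched with additive (Bahdanau-style) attention. This is a simplified form under stated assumptions: it scores each hidden state on its own, without the decoder-state query term of the original mechanism, and the parameters `W`, `b`, and `v` are hypothetical placeholders for learned weights.

```python
import math


def additive_attention_pool(H, W, b, v):
    """Pool a sequence of hidden states H into one context vector.

    score_t = v . tanh(W h_t + b); the weights are a softmax over the scores.
    """
    scores = []
    for h in H:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
                  for row, bias in zip(W, b)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * h[c] for w, h in zip(weights, H)) for c in range(len(H[0]))]
    return context, weights


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three time steps, two units
context, weights = additive_attention_pool(
    H, W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0], v=[1.0, 1.0]
)
```

The context vector would then be passed to the fully connected layer to produce the wear value.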

Experiments
In this section, an empirical evaluation is conducted to test the performance of the proposed model. The descriptions of the datasets and experimental setup are introduced in detail. The proposed model is compared with other common prediction methods to form the comparison results and discussion.

Descriptions of Datasets
Open datasets were used to verify the predictive performance of the model. They were collected from a ball end carbide milling cutter on a high-speed CNC machine operated under dry milling conditions [26]. Each training record contains one "wear" file that lists the flank wear values measured for three cutting edges after each cut in $10^{-3}$ mm and a folder with approximately 300 individual data acquisition files (one for each cut). The data acquisition files have seven columns of dynamometer, accelerometer, and acoustic emission data. The main equipment and cutting parameters are listed in Table 2.
In the experiment, six independent milling cutters (c1~c6) were used for the full tool life test. The force sensor, acceleration sensor, and acoustic emission sensor were used to collect signals, and seven channel signals were collected for each tool: force in three directions (X, Y, Z), vibration in three directions (X, Y, Z), and AE-RMS. During the experiment, all the sensor data were collected on a data acquisition card, which transmitted them to the computer. In the process of cutting the workpiece, the tool was stopped after every cut and a microscope was used to measure the wear in the X, Y, and Z directions. The test was terminated when the tool was severely worn out and could not work anymore. 315 samples were obtained during the test. The average of the flank wear in the three directions is taken as the true value for tool wear estimation, and the unit of the tool wear is $10^{-3}$ mm. In the given datasets, c1, c4, and c6 are training data with corresponding wear values, and c2, c3, and c5 are test data without wear values. Therefore, c1, c4, and c6 were selected for model verification. A leave-one-out scheme was adopted for cross-validation, using two datasets as the training set and the remaining one for verification. Three different test cases are thus created, denoted as C1, C4, and C6. The partition of datasets is shown in Table 3.
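The leave-one-out partition described above can be written out directly. The sketch below only enumerates the three train/test splits; the names are illustrative, and loading the actual data files is out of scope.

```python
CUTTERS = ["c1", "c4", "c6"]  # the three cutters that come with wear labels


def leave_one_out(cutters):
    """Yield (case name, training cutters, held-out cutter) for each split."""
    for held_out in cutters:
        train = [c for c in cutters if c != held_out]
        yield held_out.upper(), train, held_out


for case, train, test in leave_one_out(CUTTERS):
    print(case, "train:", train, "test:", test)
```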

Evaluation Index
In order to quantify the performance of all comparison methods, three commonly used evaluation indicators were selected to evaluate the regression loss: the mean absolute error (MAE), the root mean square error (RMSE), and the coefficient of determination ($R^2$ score). Among the selected functions, both the MAE and the $R^2$ score are relatively robust and insensitive to outliers and noise. By contrast, RMSE combines characteristics of MSE and MAE, is very sensitive to extremely large or small errors, and can drive the model toward the optimum.
MAE is the average of the absolute values of the errors. RMSE is the square root of the mean of all squared errors. These two metrics are calculated as follows:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}$$

where $y_i$ and $\hat{y}_i$ are the true and predicted tool wear values, and $N$ is the number of samples. The $R^2$ score is the coefficient of determination, reflecting how much of the fluctuation of $y$ can be described by the fluctuation of $x$. The $R^2$ score is at most 1, and it lies between 0 and 1 for any model that fits no worse than predicting the mean. The closer the value is to 1, the higher the degree of interpretation of the variable. The expression is as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2}$$

where $\bar{y}$ is the mean of the true values.
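The three indicators are simple to compute directly. The sketch below is a plain-Python illustration using the standard definitions of MAE, RMSE, and the coefficient of determination; the sample values are made up for the example and are not results from the paper.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]  # illustrative values only
y_pred = [1.0, 2.0, 3.0, 5.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2_score(y_true, y_pred))
```

A single error of 1 among four samples gives an MAE of 0.25 but an RMSE of 0.5, which shows how RMSE weights large errors more heavily.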

Evaluation Setup
The vibration sensor signals were chosen for modeling. Following the feature extraction procedure, signal analysis was performed on the vibration signals in three directions (X, Y, and Z), giving 147-dimensional (49 × 3) features in total. The 40 features with the highest correlation according to MIC were then selected, and the feature vectors were obtained after preprocessing. Meanwhile, the average of the wear values (in $10^{-3}$ mm) in the three directions after each cut was taken as the true wear value of the tool. The initial values of the parameters of the experimental models were set with reference to the pre-trained models. In the experiment, the parameters were adjusted one at a time, fixing the others and fine-tuning a single parameter until the optimal result was obtained. The specific structure and experimental parameter configurations of the model are shown in Table 4. The IndyLSTM-attention model was also compared with common neural networks including RNN, LSTM, GRU, IndRNN, and IndyLSTM. Each of these networks was used as the model that outputs tool wear values after training. The input vectors of these models were extracted by the CBGA network, and the number of hidden units in the recurrent cells was set to 128 for all of them. The loss function is the mean squared error (MSE). Stochastic gradient descent (SGD) is adopted as the optimizer to train the models. During training, an estimator was used to build the deep recurrent network model, which makes it easy to configure, train, and evaluate the various models.

Experimental Evaluation
This section shows the results of the comparison experiments and gives a brief analysis. The prediction curves of the proposed method for the three test sets are shown in Figure 5. In the figure, the broken line is the actual value of tool wear, the smooth curve is the predicted value of tool wear, and the bottom histogram is the error between the predicted value and the true value. Table 5 shows all the results of the common methods on the three test cases, including the RMSE, MAE, and $R^2$ score. The bold face indicates the best performance. Figure 6 shows the error area between the true wear values and the predicted values. The area chart clearly displays the error size of each prediction model. In order to show the comparison results more intuitively, the average performance of the six methods is calculated over the three test cases, as shown in Table 6. The average performance comparison histogram of the three cases is shown in Figure 7.
Overall, the proposed model has a good fitting effect, and the curves in Figure 5 basically match the true data. The prediction for C4 and C6 is better than for C1. According to the comparison results in Tables 5 and 6, the IndyLSTM-attention has the best performance on all three indicators.
It can be observed that the IndyLSTM and IndyLSTM-attention outperform the RNN, LSTM, and GRU neural networks in all three cases. This shows that independently recurrent long short-term memory networks perform better than traditional recurrent neural networks in this situation. Meanwhile, the comparison between the results of IndyLSTM and IndyLSTM-attention shows that all indicators improve to a certain extent. Therefore, it can be concluded that the accuracy of prediction can be improved by adding an attention layer to the predictive model.
The error area chart shows that the error area of the proposed model is the smallest. At the same time, the model has large fluctuations at the beginning and at the end of the tool life, while the error in the other stages is small. Finally, the histogram makes the differences between the models visible: the RMSE and MAE of the proposed model are much smaller than those of the other models.

Conclusions
In this paper, an IndyLSTM model with a self-attention mechanism is proposed to address a problem that existing deep learning methods ignore in intelligent tool wear monitoring: the different degrees of influence of the input features on tool wear. On the 2010 PHM Society Conference Data Challenge open datasets, the proposed model achieved better performance than common regression prediction methods on all three evaluation criteria (MAE, RMSE, and $R^2$ score). The experimental verification yields two main findings.
(1) By applying the self-attention mechanism in the deep feature extraction and tool wear prediction model to assign different weights to different input features, the performance of the prediction model for tool wear can be effectively improved.
(2) By combining prior experience, the feature selection method using the maximum information coefficient can effectively reduce redundant features, which improves modeling efficiency.
However, the features used in this paper are a combination of time domain, frequency domain, and deep learning features, of which the time and frequency domain features depend on prior knowledge. Future work will select better time and frequency domain features and better feature selection criteria to further improve the performance of the model.