Research on a Real-Time Monitoring Method for the Wear State of a Tool Based on a Convolutional Bidirectional LSTM Model

To monitor the tool wear state of computerized numerical control (CNC) machining equipment in real time in a manufacturing workshop, this paper proposes a real-time monitoring method based on a fusion of a convolutional neural network (CNN) and a bidirectional long short-term memory (BiLSTM) network with an attention mechanism (CABLSTM). In this method, the CNN extracts deep features from the input time-series signal, and a BiLSTM network with a symmetric structure then learns the temporal information between the feature vectors. The attention mechanism is introduced to adaptively perceive the network weights associated with the classification results of the wear state and distribute them reasonably. Finally, the weighted signal features are sent to a Softmax classifier to classify the tool wear state. In addition, a data acquisition platform is built with a high-precision CNC milling machine and an acceleration sensor to collect the vibration signals generated during tool processing in real time. The original data are fed directly into the deep neural network of the model for analysis, which avoids the complexity and limitations of manual feature extraction. The experimental results show that, compared with other deep learning neural networks and traditional machine learning models, the model can predict the tool wear state accurately in real time from the original sensor data, and its recognition accuracy and generalization are improved to a certain extent.


Introduction
As a critical component of intelligent manufacturing, intelligent mechanical fault diagnosis has become an essential part of "Made in China 2025" [1]. In mechanical processing, cutting is the most important means of manufacturing. At present, research in this field mainly focuses on the optimization of tool cutting parameters [2,3] and tool wear condition monitoring [4,5]. Real-time monitoring of the tool wear state is an essential part of the computerized numerical control (CNC) machining process in a manufacturing workshop. The wear state of a tool is affected by the processing procedures, workpiece materials, cutting parameters, and other factors, and the whole system exhibits strong nonlinearity and uncertainty. Tool wear directly degrades the machining accuracy and surface quality of the parts and seriously affects the overall stability and processing efficiency of the CNC machining equipment. Therefore, tool condition monitoring (TCM) technology is of great significance for ensuring processing quality and realizing continuous automatic processing [6][7][8][9].
To this end, this paper proposes a method for real-time monitoring of the tool wear state based on a CNN and bidirectional long short-term memory (BiLSTM) network model with an attention mechanism (CABLSTM). The sensor acquires the signals generated during tool processing in real time. These signals are fed directly into the CNN for parallel local feature extraction and then into the BiLSTM network, which extracts long-distance dependence information. The attention mechanism is used to calculate the network weights and distribute them reasonably. Finally, the weighted signal feature information is sent to a Softmax classifier to classify the tool wear state, which avoids the complexity and limitations of manual feature extraction. This method can meet the real-time and accuracy requirements of tool monitoring in actual industrial production.
The remainder of this paper is organized as follows. Section 2 presents the CABLSTM algorithm. Section 3 presents the monitoring process of tool wear. Section 4 presents the experimental results of the tool wear condition monitoring. Section 5 concludes the article.

CABLSTM Model
Inspired by the literature [24], this paper applies a fusion of a CNN and a recurrent neural network (RNN) to the real-time monitoring of the tool wear state and constructs two network models: convolutional long short-term memory (CLSTM) and convolutional bidirectional long short-term memory (CBLSTM). The fusion addresses the fact that a CNN alone ignores the correlation between time-series signals, and it avoids the vanishing and exploding gradient problems of a plain recurrent neural network. An attention mechanism is then introduced on the basis of the CBLSTM network, yielding the proposed CABLSTM network, which further improves the prediction accuracy of the model.
The CABLSTM model consists of four parts. The first part extracts local features from the signal at a single time step: a one-dimensional CNN performs neighborhood filtering with a sliding convolution window and produces the high-dimensional features of the single-time-step signal. The second part extracts temporal features: the BiLSTM network processes the high-dimensional features generated at consecutive time steps and gradually synthesizes a vector representation of the input signal. The third part uses the attention mechanism to calculate the importance distribution of the signal features over consecutive time steps and generates a signal feature representation weighted by the attention probability distribution. The fourth part is the classifier, which uses dropout to prevent overfitting and a Softmax classifier to predict the tool wear state. The neural network framework for real-time monitoring of the tool wear state based on CABLSTM is shown in Figure 2.

Local Feature Extraction of Single Time Step Timing Signals
The one-dimensional CNN can be applied to the time-series analysis of sensor data [24][25][26]. In the one-dimensional convolutional layer, multiple filters perform neighborhood filtering of the input time-series data, and the resulting feature maps are stacked to form the output feature map of the convolutional layer. The pooling layer then extracts fixed-length feature vectors from the feature maps of each candidate frame to reduce the feature dimension, thereby retaining the critical features in the time-series data and reducing the computational complexity of the network.
In this paper, a one-dimensional CNN was used to directly process the time-series signals generated during tool processing. The CNN includes two layers: a convolutional layer and a pooling layer. The convolutional layer performs neighborhood filtering of the time-series signals of each dimension using a one-dimensional convolution operation to generate feature maps, and each feature map can be regarded as the result of applying a different filter to the signal at the current time step [27]. When the input signal is $x$, the weight vector of the convolution kernel is $w$, the total number of samples is $m$, the size of the convolution kernel is $n$, and $*$ is the convolution operation, the output feature map $y$ of the convolutional layer can be expressed as follows:
$$y = x * w, \qquad y_t = \sum_{k=1}^{n} w_k \, x_{t-k+1}, \quad t = n, \ldots, m$$
In the convolutional layer, each neuron of layer $l$ is only connected to a local window of neurons in layer $l-1$, which forms a locally connected network. The calculation formula for the one-dimensional convolutional layer is as follows:
$$x_j^{l} = f\Big(\sum_{i \in M_j} x_i^{l-1} * w_{ij}^{l} + b_j^{l}\Big)$$
where $x_j^{l}$ is the $j$-th feature map of layer $l$, $f(\cdot)$ is the activation function, $M_j$ is the input feature vector, $x_i^{l-1}$ is the $i$-th feature map of layer $l-1$, $w_{ij}^{l}$ is a trainable convolution kernel, and $b_j^{l}$ is the bias parameter. Considering the convergence speed and overfitting problems, the rectified linear unit (ReLU) is chosen as the non-linear activation function in this paper: it converges faster, increases the sparsity of the network, reduces the interdependence of the parameters, and alleviates overfitting. The formula for the ReLU activation function is as follows:
$$a_i^{l+1}(j) = f\big(y_i^{l+1}(j)\big) = \max\{0,\; y_i^{l+1}(j)\}$$
where $y_i^{l+1}(j)$ is the output value of the convolution operation and $a_i^{l+1}(j)$ is the activation value of $y_i^{l+1}(j)$.
The convolutional layer is followed by a pooling layer that takes the local maximum or the local mean, namely, max pooling or mean pooling [28]. The pooling layer performs feature selection, which makes the features resistant to deformation; at the same time, it reduces the feature dimension, speeds up the network training, reduces the number of parameters, and improves the robustness of the features. In this paper, max pooling was used to obtain the maximum value of the feature points in the neighborhood. The formula is as follows:
$$P_i^{l+1}(j) = \max_{(j-1)w+1 \,\le\, t \,\le\, jw} \big\{ q_i^{l}(t) \big\}$$
where $q_i^{l}(t)$ is the value of the $t$-th neuron in the $i$-th feature vector of layer $l$ with $t \in [(j-1)w+1,\, jw]$, $w$ is the width of the pooled region, and $P_i^{l+1}(j)$ is the value of the corresponding neuron in layer $l+1$.
The one-dimensional CNN performs the feature extraction of the original data, so that the three-dimensional time-series signal is re-expressed as high-dimensional features, which facilitates the subsequent time-series feature extraction by the BiLSTM network. The basic structure of the one-dimensional CNN is shown in Figure 3.
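The convolution, ReLU, and max-pooling steps described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the kernel values, the toy signal, and the function names are hypothetical, and the filter is applied in the sliding-window (cross-correlation) form that deep learning libraries typically use, which differs from a textbook convolution only by a flip of the kernel.

```python
def conv1d_valid(x, w, b=0.0):
    """Slide kernel w over signal x and return one feature map (no padding)."""
    n = len(w)
    return [sum(w[k] * x[t + k] for k in range(n)) + b
            for t in range(len(x) - n + 1)]


def relu(y):
    """ReLU activation: max(0, y) applied element-wise."""
    return [max(0.0, value) for value in y]


def max_pool(a, width):
    """Non-overlapping max pooling over regions of the given width."""
    return [max(a[j:j + width]) for j in range(0, len(a) - width + 1, width)]


# Hypothetical toy signal and kernel, chosen only to make the steps visible.
signal = [0.0, 1.0, -2.0, 3.0, 1.0, -1.0, 2.0, 0.0]
kernel = [1.0, 0.0, -1.0]

feature_map = conv1d_valid(signal, kernel)   # [2.0, -2.0, -3.0, 4.0, -1.0, -1.0]
activated = relu(feature_map)                # [2.0, 0.0, 0.0, 4.0, 0.0, 0.0]
pooled = max_pool(activated, width=2)        # [2.0, 4.0, 0.0]
print(pooled)
```

In a real network, many such kernels are learned and their feature maps are stacked; the sketch shows a single filter only.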

Time-Series Feature Extraction of Time-Series Signals
Long short-term memory (LSTM) is a special self-connected recurrent neural network (RNN). LSTM introduces gate functions that create paths along which the gradient can flow continuously for a long time, which effectively alleviates the vanishing and exploding gradient problems caused by the chain rule in the gradient calculation of the hidden layers of an RNN [29]. LSTM can capture temporal variation over relatively long intervals in a time series, and it is particularly suited to processing time-series data. The original signal generated during tool processing is a time series, so the LSTM network can encode it and mine its variation over relatively long intervals [30]. To ensure that the real-time monitoring model of tool wear can better learn the dependence of time-series features between time-series signals and improve the accuracy of the model classification, this paper extends the existing LSTM network [31] and builds a BiLSTM network with a symmetric structure by constructing LSTM networks in two directions [32]. At the same time, an attention layer is added to the BiLSTM network, which enables the model both to extract temporal signal features in the forward and backward directions and to selectively learn the critical information in the signal features.
The constructed BiLSTM network contained 256 neurons in this paper; the forward and reverse LSTM networks each consisted of 128 neurons. Each BiLSTM neuron included an input gate, a forget gate, and an output gate, which are represented by $i$, $f$, and $o$, respectively. The internal structure of the BiLSTM neurons is shown in Figure 4.
The input gate $i$ controls how much of the current network input $x_t$ is saved to the memory unit $C_t$: the sigmoid function determines which new information is saved, the tanh function generates a new candidate vector $\tilde{C}_t$, and the selected information is written to the memory unit to complete the update. The forget gate $f$ controls the self-connection of the unit: it filters the information in the memory unit $C_{t-1}$ at the previous moment to determine how much valid information is retained in the current memory unit $C_t$ and forgets the useless information. The output gate $o$ controls the influence of the memory unit $C_t$ on the current output value $h_t$ and determines how much information the memory unit $C_t$ outputs at time step $t$. The formulas are as follows:
$$i_t = \sigma\big(W_i \cdot [h_{t-1}, x_t] + b_i\big)$$
$$f_t = \sigma\big(W_f \cdot [h_{t-1}, x_t] + b_f\big)$$
$$o_t = \sigma\big(W_o \cdot [h_{t-1}, x_t] + b_o\big)$$
$$\tilde{C}_t = \tanh\big(W_C \cdot [h_{t-1}, x_t] + b_C\big)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$h_t = o_t \odot \tanh(C_t)$$
where $C$ is the memory unit, which is called the cell state, $C_t$ is the memory cell state at time step $t$, $\tilde{C}_t$ is the candidate vector of the memory cell at time step $t$, $x_t$ is the input vector at time step $t$, $h_t$ is the output vector at time step $t$, $W$ is the weight vector of the network, $b$ is the offset vector, $\odot$ represents element-wise multiplication of vectors, $\sigma(\cdot)$ is the sigmoid function, and tanh is the hyperbolic tangent activation function.
For the high-dimensional feature of the input signal at time step $t$, the forward LSTM network outputs the vector $\overrightarrow{h}_t$, the reverse LSTM network outputs the vector $\overleftarrow{h}_t$, and the BiLSTM network outputs the feature vector $P_t$. The formula is as follows:
$$P_t = \big[\overrightarrow{h}_t,\; \overleftarrow{h}_t\big]$$
In this paper, the attention mechanism was used to assign weights to the output vector of the BiLSTM layer at each time step by assigning different initialization probability weights. Finally, the values were calculated by the sigmoid function. The attention mechanism selectively filters a large number of signal features and focuses on the critical information. The focusing process is embodied in the calculation of the weight coefficients: different weights are allocated to different pieces of information, and the proportion of critical information is enhanced by raising its weights, which reduces the loss of critical information in long time-series signals. The calculation formula for the attention mechanism [30] is as follows:
$$u_t = \tanh(W_s P_t + b_s)$$
$$\alpha_t = \mathrm{softmax}\big(u_t^{\top}, u_s\big) = \frac{\exp\big(u_t^{\top} u_s\big)}{\sum_{t} \exp\big(u_t^{\top} u_s\big)}$$
$$v = \sum_{t} \alpha_t P_t$$
where $P_t$ is the output feature vector of the BiLSTM layer at time step $t$, $u_t$ is the hidden layer representation of $P_t$ through the neural network layer with weight $W_s$ and bias $b_s$, $u_s$ is the randomly initialized context vector, $\alpha_t$ is the importance weight of $u_t$ normalized by the Softmax function, and $v$ is the final feature vector of the signal. $u_s$ is generated randomly during the training process, and finally, the output value $v$ of the attention layer is mapped via the Softmax function to obtain a real-time classification result of the tool wear state. The partial expansion of the BiLSTM network model with the attention mechanism along the time axis is shown in Figure 5.
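As a rough illustration of how the gate equations, the bidirectional concatenation, and the attention pooling fit together, the following pure-Python sketch runs a tiny BiLSTM with attention over a random sequence. It is a sketch under stated assumptions, not the authors' model: the sizes (2 hidden units per direction instead of 128), the random weights, the tanh scoring layer, and all function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, z):
    """Return W z + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, z)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps 'i', 'f', 'o', 'c' to (W, b) pairs."""
    z = h_prev + x_t                                      # concatenation [h_{t-1}, x_t]
    i = [sigmoid(a) for a in affine(*params["i"], z)]     # input gate
    f = [sigmoid(a) for a in affine(*params["f"], z)]     # forget gate
    o = [sigmoid(a) for a in affine(*params["o"], z)]     # output gate
    c_tilde = [math.tanh(a) for a in affine(*params["c"], z)]
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t


def run_lstm(xs, params, hidden):
    """Run one direction over the sequence and collect h_t for every step."""
    h, c, outputs = [0.0] * hidden, [0.0] * hidden, []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, params)
        outputs.append(h)
    return outputs


def bilstm(xs, fwd, bwd, hidden):
    """P_t is the concatenation of the forward and backward outputs at step t."""
    forward = run_lstm(xs, fwd, hidden)
    backward = run_lstm(xs[::-1], bwd, hidden)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def attention(P, W_s, b_s, u_s):
    """Score each P_t against the context vector u_s and return (alpha, v)."""
    scores = []
    for p in P:
        u_t = [math.tanh(a) for a in affine(W_s, b_s, p)]
        scores.append(sum(u * c for u, c in zip(u_t, u_s)))
    top = max(scores)                                     # for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    v = [sum(a * p[d] for a, p in zip(alpha, P)) for d in range(len(P[0]))]
    return alpha, v


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def rand_lstm_params(rng, n_in, hidden):
    return {g: (rand_matrix(rng, hidden, hidden + n_in), [0.0] * hidden) for g in "ifoc"}


# Hypothetical toy sizes: 5 time steps, 3 input features, 2 hidden units per direction.
rng = random.Random(0)
n_in, hidden, attn_dim, T = 3, 2, 4, 5
xs = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(T)]
P = bilstm(xs, rand_lstm_params(rng, n_in, hidden), rand_lstm_params(rng, n_in, hidden), hidden)
u_s = [rng.uniform(-0.5, 0.5) for _ in range(attn_dim)]
alpha, v = attention(P, rand_matrix(rng, attn_dim, 2 * hidden), [0.0] * attn_dim, u_s)
print(len(P), len(P[0]), round(sum(alpha), 6), len(v))
```

The attention weights sum to 1 across the time steps, so the vector v is a weighted average of the BiLSTM outputs; in training, the weights and the context vector would be learned rather than drawn at random.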

Network Model Training
In this paper, dropout was introduced into the real-time monitoring model of the tool wear state to prevent the model from overfitting during training. The output layer of the network model uses the Softmax activation function, and the loss function is the categorical cross-entropy (Categorical_crossentropy), which was used to classify the wear features of the acquired time-series signals. The Softmax output $y$ is a vector whose dimension is the number of categories; each component has a value in $[0, 1]$, the components sum to 1, and each component is the probability that the tool wear state belongs to the corresponding category. The formula is as follows:
$$L = -\frac{1}{n}\sum_{i=1}^{n}\big(\hat{y}_{i1}\log y_{i1} + \hat{y}_{i2}\log y_{i2} + \cdots + \hat{y}_{im}\log y_{im}\big)$$
where $m$ is the number of classes, $n$ is the number of samples, $\hat{y}_{im}$ is the $m$-th value of the true category label vector of the tool wear state for sample $i$, and $y_{im}$ is the $m$-th value of the output vector $y$ of the Softmax classifier for sample $i$. The cross-entropy errors were averaged to obtain the loss function of the model. The Adam method was used to minimize the objective function when training the model. The Adam method is essentially the RMSprop method with a momentum term. It dynamically adjusts the learning rate of each parameter by using first-order and second-order moment estimates of the gradient. Its main advantage is that, after bias correction, the learning rate of each iteration stays within a definite range, which keeps the parameter changes relatively stable.
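The averaged cross-entropy loss and a single Adam update can be sketched in plain Python as follows. This is illustrative only: the three-class example, the hyperparameter values (the commonly used Adam defaults), and the function names are assumptions and are not taken from the paper.

```python
import math


def softmax(z):
    """Map raw scores to probabilities that sum to 1."""
    top = max(z)
    exps = [math.exp(value - top) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


def mean_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average over n samples of -sum_k yhat_ik * log(y_ik)."""
    n = len(y_true)
    total = sum(t * math.log(p + eps)
                for labels, probs in zip(y_true, y_pred)
                for t, p in zip(labels, probs))
    return -total / n


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (hyperparameters are assumed defaults)."""
    m = beta1 * m + (1 - beta1) * grad            # first-order moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad     # second-order moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# Hypothetical example with three wear classes and two samples.
labels = [[1, 0, 0], [0, 1, 0]]
probabilities = [softmax([2.0, 0.5, 0.1]), softmax([0.2, 1.5, 0.3])]
loss = mean_cross_entropy(labels, probabilities)

# First Adam step on a scalar parameter with gradient 4.0.
theta, m, v = adam_step(theta=1.0, grad=4.0, m=0.0, v=0.0, t=1)
print(loss, theta)
```

After bias correction, the first step has a magnitude close to the learning rate regardless of the gradient scale, which is the bounded-step behavior the paragraph above describes.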

Real-Time Monitoring Method of the Tool Wear State
An acceleration sensor is used to collect, in real time, the vibration signal generated by a computerized numerical control (CNC) machining device while it machines the workpiece. The input of the real-time monitoring model of the tool wear state is the $\alpha_x$, $\alpha_y$, and $\alpha_z$ vibration signals, and the output of the model is the predicted tool wear state. In this paper, the original vibration signal generated by each milling cutter feed was sampled continuously and then cut into segments of 2000 sampling points, forming multiple tensors (3 × 2000) that were taken as the input data of the DL neural network. The schematic diagram of the CABLSTM network is shown in Figure 6. The CBLSTM network has no attention block, while the CLSTM network is similar to the CBLSTM network but has an LSTM block instead of a BiLSTM block.
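The segmentation of the three-axis vibration signal into (3 × 2000) inputs could look like the following sketch. The function name and the synthetic signals are hypothetical, and two details are assumptions because the text does not state them: the segments do not overlap, and an incomplete trailing segment is dropped.

```python
import math


def segment_signal(ax, ay, az, length=2000):
    """Cut three equally long axis signals into non-overlapping (3 x length) segments."""
    if not (len(ax) == len(ay) == len(az)):
        raise ValueError("axis signals must have the same number of samples")
    n_segments = len(ax) // length          # assumption: drop the incomplete tail
    return [[axis[k * length:(k + 1) * length] for axis in (ax, ay, az)]
            for k in range(n_segments)]


# Synthetic stand-in for one recording: 5000 samples per axis.
n_samples = 5000
ax = [math.sin(0.01 * t) for t in range(n_samples)]
ay = [math.cos(0.01 * t) for t in range(n_samples)]
az = [0.5 * math.sin(0.02 * t) for t in range(n_samples)]

segments = segment_signal(ax, ay, az)
print(len(segments), len(segments[0]), len(segments[0][0]))
```

At the 20 kHz sampling frequency reported for the experiment, each 2000-point segment spans 2000 / 20000 = 0.1 s of machining.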
The input data of the CABLSTM network included the time-series signal (data type) and the wear classification (label type). The feature extraction and expression of the time-series signal were achieved by two convolution layers, one pooling layer, one flatten layer, one BiLSTM layer, one attention layer, and two fully-connected layers. The parameters of each layer of the network are shown in Table 1.

Experimental Design
A real-time monitoring system for the tool wear state includes a condition monitoring facility and a data analysis facility. The condition monitoring facility includes the basic equipment used to process the workpiece, the equipment to collect the vibration signals generated during the processing, and the equipment to measure the value of tool wear. The data analysis facility includes high-performance computers and DL platforms for analyzing and processing the data and for classifying and reporting the tool wear state in real time.

Condition Monitoring
The experimental platform of this paper was provided by the Engineering Training Center of Guizhou University. A high-precision CNC vertical milling machine (Model: VM600) was used to mill the workpiece. No coolant was added during milling. The workpiece material was S136 steel. The milling tool was a cemented carbide 4-edge milling cutter whose surface was covered with layers of a titanium aluminum nitride coating. The diameter of the tool was 6 mm, the rake angle was 4°, the clearance angle was 8°, and the helix angle was 30°. The cutting parameters of the milling experiment are shown in Table 2.

The experimental platform of this paper was provided by the Engineering Training Center of Guizhou University. A high-precision CNC vertical milling machine (Model: VM600) was used for the milling workpiece. No coolant was added during milling. The workpiece was milled steel (S136). The milling tool had a cemented carbide 4-edge milling cutter, and its surface was covered with layers of a titanium aluminum nitride coating. The diameter of the tool was 6 mm, the rake angle was 4 • , the clearance angle was 8 • , and the helix angle was 30 • . The cutting parameters of the milling experiment are shown in Table 2. In the experiment, three accelerometers (Model: INV9822; Range: ±50 g) were magnetically attracted to the machine tool fixture in the x, y, and z directions for real-time acquisition of the original vibration signals generated during tool machining. A high-precision digital acquisition instrument (model: INV3018CT) from the Beijing Oriental Institute of Vibration and Noise was used to process the real-time signals and transmit them to a computer. The sampling frequency of the signal was 20 kHz, 200 mm of milling in each direction of the tool was recorded as a milling stroke, and each tool was milled for 330 strokes. After each milling stroke, the milling cutter was removed from the milling machine and photographed. A pre-calibrated high-precision digital microscope (EVDM-101) was used for the measurement, the optical magnification was 0.7×-4.5×, the electronic magnification was 35×-235×, and the measuring accuracy was 0.1 µm. During the measurement process, the position of the wear zone of the minor flank surface of the milling cutter, which was the most easily worn, was selected as the measurement position, and the same reference line was taken as the standard to ensure that the position remains unchanged during the measurement. 
The wear value (VBmax) was calculated by subtracting the current cutting edge length from the initial length of the cutting edge of the milling cutter. The real-time monitoring experimental device of the tool wear state is shown in Figure 7.

Data Analysis
The DL hardware platform of the experiment was a high-performance server with an Intel Xeon E5-2650 processor (2.3 GHz), 256 GB of memory, and an NVIDIA GeForce TITAN X graphics processing unit (GPU). The software platform was the Ubuntu 16.04.4 operating system, with Keras as the front end of the deep learning framework and TensorFlow as the back end for data analysis.
The milling operation was carried out with four milling cutters (C1, C2, C3, and C4). Each milling cutter performed 330 milling strokes, so 1320 original signal samples were obtained. The data of three milling cutters (C1, C2, and C3) were used for the training set and verification set of the model, and the data of the remaining milling cutter (C4) were used for the test set. The training set was used to fit the model to the data samples, the verification set was used to adjust the hyperparameters and give an initial evaluation of the model, and the test set was used to evaluate the generalization ability of the final model. In the DL training process, a sufficient number of samples is needed to improve the learning quality of the neural network. The original processed signals were long sequences of periodic time-series signals. According to the principle of signal sampling, 100,000 consecutive points were taken from each sample and, after data normalization, cut into 50 short time-series signals with a length of 2000 for model input, which reduced the computational intensity of the network training. At the same time, this data expansion increased the amount of experimental data relative to the original data set, improved the robustness of the network, and reduced the risk of overfitting.
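The cutting step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `segment_and_normalize` is hypothetical, the record is synthetic, and per-axis z-score normalization over the whole record is an assumption, since the text only states that the data were normalized.

```python
import numpy as np

def segment_and_normalize(signal, window=2000):
    """Split a (n_points, n_axes) record into non-overlapping windows.

    Assumption: each axis is z-scored over the whole record before cutting;
    the paper does not specify the normalization method.
    """
    signal = np.asarray(signal, dtype=np.float64)
    mean = signal.mean(axis=0)
    std = signal.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero on a constant channel
    normalized = (signal - mean) / std
    n_windows = signal.shape[0] // window
    trimmed = normalized[: n_windows * window]  # drop any incomplete tail
    return trimmed.reshape(n_windows, window, signal.shape[1])

# Synthetic stand-in for one milling stroke: 100,000 points on the x, y, z axes.
rng = np.random.default_rng(0)
record = rng.normal(size=(100_000, 3))
segments = segment_and_normalize(record)
print(segments.shape)  # (50, 2000, 3)
```

With 1320 strokes and 50 segments per stroke, this cutting scheme would yield 66,000 short sequences in total.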
The processing conditions in the experiment had the following characteristics: 1. finish milling with a small back engagement was performed; 2. the workpiece was S136 steel with high hardness after heat treatment; and 3. the experiment needed to produce a tool data set quickly and accurately. This paper referred to references [33][34][35] and the measurement methods for milling tool wear used in the 2010 prognostics and health management (PHM) competition. The following method was used as the bluntness criterion for the milling cutter in this experiment: the maximum value (VBmax) of the wear zone of the minor flank surface of the milling cutter was selected as the quantified value reflecting the wear state. It was specified that the milling cutter had failed when its wear value was greater than 0.13 mm. The wear process of the milling cutters (C1, C2, C3, and C4) is shown in Figure 8.
Each sample contains three-dimensional vibration signals and the flank wear values of the four cutting edges. To prevent mutual interference among the different wear values, the maximum wear value of the four cutting edges was selected as the label of the milling stroke. The wear state of the tool was divided into initial wear, normal wear, and rapid wear. In this paper, the wear state was defined according to the actual wear curve of each milling cutter, which was used to determine the wear degree of the tool. The tool wear degree was divided into three types of label data, and the labels were converted to a one-hot encoding to facilitate the classification of the final tool wear state. The classification of the final tool wear state is shown in Table 3.
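The labelling procedure described above (maximum of the four wear values, three wear states, one-hot encoding) could look roughly like the sketch below. The function `stroke_label` and the two boundary values are hypothetical placeholders: the paper defines the states from each cutter's actual wear curve (Table 3), and those boundaries are not reproduced here.

```python
STATES = ("initial wear", "normal wear", "rapid wear")

def stroke_label(edge_wear_mm, initial_limit, rapid_limit):
    """Return (VBmax, state index, one-hot vector) for one milling stroke.

    initial_limit and rapid_limit are assumed VBmax boundaries in mm; in the
    paper the boundaries come from each cutter's measured wear curve.
    """
    vb_max = max(edge_wear_mm)  # worst of the four cutting edges
    if vb_max < initial_limit:
        index = 0
    elif vb_max < rapid_limit:
        index = 1
    else:
        index = 2
    one_hot = [1 if i == index else 0 for i in range(len(STATES))]
    return vb_max, index, one_hot

# Illustrative wear values (mm) and placeholder boundaries, not measured data.
vb_max, index, one_hot = stroke_label([0.061, 0.074, 0.058, 0.069], 0.05, 0.10)
print(vb_max, STATES[index], one_hot)  # 0.074 normal wear [0, 1, 0]
```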

Comparison of the Experimental Results of the Deep Learning Model
The original signal generated by the milling process was sampled and then fed into the DL neural network model. The model adaptively extracted the high-dimensional features implied in the time-series signal and calculated the error between its actual output value and the real value. The Adam algorithm was used to reduce this error, and the network weights were continuously updated so that the actual output of the model moved closer to the real value. To further verify the performance of the proposed algorithm, we implemented the CNN model of the bearing fault diagnosis algorithm in [25] and the BiLSTM model of the turbofan engine life prediction algorithm in [26]. These models were compared with our proposed CLSTM, CBLSTM, and CABLSTM networks. The five models were trained with the same training parameters, which are shown in Table 4. After the training and verification of the DL neural networks, different loss function values and accuracies were obtained. The loss function values of the CNN [25], BiLSTM [26], CLSTM, CBLSTM, and CABLSTM models and the accuracy of the verification set are shown in Figures 9-13, where the x axis represents the number of iterations over the milling data set, and the two y axes represent the loss function value and the model verification accuracy.

It can be concluded from the figures that the loss function value on the training set of each network model decreased as the number of iterations increased and finally stabilized. The loss function value on the verification set fluctuated periodically, and the loss function of the CLSTM model fluctuated with a large amplitude. The CNN, BiLSTM, CBLSTM, and CABLSTM models were relatively stable: the overall trend of the loss function was decreasing and finally converging, no gradient explosion or vanishing occurred, and the networks converged quickly. The accuracy rates of the CNN and BiLSTM models on the verification set were 87.57% and 86.36%, respectively, so their prediction accuracy was comparatively low. This result indicates that an individual DL network could predict the tool wear state, but the deeper features hidden in the tool vibration signal could not be captured because of the limited capability of the network model. The network models proposed in this paper were superior to the CNN and BiLSTM networks because their structure was relatively deep, which is conducive to mining deeper features. First, the CNN was used to extract the local features of the time-series signals, which could effectively filter the noise in the original signal. At the same time, the length of the time-series signal was reduced, which facilitated the subsequent learning of the time-series characteristics by the recurrent network and improved the prediction ability of the model.
Among the network models proposed in this paper, the CABLSTM model had the best performance; it was superior to the CLSTM and CBLSTM models and achieved high prediction accuracy. The initial prediction accuracy of the CLSTM model was relatively low. After 65 iterations, its accuracy on the verification set was basically stable above 96%, and the accuracy was 96.42% after 100 iterations. The CBLSTM model used a bidirectional LSTM network to access past and future information; that is, it could extract time-series signal features in both the forward and reverse directions and thus extract richer information features. After 42 iterations, its accuracy on the verification set was basically stable above 96%, and the accuracy was 97.04% after 100 iterations. The CABLSTM model introduced the attention mechanism on the basis of CBLSTM, which selectively filtered the key information out of a large amount of information and focused on it, reducing the loss of key information features in long sequences. After 35 iterations, its accuracy on the verification set was basically stable above 96%, the accuracy was 97.50% after 100 iterations, the loss function value reached 0.0651, and the network stability was higher. The loss function and the accuracy of the verification set and test set are shown in Table 5.
The data of the milling cutter (C4) were selected as the test set of the DL network model to evaluate the generalization ability of the final model. The total number of test samples was 330, including 23 initial wear samples, 232 normal wear samples, and 75 rapid wear samples. The samples were fed into the trained DL network model in random order. The CABLSTM model had high precision and recall. The F1-score reaches its best value at 1 (perfect precision and recall) and its worst at 0. The F1-score in this paper was 0.9697. The evaluation indices of the CABLSTM model are shown in Table 6. The test results show that the CABLSTM model proposed in this paper had a strong generalization ability. Although its test time was longer than that of some of the comparison models, the algorithm struck a good balance between time and precision.
It can be concluded from the figure that the CABLSTM model proposed in this paper classified the wear state of the milling cutter (C4) with an accuracy of 96.97%. The predictions for normal wear were the most accurate. There were some deviations for initial wear and rapid wear, but they were within a reasonable range. The incorrect predictions mainly occurred in the transition stages between wear degrees. This is because the tool was in the normal wear state for a long time during machining, so the amount of data available to the model was relatively large and the features were relatively distinct; in contrast, the periods of initial wear and rapid wear were short, and the amount of data that could be obtained was insufficient. The confusion matrix of the wear test results of the tool test set is shown in Figure 14.
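To make the evaluation indices concrete, the sketch below derives accuracy and per-class precision, recall, and F1 from a three-class confusion matrix. The row totals (23, 232, 75) and the overall accuracy follow the text, but the individual off-diagonal counts are invented for illustration and are not the values of Figure 14; the function name `evaluate` is likewise hypothetical.

```python
def evaluate(cm):
    """Accuracy plus per-class (precision, recall, F1) from a confusion matrix.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    per_class = []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[i][k] for i in range(n))  # column sum
        actual = sum(cm[k])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f1))
    return accuracy, per_class

# Hypothetical counts: rows are true initial, normal, and rapid wear.
cm = [
    [20, 3, 0],
    [1, 229, 2],
    [0, 4, 71],
]
accuracy, per_class = evaluate(cm)
print(round(accuracy, 4))  # 0.9697
```

For single-label multi-class problems, micro-averaged precision, recall, and F1 all equal the accuracy, which is one reading consistent with the reported F1-score of 0.9697; the paper does not state which averaging it used, so this is an assumption.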
When the real-time monitoring system of the tool wear state is working, the acceleration sensors deliver a three-axis vibration signal of length 2000 to the CABLSTM monitoring model. The model performs a forward calculation to identify the current tool wear state, thereby achieving real-time monitoring of the tool wear state.
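The monitoring loop described above can be sketched as a simple buffer that hands each full window of 2000 three-axis samples to a classifier. The class `WearMonitor` and the function `stub_classifier` are hypothetical; the stub merely stands in for the forward pass of the trained network, and non-overlapping windows are an assumption.

```python
WINDOW = 2000            # samples per model input (from the text)
SAMPLE_RATE_HZ = 20_000  # accelerometer sampling frequency (from the text)

class WearMonitor:
    """Collects three-axis samples and classifies each full window."""

    def __init__(self, classify, window=WINDOW):
        self.classify = classify  # hypothetical stand-in for the CABLSTM forward pass
        self.window = window
        self.buffer = []

    def push(self, sample):
        """Add one (x, y, z) sample; return a wear state once a window is full."""
        self.buffer.append(sample)
        if len(self.buffer) < self.window:
            return None
        state = self.classify(self.buffer)
        self.buffer = []  # start the next non-overlapping window
        return state

def stub_classifier(window):
    # Placeholder only: a real system would call the trained network here.
    return "normal wear"

monitor = WearMonitor(stub_classifier)
results = [monitor.push((0.0, 0.0, 0.0)) for _ in range(2 * WINDOW)]
states = [r for r in results if r is not None]
window_seconds = WINDOW / SAMPLE_RATE_HZ
print(len(states), window_seconds)  # 2 0.1
```

At 20 kHz, one window of 2000 points spans 0.1 s, so a per-sample test time of a few milliseconds, such as the 6 ms reported for the CABLSTM model, would leave headroom before the next window is complete.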

Comparison of Deep Learning and Machine Learning
To further validate the feasibility of the proposed model, a comparative experiment was designed with alternative ML models, using the same data set as for DL. More specifically, models commonly used in traditional tool wear value detection approaches, including the BPNN, the SVM, the HMM, and the FNN, were compared with the CABLSTM model proposed in this paper. The wavelet threshold denoising method was used to reduce the noise in the original signal collected by the acceleration sensor. Features in the time domain, frequency domain, and time-frequency domain were then extracted; the specific extraction methods are shown in Table 7. Pearson's correlation coefficient (PCC) was used to measure the correlation between each feature and the wear value, and the features with a correlation coefficient greater than 0.9 were selected to achieve feature dimensionality reduction. The selected features were used as the input of the ML models. It can be concluded from Table 8 that the accuracy of the traditional ML models varied greatly, which was due to the instability of the manually extracted features and the influence of the model construction on the prediction results. The DL model proposed in this paper could achieve ideal results without data pre-processing by adaptively extracting hidden high-dimensional features from the tool processing signals and using a reasonable network depth. Its prediction accuracy was significantly higher than that of the BPNN, SVM, and HMM. The prediction accuracy of the FNN, however, reached 94.24% because the FNN used a neural network to learn the rules of the fuzzy system: according to the input and output learning samples, the design parameters of the fuzzy system were automatically designed and adjusted to realize the self-learning and adaptive functions of the fuzzy system.
Compared with the other algorithm models, the proposed method demonstrated a clear improvement in performance. The CABLSTM model could test a sample in as little as 6 ms, which could meet the requirements of real-time tool wear monitoring in industrial production. The accuracy of ML and DL prediction is shown in Table 8.
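The PCC-based feature selection used for the ML baselines can be illustrated with the short sketch below. The feature names and numbers are invented toy values, and the helper names `pearson` and `select_features` are hypothetical; only the rule itself (keep features whose correlation with the wear value exceeds 0.9) comes from the text.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def select_features(features, wear, threshold=0.9):
    """Keep the names of features whose PCC with the wear value exceeds threshold."""
    return [name for name, values in features.items()
            if pearson(values, wear) > threshold]

# Toy values for illustration only; not taken from the experiment.
wear = [0.02, 0.05, 0.08, 0.11, 0.14]
features = {
    "rms": [1.0, 1.6, 2.1, 2.9, 3.4],
    "kurtosis": [3.1, 2.9, 3.3, 3.0, 3.2],
}
print(select_features(features, wear))  # ['rms']
```

If strongly negative correlations should also count as informative, the comparison could use the absolute value of the coefficient instead; the text specifies only "greater than 0.9".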

Conclusions
In this paper, we applied a fusion of a CNN and an RNN to the real-time monitoring of the tool wear state and modified the network parameters and structure according to the characteristics of the vibration signals. The prediction accuracy of the CABLSTM reached 96.97%. In the pre-processing stage, the wear state of the tool was defined according to the actual wear curve, which was used to determine the wear degree of the tool and improve the accuracy of the data label classification. At the same time, the data expansion method was employed to enlarge the original data set and improve the robustness of the algorithm. A one-dimensional CNN was used to extract the local features, and abundant high-dimensional features were extracted from the original signal, which avoided the limitations of traditional manual feature extraction, better characterized the tool wear state information hidden in the original signal, and shortened the network model training time. The attention mechanism was then introduced into the improved CBLSTM network model, which effectively improved the recognition accuracy and generalization performance of the real-time monitoring. The experimental results show that the CABLSTM model had certain advantages in the real-time monitoring of tool wear and could meet industrial requirements in terms of recognition accuracy and recognition speed.
In actual manufacturing, the processing procedures and site conditions are often complicated and variable, and many features can reflect the wear state of a tool. In this paper, only the original signal collected by the acceleration sensor was used as the tool wear monitoring index, and the method was restricted by the training data volume and processing method; it might therefore not meet the requirements of arbitrary working conditions. In future work, multi-source data fusion technology and DL theory will be used to further study the information characterizing the wear state of the tool, improve the proposed method, and extend it to industrial monitoring.