Tool-Life Estimation Model in Milling Processes Using Multi-Head Cross-Covariance Attention Fusion-Based Dilated Dense Bi-Directional Gated Recurrent Unit

Alkhalefah, Hisham

doi:10.3390/math13233798


by Hisham Alkhalefah ¹,²

¹ Advanced Manufacturing Institute, King Saud University, P.O. Box 800, Riyadh 11421, Saudi Arabia

² Industrial Engineering Department, College of Engineering, King Saud University, P.O. Box 800, Riyadh 11421, Saudi Arabia

Mathematics 2025, 13(23), 3798; https://doi.org/10.3390/math13233798

Submission received: 28 October 2025 / Revised: 19 November 2025 / Accepted: 22 November 2025 / Published: 26 November 2025

(This article belongs to the Special Issue Artificial Intelligence for Fault Detection in Manufacturing)


Abstract

In milling, estimating the life and availability of the tool is essential for achieving reliable, optimized results at lower cost. Because milling tools wear inherently, their condition must be monitored during the process. Traditionally, tool condition was assessed by visual inspection, which is a complex and specialized task, so more effective monitoring approaches are needed. In the manufacturing and automation industry, deteriorated milling tools lead to several challenges, including a decline in product quality, reduced equipment utilization, and increased costs. Tool wear prediction is challenging and complex because it depends on several variables. Existing tool condition monitoring frameworks typically fall short in real-time prediction and accuracy. Hence, in this research, a tool-life estimation model is developed using deep learning techniques to minimize unexpected failures during the milling process. Initially, the data are collected from benchmark sources. Statistical features, deep features obtained via fuzzy autoencoders (FAEs), and t-Distributed Stochastic Neighbor Embedding (t-SNE)-based features are extracted from the input data to capture various information related to the machine. These features are passed to the proposed multi-head cross-covariance attention fusion-based dilated dense bi-directional gated recurrent unit (MCF-DD-BiGRU) for accurate prediction of tool life. The input features are fused using a multi-head cross-covariance attention mechanism to enhance the representation of interdependencies among features. The DD-BiGRU network then processes the fused features to improve the accuracy of tool-life prediction for milling machines. The prediction efficiency of the implemented model is compared with that of existing models to verify its effectiveness.

Keywords:

tool-life estimation; feature extraction; machine learning; prediction; multi-head cross-covariance attention fusion; dilated dense bi-directional gated recurrent unit

MSC:

68T07

1. Introduction

In the automation and manufacturing industries, machine tool wear is a common issue that can result in higher costs, lower-quality outcomes, and reduced equipment utilization. Efficient and accurate prediction of machine tool wear is therefore essential for maintaining productivity and machine health, yet it remains challenging [1]. Tool condition monitoring covers tasks such as fault detection, fault type determination, and estimation of Remaining Useful Life (RUL), which is determined using prognostic methods [2]. In recent times, prognostics has emerged as a research field for identifying tool breakage. The prognostic model takes sensory data as input and treats the task as a health assessment problem [3]. In general, prognosis proceeds through steps such as identifying failure indicators, constructing the health index, and estimating the present state. The main target of prognostics is the prediction of RUL, that is, the time for which the machine can work safely without breakage [4]. The RUL of the machine is estimated from the health index of the tool wear. Severe tool wear can compromise product quality, resulting in a higher rejection rate and potentially causing accidents in machine tools [5]. Effective tool management improves production efficiency and helps minimize maintenance and operational costs. However, an overly protective strategy does not fully leverage the value of the tool, and unnecessary downtime caused by tool changes increases time consumption [6].

In the machining process, the most active component is the cutting tool, which inevitably wears out as it separates the metal material, ultimately causing failure. The cutting tool must be replaced before machining quality can no longer be guaranteed [7]. Accurate forecasting of tool RUL not only determines when to replace the tool but also extends its lifetime, saving machining costs and reducing failures [8]. During machining, the workpiece is in contact with the cutting tool, so the quality of the outcome is directly influenced by the degree of wear on the tool. Scheduling tool changes based on personal experience tends to lead to poor judgment [9]. Acute tool wear can cause tool fracture, chatter, and chipping, which can harm both the operator and the machine tool. Therefore, it is crucial to determine the tool’s condition during actual machining to reduce processing costs and unnecessary downtime due to tool wear [10]. To mitigate expenses in the manufacturing sector, it is essential to perform tool condition monitoring. Most machine downtime is caused by tool failures, and tool wear has a direct impact on the quality and precision of the machined surface [11].

Several physics-based approaches and data-driven models have been developed for accurate tool wear prediction. In this era of Artificial Intelligence (AI), machine learning (ML) is applied in various domains [12,13,14]. The utilization of advanced AI approaches for developing tool health classification increases detection accuracy in tool wear, thereby enhancing productivity and reducing maintenance [15,16]. Traditional prediction methods for milling tool wear typically struggle to produce accurate results because of the underlying system dynamics. The most straightforward approach stops the machining process, removes the tool, and measures the wear area optically for a precise determination. However, such interruptions significantly reduce operational efficiency [17]. LSTM networks offer a distinct advantage in processing sensor time series but are limited in feature extraction [18]. Therefore, it is crucial to develop an advanced strategy that addresses the limitations of earlier research. Hence, a new tool-life estimation model based on deep learning is developed in this work.

Motivation and research gaps of the proposed work: The motivation for this research stems from the growing need to accurately forecast tool life and eliminate unexpected tool failures in modern milling operations, which directly impact cost, productivity, and machining quality in the manufacturing sector [18]. Conventional approaches to tool-life estimation, such as physics-based models and empirical methods, rely primarily on limited experimental data and fail to capture the dynamic, nonlinear, and intricate behavior of real-time machining conditions [18]. Even with the development of machine learning and deep learning approaches in recent years, the majority of models still suffer from specific problems, including limited fusion of multimodal sensor data, inadequate feature representation, and insufficient modeling of temporal dependencies in the data [18]. For instance, most existing approaches utilize only handcrafted or statistical features extracted from current or vibration signals, which overlook the deeper correlations among signals from distinct sensors [18]. Likewise, previous deep learning models such as CNNs, LSTMs, and basic GRUs have improved accuracy but mostly fail to efficiently combine heterogeneous datasets or capture the cross-feature dependencies crucial for reliable tool wear and remaining useful life prediction [18]. To address these problems, the present work introduces a novel MCF-DD-BiGRU technique. The motivation behind this mechanism is to design a highly intelligent and comprehensive model capable of integrating diverse features, namely statistical, deep, and t-SNE-based features, while learning the complex interdependencies among them via an advanced multi-head cross-covariance attention method. This fusion enables the technique to better understand the relationships among sensor modalities, such as current, vibration, and force signals, which are mostly treated independently in previous works. Moreover, the integration of a dilated dense Bi-GRU model enhances the technique’s capability to capture both long-term and short-term temporal dependencies in sequential data, thereby improving the accuracy of tool-life estimation. Hence, this research addresses the limitations of conventional approaches by proposing a hybrid, robust deep learning model that not only enhances prediction accuracy but also provides high reliability and generalizability for real-world industrial milling applications.

The significant contributions of this work are detailed below.

To implement an intelligent tool-life estimation framework in the milling process by training efficient deep learning technology to predict the tool wear. This framework leverages multi-domain feature extraction to extract significant features, thereby achieving higher prediction accuracy. This prediction provides the status of RUL or tools’ current wear to perform proper tool changes at the right time. This tends to improve the machining precision and reduce the unplanned downtime in an intelligent manufacturing system with real-time monitoring.
To convert the raw data into meaningful input features, this work performed the multi-domain feature extraction process for accurate tool-life prediction. The feature extraction utilizes statistical, t-SNE, and FAE to refine the information available for the model. This process effectively generates clear visual clusters of data points from the given data. This process automatically captures the relevant and comprehensive information from complex data for achieving better predictive maintenance.
To develop an MCF-DD-BiGRU for tool-life estimation with multi-head feature fusion. The system utilizes the MCF as the feature learner to determine the given three sets of features completely. Further, the DD-BiGRU performs the estimation based on the fused feature and delivers the present tool value. The model led to higher prediction accuracy, which is suitable for tool condition monitoring. Based on the outcome, the tool wear value can be monitored in real-time and make an intelligent decision to improve the processing quality of the product with a minimum rejection rate.

The remainder of this paper is organized as follows. Section 2 reviews the existing literature on tool-life estimation. Section 3 provides an overview of the designed methodology along with a dataset description. Section 4 describes the different feature extraction processes. Section 5 details the design of the final tool-life estimation approach. Section 6 provides the experimental setup and the comparative analysis. Finally, Section 7 concludes the work with future directions.

2. Literature Review

This section describes and assesses existing works related to machine learning applications for predicting machine tool life.

2.1. Related Works

Khan et al. [19] have suggested using the LSTM model to determine time series sequential data. The suggested model has the potential to achieve an impressive and accurate outcome. The experiments were conducted for different workpiece materials, including brass, aluminum, and mild steel, to achieve a precise prediction. Elminir et al. [20] have designed a tool wear prediction approach for milling machine cutters incorporating the AE and the LSTM. Several steps were involved in the framework, including correlation analysis and multi-domain feature extraction. The target tool was predicted by training, testing, and validating the developed model. The RUL value was estimated by comparing with the value of the predicted tool wear.

Che et al. [21] designed the hybrid method, where the design construction involved three major components to improve the precision of prediction. The most relevant features were extracted through the filtering process from the raw signals, improving the model’s interpretability. An experimental analysis was conducted to evaluate the features and capabilities of the proposed technique. Shah et al. [22] recommended the Morlet wavelet model for determining the vibration signals from the scalogram. The corresponding wavelet functions were selected by applying the relative energy criterion. The stacked LSTM effectively predicted the tool wear better than other approaches.

Wang et al. [23] have adopted the prediction model using the hybrid network with the involvement of the multi-channel fusion. During the tool-life cycle, the multi-source sensor signals were collected, and the computer vision-based feature extraction was performed. The designed approach improved the efficiency of the RUL prediction model. Kamat et al. [24] developed a deep learning-based system for the prediction process to evaluate the tool life by detecting the wear onset. The tool wear onset was estimated by the hybrid model, and its remaining useful life was predicted.

Li et al. [25] suggested the tool RUL prediction model using a convolutional stacked network. The model fused the gathered multi-sensor signal during the cutting process and then generated signal feature mapping based on the tool value. The experiment on milling revealed that the implemented framework not only improved the RUL prediction but also achieved better generalization ability. Kaliyannan et al. [26] adopted the RL-based model to monitor the condition of the tool in the milling process. The tool condition was classified by employing a deep learning model based on the vibration sensor. The outcome showed that the suggested model outperformed the existing models with enhanced efficiency for the overall process.

2.2. Research Gaps and Challenges

Manufacturing sectors are growing due to the integration of automated and intelligent production processes provided by technological advancements. Maintaining the milling machines helps to improve their performance by handling challenging and complex processing issues. However, tool damage or malfunction can affect the processing performance. Therefore, monitoring the tool’s state during processing is essential to address major tool failures. If the machine components are affected, this directly impacts the production of expensive products. Researchers have developed various deep learning techniques to predict the lifetime of milling tools, and some of the common challenges in the existing works are provided below.

Existing approaches, such as SVM or CNN, are not feasible for capturing the complex and non-linear relationships in the sensor data collected during the milling process. Moreover, they are less efficient in analyzing unstructured feature spaces and high-dimensional data.
Most of the models have obtained less prediction accuracy due to the loss of necessary information. Existing models cause this, as they only consider single-feature modalities.
Issues such as inter-feature correlations and suboptimal utilization of diverse data sources may occur in conventional models, as they do not incorporate efficient feature fusion mechanisms.
The computational complexity and memory consumption of the LSTM model are high. Both past and future signal patterns associated with tool wear are not precisely captured by the existing techniques.

Table 1 shows the features and challenges of some of the reported methodologies in the literature.

3. Significance and Overview of the Proposed Tool-Life Estimation Process in Milling

This section explains the importance of predicting tool life in milling. Furthermore, the proposed method is explained in detail.

3.1. Significance of Estimating the Tool Life in Milling

Tool-life estimation in the milling process involves several prediction methods, which forecast the useful life of the tool by considering factors such as workpiece material, depth of cut, cutting speed, tool material, and feed rate. The practical estimation helps assess workpiece quality before tool failure occurs. Hence, it is significant to perform the tool-life estimation to avoid inefficient tool usage and enable better product quality by managing the machined surface finish and accuracy. Accurate prediction of tool life tends to prevent unplanned stoppages and allows better tool replacement to enhance the efficiency of the overall manufacturing process. Some of the key points regarding the significance of tool-life estimation are given below.

Understanding the remaining useful life (RUL) allows maximum utilization of the tool and minimizes the cost of premature regrinding or replacement. In addition, the higher production cost due to unforeseen tool failure is avoided.
In general, tool wear produces poor surface finishes, which affects the suitability and quality of the manufactured part. Hence, tool-life estimation is crucial for addressing this issue.
The scheduled tool changes allowed by this tool-life estimation help avoid interruptions in the production process. Moreover, the selection of optimal settings is achieved by understanding machining parameters to balance productivity.
While developing the automated processes and advanced manufacturing system, it is important to have tool-life estimation for continuous adjustment and monitoring.
Forecasting the tool wear allows the manufacturer to gain knowledge about when the tool becomes unusable. This ensures better procurement and inventory management. In addition, it also improves the process control for a reliable outcome.
Using a worn tool greatly degrades the precision and machining accuracy of the final product, resulting in semi-finished products and wasted expensive materials.

3.2. Proposed Estimation Model and Its Details

The architectural representation of the proposed model is presented in Figure 1.

This work designs a novel tool wear estimation approach that uses an intelligent deep learning framework to estimate the life of the tool in the milling process. Initially, the required tool data for the milling process are gathered from standard resources. In this work, two different datasets corresponding to the tool wear estimation task are taken from the Kaggle website. The datasets contain raw or processed sensor data from the milling process tools. Feature extraction is necessary to improve model performance and to simplify the complex patterns in the original data. Hence, this work involves three feature extraction modules that learn efficient features from the original data: statistical feature extraction, t-SNE (t-Distributed Stochastic Neighbor Embedding), and deep feature extraction. The statistical feature extraction uses metrics such as homogeneity, median, kurtosis, mean, correlation, minimum, and maximum value. The extracted statistical features form feature set 1. The second set of features is extracted with the t-SNE module, which is mainly used for dimensionality reduction of the original data. It processes the complex data based on probability distributions and provides a low-dimensional representation, termed feature set 2. Finally, the deep features are extracted using the FAE (fuzzy autoencoder) model, a self-supervised module that extracts discriminative representations from the given input. The three extracted feature sets are then processed by the proposed MCF-DD-BiGRU (multi-head cross-covariance attention fusion-based dilated dense bi-directional gated recurrent unit) for tool-life estimation. The MCF-DD-BiGRU combines the MCF (multi-head cross-covariance attention fusion) with a Bi-GRU (bi-directional gated recurrent unit) and an added DD (dilated dense) layer. Each feature set from the earlier process is given as input to one head in the multi-head layer. The cross-covariance attention module highlights the meaningful features of the input, with the attention weights computed from the cross-covariance between the keys and queries. The fused features are then processed by the Bi-GRU model, which includes the dilated dense network to improve performance and to learn complex temporal relationships effectively. Finally, a comparative analysis evaluates the performance of the model against different traditional approaches.

3.3. Experimented Dataset Details

This research utilizes two datasets from the Kaggle website that contain the cutting tool data from industrial milling to predict the tool wear. The dataset facilitates the improvement of predictive maintenance strategies for the milling process to minimize the cost and downtime.

Dataset 1: the RUL dataset taken from Kaggle, accessed on 23 September 2025 [27]. The data is collected from the real-time industrial milling process using sensors for the RUL prediction and tool wear estimation.

Dataset 2: CNC milling process dataset obtained from Nature Scientific Data accessed on 23 September 2025 [28]. The dataset contains both the raw data and the processed data of the cutting tool used in the milling process. The dataset contains both the current signals and vibration signals.

From the above dataset, the collected data are represented as Ct_p, where p = 1, 2, 3, …, P. Here, the term P represents the total data gathered from the dataset.

Reasons for choosing these datasets: The RUL dataset and the CNC milling process dataset are both highly relevant for tool-life estimation in the milling task because they include detailed sensor data that reflect real-world industrial conditions. The RUL dataset is mainly designed for predictive modeling tasks associated with tool wear and failure prediction, providing a solid foundation for analyzing how machine condition evolves over time. In addition, the CNC milling process dataset includes both processed and raw data, providing an in-depth view of the milling process. The current signals in this dataset can reflect the power usage or cutting force, while the vibration signals are commonly related to tool wear and machine dynamics. Thus, these datasets provide a rich source of data on tool wear and failure patterns, making them effective for employing advanced predictive approaches, especially deep learning models.

Methods studied on these datasets: Some machine learning techniques, such as Support Vector Machines (SVMs), Random Forests, and K-Nearest Neighbors (KNNs), as well as deep learning techniques such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and FAEs have been studied on these datasets to implement an accurate tool-life prediction process.

4. Different Set of Feature Engineering Mechanisms for Determining the Tool Life

This section describes the feature extraction methods in detail with their mathematical equations. Three feature types are considered: statistical, t-SNE, and FAE features.

Reasons for choosing statistical, t-SNE, and FAE features: The inclusion of statistical, t-SNE, and FAE features as additional inputs in the designed model is motivated by the need to capture complementary and comprehensive data from the milling process, thereby guaranteeing a highly discriminative and richer feature representation for accurate tool-life prediction. Each feature type uniquely contributes to the approach’s understanding of the tool’s condition. The statistical features, such as minimum, maximum, mean, median, homogeneity, correlation, variance, skewness, kurtosis, and entropy, are selected because they offer fundamental descriptive details about the sensor signals, efficiently summarizing the time-domain features of the current, vibration, and force data. These features support recognizing the basic variations and patterns associated with the tool wear and cutting conditions. Nevertheless, the statistical features alone may not entirely capture the complex nonlinear relations in the sensor data. To resolve this problem, the FAE features are included, as FAEs can automatically learn the abstract, deep representations of the input data while handling the noise and uncertainty inherent in industrial signals. The fuzzy strategy enables the approach to represent the imprecise or ambiguous sensor data more efficiently, resulting in improved generalization and robustness. In addition, t-SNE features are added to enhance the approach’s ability to represent high-dimensional data in a lower-dimensional manifold while preserving local neighborhood structures. This supports the disclosure of hidden relations between distinct tool wear states that may not be evident in the original or statistical features. 
By integrating these three feature types, the designed framework achieves a highly detailed understanding of the milling task, enabling highly reliable and accurate tool-life validation compared to approaches that rely on a single type of feature representation.

4.1. Statistical Features

The statistical feature extraction [29] utilizes the statistical and mathematical operations to convert the original data Ct_p into a significant set of derived features. This process retains the essential data by reducing the dimensionality and complexity to generate accurate performance. Some of the metrics used to retrieve the feature set are detailed below.

The minimum value represents the lowest value in each channel.

The maximum value represents the highest possible value in each channel.

The mean represents the arithmetic average of the elements in a dataset by summing all the values and dividing them by their total number using Equation (1).

M e = \frac{1}{n} \sum_{i = 1}^{n} y_{i}

(1)

Here, the term n indicates the number of data points in the given dataset, and y_i indicates the ith data value.

The median is the central value of the sorted dataset, that is, the ((n + 1)/2)th value when n is odd and the average of the two central values when n is even, as given in Equation (2).

M d = {(\frac{n + 1}{2})}^{t h} \text{ value of the sorted data}

(2)

Homogeneity represents the uniformity of data along with its consistency. It generates numerical features based on Equation (3).

H g = \sum_{i} \sum_{j} \frac{y (i, j)}{1 + {(i - j)}^{2}}

(3)

Correlation measures the relation between the two variables based on Equation (4).

C o = \frac{\sum (y_{c} - \tilde{y}) (z_{c} - \tilde{z})}{\sqrt{\sum {(y_{c} - \tilde{y})}^{2} \sum {(z_{c} - \tilde{z})}^{2}}}

(4)

Here, y_c and z_c are the values of the two variables in sample c, and \tilde{y} and \tilde{z} are their respective means.

Variance is the average of the squared differences between each value and the mean. It is evaluated using Equation (5).

V r = \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - \tilde{y})}^{2}

(5)

Here, the term \tilde{y} is the mean and σ = \sqrt{V r} is the standard deviation.

Skewness is the degree of distortion in the dataset relative to a normal distribution. It is estimated using Equation (6).

S k = \frac{d_{3}}{d_{2}^{\frac{3}{2}}}

(6)

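Here, d_{2} = \frac{1}{n} \sum_{c = 1}^{n} {(y_{c} - \tilde{y})}^{2} and d_{3} = \frac{1}{n} \sum_{c = 1}^{n} {(y_{c} - \tilde{y})}^{3} are the second and third central moments, respectively.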

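Kurtosis measures how heavy the tails of the distribution are and is evaluated using Equation (7).

K u = \frac{d_{4}}{d_{2}^{2}}

(7)

Here, d_{4} = \frac{1}{n} \sum_{c = 1}^{n} {(y_{c} - \tilde{y})}^{4} is the fourth central moment and d_{2} is the second central moment.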

Entropy measures the uncertainty of the distribution of values within the set and is determined using Equation (8).

E n = - \sum_{d = 1}^{n} E_{d} \log_{2} E_{d}

(8)

Here, E_d is the probability of the dth value. Together, these metrics form a compact and informative feature set, denoted as C t_{p}^{F 1}, which helps improve the predictive accuracy and computational efficiency of the subsequent deep learning models.
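As an illustration of how several of the metrics above could be computed for one sensor channel, the following is a minimal NumPy sketch, not the author's implementation. The function names `statistical_features` and `feature_set_1`, the histogram-based probability estimate for the entropy, and the bin count `n_bins` are assumptions made for this example. Homogeneity and correlation are omitted because they need a co-occurrence matrix or a second channel.

```python
import numpy as np

def statistical_features(signal, n_bins=16):
    """Return time-domain statistics for one 1-D sensor channel (illustrative)."""
    y = np.asarray(signal, dtype=float)
    n = y.size
    mean = y.sum() / n                      # arithmetic mean, Equation (1)
    centered = y - mean
    d2 = np.mean(centered ** 2)             # second central moment (variance)
    d3 = np.mean(centered ** 3)             # third central moment
    d4 = np.mean(centered ** 4)             # fourth central moment
    # Assumption: probabilities E_d are estimated from a histogram of the values.
    counts, _ = np.histogram(y, bins=n_bins)
    p = counts[counts > 0] / n
    return {
        "min": y.min(),
        "max": y.max(),
        "mean": mean,
        "median": np.median(y),
        "variance": d2,
        "skewness": d3 / d2 ** 1.5 if d2 > 0 else 0.0,
        "kurtosis": d4 / d2 ** 2 if d2 > 0 else 0.0,
        "entropy": float(-(p * np.log2(p)).sum()),
    }

def feature_set_1(window):
    """Concatenate per-channel statistics of a (samples, channels) window."""
    window = np.asarray(window, dtype=float)
    feats = [list(statistical_features(window[:, c]).values())
             for c in range(window.shape[1])]
    return np.concatenate(feats)
```

For a window with two channels, `feature_set_1` returns a vector of 16 values (eight statistics per channel), which plays the role of feature set 1 in this sketch.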

4.2. T-SNE Features

t-SNE [30] is a dimensionality reduction module used for exploratory data analysis and visualization. It can reveal non-linear relationships in the data, which is valuable for identifying the underlying patterns. With this module, the high-dimensional data Ct_p can be visualized in a lower-dimensional space with two or three dimensions. t-SNE facilitates the detection of outliers, clusters, and patterns in complex datasets. This provides knowledge about the data structure and guides the subsequent feature engineering steps.

t-SNE is generally applied to high-dimensional data and preserves the significant structure of the data through a statistical similarity measure. As a non-linear dimensionality reduction model, it converts the pairwise similarities between high-dimensional data points into a probability distribution Q. The similarity S between the data points y_{i} and y_{j} is calculated from their Euclidean distance. The pairwise similarity between the high-dimensional data points is then expressed as a conditional probability using Equation (9).

Q (y_{i} / y_{j}) = \frac{S (y_{i}, y_{j})}{\sum_{n \neq i}^{M} S (y_{i}, y_{n})}

(9)

Similarly, the pairwise similarity between the low-dimensional data points z_i and z_j is evaluated using Equation (10).

R (z_{i} / z_{j}) = \frac{S (z_{i}, z_{j})}{\sum_{n \neq i}^{M} S (z_{i}, z_{n})}

(10)

The low-dimensional points are determined by minimizing the Kullback–Leibler (KL) divergence between the joint probability distributions Q and R, as given in Equation (11).

K L = \sum_{i} \sum_{j} Q (y_{i}, y_{j}) \log \frac{Q (y_{i}, y_{j})}{R (z_{i}, z_{j})}

(11)

Based on this formulation, the t-SNE model achieves the optimal dimensionality reduction by minimizing the KL divergence, which evaluates the projection from the high-dimensional structure to the low-dimensional representation C t_{p}^{F 2}.
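The objective in Equations (9) to (11) can be sketched numerically as follows. This is a simplified illustration under stated assumptions rather than a full t-SNE implementation: the similarity S is assumed to be a Gaussian kernel with a fixed bandwidth `sigma` in the high-dimensional space and a Student-t kernel in the low-dimensional space, the conditional similarities are symmetrized into joint distributions, and no perplexity search or gradient descent is performed. All function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(x):
    """Squared Euclidean distances between all rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return (diff ** 2).sum(axis=-1)

def conditional_similarity(x, kernel):
    """Row-normalized similarities: entry (i, j) is S(i, j) / sum_{n != i} S(i, n)."""
    s = kernel(pairwise_sq_dists(x))
    np.fill_diagonal(s, 0.0)                 # a point is not its own neighbor
    return s / s.sum(axis=1, keepdims=True)

def tsne_kl(y_high, z_low, sigma=1.0, eps=1e-12):
    """KL divergence between high- and low-dimensional joint similarities."""
    gaussian = lambda d2: np.exp(-d2 / (2.0 * sigma ** 2))   # assumed kernel for Eq. (9)
    student_t = lambda d2: 1.0 / (1.0 + d2)                  # assumed kernel for Eq. (10)
    q_cond = conditional_similarity(y_high, gaussian)
    r_cond = conditional_similarity(z_low, student_t)
    m = y_high.shape[0]
    # Assumption: symmetrize the conditionals so each joint distribution sums to 1.
    q = (q_cond + q_cond.T) / (2.0 * m)
    r = (r_cond + r_cond.T) / (2.0 * m)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / (r[mask] + eps))))
```

In practice, a library implementation would search for the low-dimensional points that minimize this value; the sketch only evaluates the objective for a given embedding.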

4.3. Fuzzy Autoencoder

FAE [31] leverages fuzzy logic to extract discriminative representations and provide robust features. The module is self-supervised by the data generated during the training process. The autoencoder converts the data into another space with improved discrimination and superior representation learning. Based on this strategy, the loss function Γ (Y, Θ) is minimized as given in Equation (12).

Γ (Y, Θ) = \min_{Θ, D} \sum_{Y_{i} \in Y} \left[ \frac{η}{2 e} \| Y_{i} - Z_{i} \|^{2} + \frac{1 - η}{2 m} \sum_{j = 1}^{L} \sum_{H_{i} \in D_{j}} \| H_{i} - D_{j} \|^{2} \right]

(12)

Here, the term η is the adjustable parameter and Θ \in \{ω, B_{1}, B_{2}\}. The parameter η regulates the relative influence of the reconstruction loss and the clustering-oriented loss. The reconstruction loss reduces the mean square error and thereby guarantees effective learned representations. The clustering-oriented loss improves the discrimination of the learned features, with the block centers computed using Equation (13).

D_{j} = \frac{\sum_{H_{i} \in D_{j}} v_{i, j} H_{i}}{\sum_{H_{i} \in D_{j}} v_{i, j}}

(13)

In each iteration, the center of the hidden-layer block is denoted D_j. Within each block, the fuzzy optimization increases the sample similarity, which enhances the discrimination of the learned features. Clustering the hidden-layer features in each block during training yields the more separable feature Ct_{p}^{F3}.
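As a rough illustration of Equations (12) and (13), the sketch below computes a membership-weighted block center and combines a reconstruction term with a clustering term. It rests on several assumptions: the constants e and m are not defined in the text and are simply passed in as arguments, the clustering term is evaluated once over all blocks (one possible reading of Equation (12)), and every function name is hypothetical.

```python
def sq_norm_diff(a, b):
    """Squared L2 norm of the difference between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fuzzy_block_center(hidden, memberships):
    """Equation (13): membership-weighted mean of the hidden vectors of one block."""
    total = sum(memberships)
    dim = len(hidden[0])
    return [sum(v * h[d] for v, h in zip(memberships, hidden)) / total
            for d in range(dim)]

def fae_loss(inputs, recons, blocks, eta, e, m):
    """Weighted sum of reconstruction and clustering terms, after Equation (12).

    blocks is a list of (hidden_vectors, memberships) pairs, one per block.
    e and m are normalising constants (assumed; the text does not define them).
    """
    recon = sum(sq_norm_diff(y, z) for y, z in zip(inputs, recons))
    cluster = 0.0
    for hidden, memberships in blocks:
        center = fuzzy_block_center(hidden, memberships)
        cluster += sum(sq_norm_diff(h, center) for h in hidden)
    return eta / (2 * e) * recon + (1 - eta) / (2 * m) * cluster

center = fuzzy_block_center([[0.0, 0.0], [2.0, 2.0]], [1.0, 3.0])
loss = fae_loss(
    inputs=[[1.0, 1.0]],
    recons=[[1.0, 0.0]],
    blocks=[([[0.0, 0.0], [2.0, 2.0]], [1.0, 1.0])],
    eta=0.5, e=1.0, m=1.0,
)
```

Setting η closer to 1 makes the reconstruction term dominate, while values closer to 0 emphasise the compactness of each block around its center.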

5. Calculation of Milling Tool Life Using Multi-Head Cross-Attention for Fusion with Dilated Dense Bi-GRU

This section describes the calculation steps for milling tool life using multi-head cross attention for fusion with a dilated dense Bi-GRU.

5.1. Multi-Head Cross-Covariance Attention Fusion

MCF incorporates multi-head convolution and Cross-Covariance Attention (CCA). The multi-head layer [32] contains a few parallel heads, each of which remains independent. This layer extracts long-range structural information. The given input is embedded according to the sequence length, and the layer generates an output of the same size. Along the time direction, the multi-head attention is computed using Equation (14).

M H (Q^{t}, K^{t}, V^{t}) = [g_{1}; g_{2}; \dots g_{n}] X_{t}^{O}

(14)

Here, X_{t}^{O} \in Γ^{n \times D_{v} \times D_{lay}}, where n indicates the number of heads. In addition, the terms Q^{t}, K^{t}, and V^{t} represent the query, key, and values, respectively, and are connected with g, as shown in Equation (15).

g_{i} = A (Q^{t} W_{t}^{Q}, K^{t} W_{t}^{K}, V^{t} W_{t}^{V})

(15)

Further, the structure allows each input element to be integrated to improve the lower-level feature map. CCA [33] is an attention module that highlights the meaningful features of the input. It is a transposed mechanism that operates across the feature channels, has linear complexity with respect to the token length, and ensures effective input processing. The attention weights are computed from the cross-covariance matrix between the keys and queries, that is, across the feature dimensions. Because the channel number is fixed, this layer allows efficient processing and leads to more robust performance. An illustration of MCF is depicted in Figure 2.
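The channel-wise attention described above can be sketched in a few lines. This is an illustration under assumptions, not the layer used in the paper: it follows a common formulation of cross-covariance attention in which queries and keys are L2-normalised along the token axis and a temperature tau scales the scores, and it omits the output projection of Equation (14). The weights are random, and the sizes, names, and the idea of treating the three feature sets as three tokens are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalise_columns(a):
    """Scale every channel (column) to unit L2 norm along the token axis."""
    norms = [math.sqrt(sum(v * v for v in col)) or 1.0 for col in zip(*a)]
    return [[v / n for v, n in zip(row, norms)] for row in a]

def softmax_rows(a):
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cca_head(q, k, v, tau=1.0):
    """One head: the attention map is channel-by-channel, not token-by-token."""
    q_hat, k_hat = normalise_columns(q), normalise_columns(k)
    k_t = [list(col) for col in zip(*k_hat)]             # (channels, tokens)
    scores = [[s / tau for s in row] for row in matmul(k_t, q_hat)]
    attn = softmax_rows(scores)                           # (channels, channels)
    return matmul(v, attn), attn                          # output is (tokens, channels)

def multi_head_cca(x, w_q, w_k, w_v, n_heads):
    """Project the input, split the channels into heads, and concatenate the results."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0]) // n_heads
    outputs, maps = [], []
    for h in range(n_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        out, attn = cca_head([r[cols] for r in q], [r[cols] for r in k],
                             [r[cols] for r in v])
        outputs.append(out)
        maps.append(attn)
    fused = [sum((out[t] for out in outputs), []) for t in range(len(x))]
    return fused, maps

rng = random.Random(0)
d_model, n_heads = 8, 4

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

features = rand_matrix(3, d_model)   # hypothetical: three feature sets as three tokens
w_q, w_k, w_v = (rand_matrix(d_model, d_model) for _ in range(3))
fused, maps = multi_head_cca(features, w_q, w_k, w_v, n_heads)
```

Each attention map has size (channels per head) × (channels per head) regardless of how many tokens are supplied, which is why the cost grows only linearly with the token length.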

5.2. Bi-GRU

Bi-GRU [34] is an improved version of the GRU network. The structure combines two GRUs, one operating in the forward direction and one in the backward direction. The given input is processed in both directions simultaneously, and it is represented as j_{y}^{G} = [j_{1}^{G}, j_{2}^{G}, j_{3}^{G}, \dots j_{t}^{G}]. This allows the model to preserve both past and future information. In the forward GRU, the input sequence is processed from left to right with the help of the reset and update gates. The update gate controls how the hidden state changes in response to the added data, and it is computed by applying the sigmoid activation function to the inputs. The update gate and the reset gate are evaluated using Equations (16) and (17).

V_{t} = σ (z_{V} . [I_{t - 1}, j_{t}^{G}])

(16)

S_{t} = σ (z_{S} . [I_{t - 1}, j_{t}^{G}])

(17)

The terms z_{V} and z_{S} are the weight matrices of the update gate and the reset gate, respectively. The candidate state provides the new memory content used in the computation of the hidden state. The hidden state and the candidate state are evaluated using Equations (18) and (19).

I_{t} = (1 - V_{t}) \times I_{t - 1} + V_{t} \times {\tilde{G}}_{t}

(18)

{\tilde{G}}_{t} = \tanh (z_{G} . [S_{t} \times I_{t - 1}, j_{t}^{G}])

(19)

The forward and backward computations are evaluated using Equations (20)–(22).

{\vec{G}}_{t} = G R U^{f r w} (j_{t}^{G}, {\vec{G}}_{t - 1})

(20)

{\overset{\leftarrow}{G}}_{t} = G R U^{b c k} (j_{t}^{G}, {\overset{\leftarrow}{G}}_{t - 1})

(21)

G_{t} = {\vec{G}}_{t} \oplus {\overset{\leftarrow}{G}}_{t}

(22)

The output of the Bi-GRU is obtained by concatenating the outputs of the forward and backward GRUs. The general structure of Bi-GRU is depicted in Figure 3.
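The gate computations and the bidirectional pass can be sketched as follows. This is a minimal illustration with assumptions stated up front: biases are omitted, the hidden-state blend is written in the standard GRU form (previous hidden state weighted by one minus the update gate, candidate weighted by the update gate), weights are random, and the layer sizes and function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def gru_step(x_t, h_prev, z_v, z_s, z_g):
    """One GRU step; gates act on the concatenation [h_prev, x_t]."""
    hx = h_prev + x_t
    v_t = [sigmoid(a) for a in matvec(z_v, hx)]            # update gate, cf. Eq. (16)
    s_t = [sigmoid(a) for a in matvec(z_s, hx)]            # reset gate, cf. Eq. (17)
    gated = [s * h for s, h in zip(s_t, h_prev)] + x_t
    cand = [math.tanh(a) for a in matvec(z_g, gated)]      # candidate, cf. Eq. (19)
    # Standard GRU blend of previous state and candidate (assumed form).
    return [(1.0 - v) * h + v * c for v, h, c in zip(v_t, h_prev, cand)]

def run_gru(xs, params, hidden):
    """Run a GRU over a sequence, starting from a zero hidden state."""
    h = [0.0] * hidden
    states = []
    for x_t in xs:
        h = gru_step(x_t, h, *params)
        states.append(h)
    return states

def bi_gru(xs, fwd_params, bwd_params, hidden):
    """Concatenate forward and backward states at every time step, cf. Eqs. (20)-(22)."""
    fwd = run_gru(xs, fwd_params, hidden)
    bwd = run_gru(xs[::-1], bwd_params, hidden)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

rng = random.Random(0)
n_in, hidden = 2, 3

def random_params():
    # Three weight matrices (z_V, z_S, z_G), each of shape (hidden, hidden + n_in).
    return tuple(
        [[rng.uniform(-0.5, 0.5) for _ in range(hidden + n_in)] for _ in range(hidden)]
        for _ in range(3)
    )

fwd_params, bwd_params = random_params(), random_params()
signal = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [0.2, 0.2]]
states = bi_gru(signal, fwd_params, bwd_params, hidden)
```

Changing only the last input leaves the forward half of the first state untouched but alters its backward half, which is the sense in which each time step sees both past and future context.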

5.3. Proposed MCF-DD-BiGRU for Estimation

This framework implements the MCF-DD-BiGRU to learn from the extracted features and predict the tool condition for different machining processes. The main aim of the model is to monitor the tool wear value in real time.

MCF-DD-BiGRU is constructed by incorporating the MCF into the baseline Bi-GRU, which includes added dense and dilated layers.

Reasons for using DBi-GRU: The DBi-GRU is used instead of the standard GRU because the sequential data obtained from the milling task contain both complex temporal dynamics and long-term dependencies that must be captured efficiently. Although the standard GRU is effective in modeling temporal relations, it often struggles to retain long-range dependencies in lengthy time-series sensor data, such as current and vibration signals, which reflect gradual changes in tool wear patterns. The dilated connections let the model skip specific time steps during the information flow, thus expanding the receptive field without increasing the computational expense or network depth. This enables the DBi-GRU to capture broader context over longer sequences, which is important for precisely tracking the gradual degradation trends of the tool. In addition, the bi-directional design lets the model learn from both future and past temporal contexts simultaneously, so that each prediction draws on the entire temporal scope of the signal rather than only the previous time steps. This is important for tool-life estimation, where the future signal behavior can offer useful cues about the present state of wear. Thus, the integration of dilation and bi-directionality enables the designed DBi-GRU to outperform the standard GRU by providing a more detailed temporal representation, more effective feature learning, and higher prediction accuracy, particularly in scenarios with the non-linear, complex, and long-term temporal dependencies typical of real-world milling data.

Here, the MCF performs feature fusion, and the DD-BiGRU performs sequence modeling. The three feature sets extracted in the previous step are given as input to the MCF-DD-BiGRU model. The three sets of representation, Ct_{p}^{F1}, Ct_{p}^{F2}, and Ct_{p}^{F3}, are initially processed by each head in MCF. The data are jointly analyzed in this layer at different positions, so that the layer attends to various aspects of the input data simultaneously. The multiple heads enable the network to focus on significant data from different features at the same time and to determine long-range dependencies between sensor modalities. The outputs of the parallel heads are linearly transformed to generate the final result, which helps the network capture different relationships. The model can fuse diverse data with different characteristics and adaptively learn the relationships and dependencies between them. The CCA operates across the feature channel, in conjunction with the token sequence. The cross-covariance matrix is computed from the key and query generated in each channel, which aggregates data from other channels. Here, the features from different channels are weighted and combined dynamically. The data-rich feature generated by this fusion process benefits the subsequent processing.

Unlike plain concatenation, the attention mechanism focuses selectively on the features that are relevant to the prediction task. The features are further processed through the layers of the DD-BiGRU. The Bi-GRU processes the features in both forward and backward directions to capture the past and future complex sequential patterns. DD [35] incorporates dilated convolution into the DenseNet structure. The feature reuse encouraged by the strengthened feature propagation mitigates the vanishing-gradient issue, enabling a deeper and more robust model, and it improves the data flow and the effective receptive field of the convolutional network. The dilation factor is chosen on the basis of the layer depth. The transition layer contains both a pooling layer and a convolution layer. Placing the dilation in the transition layer helps to evaluate multi-scale features effectively.
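The effect of dilation on the receptive field can be made concrete with a short sketch. It assumes stride-1 one-dimensional convolutions stacked one after another; the kernel size, dilation rates, and function names are illustrative and are not taken from the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convolutions with the given dilation rates."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def dilated_conv1d(x, kernel, dilation):
    """'Valid' 1-D dilated convolution (cross-correlation) over a sequence."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(w * x[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(len(x) - span)
    ]

plain = receptive_field(3, [1, 1, 1])      # three ordinary layers
dilated = receptive_field(3, [1, 2, 4])    # same depth, growing dilation
ramp = [float(i) for i in range(10)]
diff = dilated_conv1d(ramp, [1.0, 0.0, -1.0], dilation=2)
```

With three layers of kernel size 3, the plain stack covers 7 time steps, whereas dilation rates of 1, 2, and 4 cover 15 with the same number of weights.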

The integration of efficient sequence modeling with the advanced feature fusion results in more robust and accurate predictions. The complex temporal relationships are learned effectively by the model during the wear progression. The result of the Bi-GRU estimates the level of current tool wear. Based on the results, the network can achieve high accuracy in processing real-time data, which makes it suitable for developing the tool condition monitoring system.

Exploration of a three-level feature extraction strategy (statistical, t-SNE, and FAE) combined with a multi-head cross-covariance attention mechanism and different recurrent architectures: The designed work uses the DD-BiGRU model for estimating the tool life. To examine its superiority, the same feature extraction and attention fusion are combined with other recurrent architectures, namely the LSTM, GRU, and RNN, as described below.

Three-Level Feature Extraction + Multi-Head Attention + LSTM: This framework starts with the three-level feature extraction mechanism, including statistical features, t-SNE-based nonlinear manifold features, and FAE deep features, each capturing various characteristics of the milling process. These features are fused by employing the multi-head cross-covariance attention mechanism, which improves the representation by modeling the interdependencies between the feature groups. When an LSTM model executes this fused representation, the framework employs its memory cells to learn the long-term temporal patterns, but its complex gating structure often limits its capability to efficiently use all its temporal variations in the fused feature.

Three-Level Feature Extraction + Multi-Head Attention + GRU: Utilizing the same three-level feature extraction method, the statistical, t-SNE, and FAE-derived features are integrated via the multi-head cross-covariance attention module to produce the rich fused representation. This fused feature set is further given to the GRU model, which simplifies the learning tasks via the reduced gating strategy while still capturing the main temporal dependencies. Though GRU performs more effectively than LSTM and responds better to the attention-improved features, it still struggles to entirely model the intricate multi-scale temporal patterns embedded in the fused feature space.

Three-Level Feature Extraction + Multi-Head Attention + RNN: Likewise, the three-level features (statistical, t-SNE, and FAE) are fused by the multi-head cross-covariance attention mechanism before being passed to the conventional RNN. While the RNN provides a baseline for temporal sequence modeling, it lacks gating blocks and thus has difficulty retaining long-term dependencies. As a result, its capability to learn from the rich fused features is limited, leading to lower prediction accuracy than the more advanced recurrent models.

Superiority of the designed DD-BiGRU approach: Although the LSTM, GRU, and RNN benefit from the attention-fused three-level features, these models do not entirely capture the complex temporal dynamics of the milling process. To resolve these problems, the designed DD-BiGRU combines dilated connections, dense feature reuse, and bidirectional temporal learning. When integrated with the same multi-head attention-fused features, the designed DD-BiGRU attains higher tool-life prediction performance, demonstrating its superiority over the LSTM, GRU, and RNN in modeling complex, multiscale patterns in milling data.

Figure 4 represents the architectural view of the designed MCF-DD-BiGRU for tool-life estimation.

6. Results and Discussion

This section discusses the results obtained based on different performance metrics.

6.1. Experimental Setup

The designed tool wear estimation model was implemented in Python 3.10, and the experiments were carried out on a computer with an Intel Xeon® E5-2630 CPU and 32 GB of RAM. Several performance measures were used to evaluate the model’s performance in estimating tool wear. The comparative methods used in the experimentation were LSTM [19], LSTM-AE [20], GAN-LSTM [22], and CNN-LSTM [5]. The training/testing splits were performed at the session level to eliminate temporal or identity leakage, so each session appears in only one split. Dataset 1 was split 80/20 into 1120 training sessions and 280 test sessions. Dataset 2 was split 80/20 into 774 training sessions and 193 test sessions. No session, subject, or tool instance appears in more than one split. Further, the initial experimental parameters of the model were epochs: 100; steps per epoch: 10; batch size: 32; optimizer: Adam; and hidden neuron count: 128. Table 2 presents the hyperparameter search ranges and the final picks for the designed model.
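A session-level 80/20 split of the kind described above could look like the following sketch. It is an assumption about the procedure rather than the author's code: sessions are shuffled with a fixed seed, 20% of them (rounded) are held out, and every sample inherits the split of its session. The function name, the seed, and the five samples per session are hypothetical; only the session counts come from the text.

```python
import random

def session_level_split(session_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that each session lands in exactly one split."""
    sessions = sorted(set(session_ids))
    random.Random(seed).shuffle(sessions)
    n_test = round(len(sessions) * test_fraction)
    test_sessions = set(sessions[:n_test])
    train_idx = [i for i, s in enumerate(session_ids) if s not in test_sessions]
    test_idx = [i for i, s in enumerate(session_ids) if s in test_sessions]
    return train_idx, test_idx

def count_sessions(session_ids, indices):
    return len({session_ids[i] for i in indices})

# Dataset 1 has 1400 sessions and Dataset 2 has 967 (from the reported split sizes).
ids_1 = [s for s in range(1400) for _ in range(5)]
ids_2 = [s for s in range(967) for _ in range(5)]
train_1, test_1 = session_level_split(ids_1)
train_2, test_2 = session_level_split(ids_2)
```

Under these assumptions the split reproduces the reported 1120/280 and 774/193 session counts.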

6.2. Evaluation Metrics

Accuracy, mean absolute error (MAE), mean absolute percentage error (MAPE), and mean squared error (MSE) were evaluated using Equations (23)–(26).

A c c u r a c y = \frac{T P + T N}{T P + T N + F P + F N}

(23)

The terms

T P

and

T N

represent the true positive and true negative.

F P

and

F N

denote the false positive and false negative.

MAE = \frac{\sum_{i = 1}^{z} \left| g_{i} - y_{i} \right|}{n}

(24)

MAPE = \frac{1}{n} \sum_{i = 1}^{z} \left| \frac{g_{i} - y_{i}}{g_{i}} \right|

(25)

MSE = \frac{\sum_{i = 1}^{z} {(g_{i} - y_{i})}^{2}}{n}

(26)

Here, the predicted and observed values are indicated by the terms g_i and y_i, respectively.

Root mean square error (RMSE), mean absolute scaled error (MASE), mean percentage error (MPE), and symmetric mean absolute percentage error (SMAPE) were calculated using Equations (27)–(30).

R M S E = \sqrt{\frac{\sum_{i = 1}^{z} {(g_{i} - y_{i})}^{2}}{n}}

(27)

M A S E = \frac{1}{n} \sum_{i = 1}^{z} (\frac{y_{i} - g_{i}}{y_{i}})

(28)

M P E = \frac{1}{n} \sum_{i = 1}^{z} (\frac{y_{i} - g_{i}}{y_{i}} \times 100)

(29)

SMAPE = \frac{100\%}{n} \sum_{i = 1}^{z} \frac{\left| g_{i} - y_{i} \right|}{(g_{i} + y_{i}) / 2}

(30)
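For reference, the error measures can be sketched in plain Python. The functions below use the common textbook forms, which is an assumption: percentage errors are expressed in percent and taken relative to the observed value, and SMAPE uses absolute values in the denominator. MASE is left out because it needs a baseline forecast that the text does not specify. The function names and sample values are hypothetical.

```python
import math

def mae(pred, obs):
    return sum(abs(g - y) for g, y in zip(pred, obs)) / len(obs)

def mse(pred, obs):
    return sum((g - y) ** 2 for g, y in zip(pred, obs)) / len(obs)

def rmse(pred, obs):
    return math.sqrt(mse(pred, obs))

def mape(pred, obs):
    # Assumes no observed value is zero.
    return 100.0 * sum(abs((y - g) / y) for g, y in zip(pred, obs)) / len(obs)

def mpe(pred, obs):
    # Signed version of MAPE; positive and negative errors can cancel.
    return 100.0 * sum((y - g) / y for g, y in zip(pred, obs)) / len(obs)

def smape(pred, obs):
    return 100.0 * sum(
        abs(g - y) / ((abs(g) + abs(y)) / 2.0) for g, y in zip(pred, obs)
    ) / len(obs)

predicted = [2.0, 4.0, 6.0]
observed = [1.0, 4.0, 8.0]
scores = {f.__name__: f(predicted, observed) for f in (mae, mse, rmse, mape, mpe, smape)}
```

For these sample values the MAE is 1.0 and the MPE is −25%, which shows how the signed MPE can understate the error that MAPE (about 41.7%) reports.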

6.3. K-Fold-Based Performance Analysis of MCF-DD-BiGRU for Tool-Life Estimation

K-fold cross-validation is a robust way to evaluate the performance of the designed model for tool wear prediction, and it gives a reliable estimate of the network’s generalization. The comparison between MCF-DD-BiGRU and the other traditional networks in terms of tool-life estimation is shown in Figure 5 for both datasets. In this study, the experiments are performed with k = 3, 4, and 5 to check the reliability and stability of the model’s performance across distinct data partition settings. The folds are constructed with a run/tool-grouped strategy rather than random splitting, meaning that all samples originating from the same machining run or the same tool wear cycle are kept within the same fold. This prevents the framework from encountering data segments in the test set that are operationally or temporally correlated with the training set. To further prevent temporal leakage, the sequence data are partitioned on the basis of their chronological order within each tool run, guaranteeing that future time steps never appear in the training part of the fold. These precautions ensure that the validation reflects a realistic prediction scenario and that the designed framework does not unintentionally benefit from data leakage across folds. From the analysis, the MCF-DD-BiGRU achieved better accuracy than the LSTM by 2.6%, the LSTM-AE by 1.4%, the GAN-LSTM by 2.1%, and the CNN-LSTM by 0.7%. These outcomes indicate that the presented MCF-DD-BiGRU provides more stable tool wear predictions than the other comparative models.
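The run-grouped fold construction can be sketched as follows. This is a simplified illustration, not the procedure used in the experiments: runs are assigned to folds round-robin after sorting, the chronological partitioning within a run is not shown, and the run identifiers, sample counts, and function name are hypothetical.

```python
def grouped_k_fold(run_ids, k):
    """Yield (train_idx, test_idx) pairs in which every run stays inside one fold."""
    runs = sorted(set(run_ids))
    fold_of_run = {run: i % k for i, run in enumerate(runs)}
    for fold in range(k):
        test = [i for i, r in enumerate(run_ids) if fold_of_run[r] == fold]
        train = [i for i, r in enumerate(run_ids) if fold_of_run[r] != fold]
        yield train, test

# Ten hypothetical machining runs with six samples each.
run_ids = ["run_%02d" % r for r in range(10) for _ in range(6)]
splits = {k: list(grouped_k_fold(run_ids, k)) for k in (3, 4, 5)}
```

Because whole runs are moved together, no test fold shares a machining run with its training set, and every sample is tested exactly once for each value of k.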

6.4. Performance Assessment of MCF-DD-BiGRU Based on Batch Size

In this experiment, the model performance is analyzed for different batch sizes. The designed MCF-DD-BiGRU for tool-life prediction is compared with the other models in Figure 6. This analysis helps to evaluate how the batch size affects the prediction model, since a smaller batch size updates the parameters more frequently. Based on the graphical analysis, the MCF-DD-BiGRU achieved higher accuracy than the LSTM by 2.52%, the LSTM-AE by 1.68%, the GAN-LSTM by 2.27%, and the CNN-LSTM by 1.61%. The results show that the designed model provides a successful prediction for both datasets. This reveals the efficiency of the MCF-DD-BiGRU in predicting tool wear.

6.5. Epoch-Based Comparative Analysis for Designed MCF-DD-BiGRU

Some tracking model metrics are used in the epoch-based performance analysis to evaluate the model’s overfitting and underfitting based on the given dataset. The performance of the MCF-DD-BiGRU for tool-life prediction is compared to that of other conventional networks in Table 3 for the given dataset. The model’s parameters are updated during each epoch based on the measure of training data error, guiding the network to learn iteratively. From the comparison, the designed MCF-DD-BiGRU has a lower MPE value than the LSTM by 72.3%, the LSTM-AE by 39.6%, the GAN-LSTM by 55.7%, and the CNN-LSTM by 25.1%. According to the analysis, the proposed MCF-DD-BiGRU exhibits better prediction capability with fewer errors compared to other models.

The experimental results clearly demonstrate that the proposed MCF-DD-BiGRU framework is effective in predicting tool wear and estimating tool life during milling operations. The MCF-DD-BiGRU consistently achieved higher prediction accuracy and lower error rates than baseline models, including LSTM, LSTM-AE, GAN-LSTM, and CNN-LSTM, across various evaluation metrics, such as MPE, SMAPE, RMSE, and MASE. The two main reasons for these improvements are (i) the strong feature representation made possible by combining statistical, t-SNE, and fuzzy autoencoder-based deep features; and (ii) the cross-covariance attention mechanism, which makes the model better at finding complex relationships between different data sources.

The results in Table 3 and Figures 5 and 6 demonstrate that the proposed method continues to perform better even when the batch sizes and epoch iterations are changed. This indicates its strong generalization capabilities. The multi-head attention fusion enables the network to utilize complementary details from different feature spaces, thereby improving temporal pattern recognition. This differs from traditional models that only utilize single-modality features. The addition of dilated dense layers to the Bi-GRU structure also enables efficient modeling of long-term dependencies without excessively increasing the number of parameters. This solves some of the scalability problems that LSTM-based architectures have.

These results are beneficial for real-time tool condition monitoring, where accurate and timely predictions of RUL can significantly cut down on downtime, make tool-change intervals more efficient, and boost machining productivity. The suggested method aligns with the general trend of utilizing AI-driven predictive maintenance strategies in smart manufacturing settings, resulting in lower operating costs and improved resource utilization.

6.6. Contributions of Features in Designed MCF-DD-BiGRU

The implemented model utilizes three distinct feature types, namely statistical, t-SNE, and FAE features, to estimate tool life in the milling process. Figure 7 shows the accuracy analysis of these features. Although the statistical, t-SNE, and FAE features are extracted to capture various aspects of the milling process, the results show that the most significant contribution comes from the fused feature (accuracy: 96% for both datasets). These fused features are generated via the multi-head cross-covariance attention method, which improves the interdependencies between all three feature types. As a result, no single feature type dominates individually; instead, the attention-driven fusion of all features provides the most substantial impact.

6.7. Ablation Study

The ablation study, presented in Table 4, was performed to validate the contribution of each component in the designed MCF-DD-BiGRU model, particularly the impact of eliminating the three-level feature extraction and the multi-head cross-covariance attention block. The results across both Dataset 1 and Dataset 2 demonstrate that accuracy improves significantly from the basic GRU through the Bi-GRU, dilated Bi-GRU, dense Bi-GRU, and dilated dense Bi-GRU as additional architectural improvements are introduced. However, none of these models attains the performance of the designed MCF-DD-BiGRU model, which provides the highest accuracy, 96.52% on Dataset 1 and 96.49% on Dataset 2. This performance gain indicates that the combination of multi-head attention-based fused features with the DD-BiGRU improves temporal feature learning.

6.8. Computational Complexity Analysis

A detailed complexity analysis of the designed MCF-DD-BiGRU is presented in Table 5, which contrasts its training time, testing time, total computational time, and computational space with those of baseline models. Across both Dataset 1 and Dataset 2, the designed framework demonstrates lower training time (35.28 min for Dataset 1 and 37.65 min for Dataset 2) and testing time (10.53 min and 10.00 min). The minimization of execution time results in improved computation time, which is lower than that of all comparative approaches, indicating that the designed MCF-DD-BiGRU is not only accurate but also highly effective. Moreover, the designed model exhibits one of the smallest computational space footprints for both datasets (205 KB and 200 KB). These results clearly illustrate that, though MCF-DD-BiGRU includes multi-head attention, dense connections, and dilated BiGRU layers, its optimized model provides superior predictive performance while maintaining minimized computational overhead, making it both effective and computationally efficient compared to other high-complexity techniques.

6.9. Impact of Modifications to Key Parameters on Results

The performance of the designed MCF-DD-BiGRU technique is influenced by several primary parameters, and modifying them revealed their individual impacts on computational efficiency and prediction accuracy. Increasing the hidden size improved the ability of the model to capture long-term temporal dependencies, but sizes larger than 128 offered negligible accuracy gains while increasing the computational cost. Varying the number of attention heads showed that too few heads constrained the feature interaction learning, whereas too many heads introduced redundancy without any improvement, making four heads optimal. The dilation rate affected the temporal receptive field: higher rates allowed the learning of long-term dependencies, but very large values can skip significant short-term patterns. Likewise, the depth of the dense blocks improves feature reuse only up to a certain point, with one layer being enough for this process. The dropout rate tuning showed that lower rates caused overfitting, while a higher rate (0.4) balanced stability and generalization. Adjusting the number of epochs and the batch size also influenced performance: a small batch size slowed training, and too few epochs prevented convergence, whereas large batches and many epochs did not offer additional accuracy gains. Finally, the k-fold setting confirmed that the accuracy of the model is robust across distinct data splits. Thus, this analysis illustrates that careful tuning of each parameter is important to attain a balanced trade-off among convergence stability, predictive accuracy, and computational efficiency, and it demonstrates the robustness of the designed MCF-DD-BiGRU approach.

7. Conclusions

This study proposed an improved deep learning framework named MCF-DD-BiGRU to make it easier to estimate tool wear and predict the remaining useful life of tools in milling processes. A three-level feature extraction strategy, statistical, t-SNE, and FAE, was used to extract diverse and informative feature sets from raw sensor data. A multi-head cross-covariance attention mechanism was utilized to combine these features. This made it easier to see how they were related and improved the quality of the representation. After that, the fused features were run through a dilated dense Bi-GRU network to find patterns over time and obtain a very accurate estimate of tool wear.

Two datasets were used to test the proposed model, and the results showed that it worked better than traditional LSTM, LSTM-AE, GAN-LSTM, and CNN-LSTM models. For a batch size of 64 on the second dataset, the MAE of the designed MCF-DD-BiGRU model was reduced by 62.5% compared to the LSTM, by 43.75% compared to the LSTM-AE, by 50% compared to the GAN-LSTM, and by 18.75% compared to the CNN-LSTM. In addition, the model achieved an improvement of up to 3.4% in accuracy and demonstrated reduced prediction errors across multiple metrics, indicating its robustness and suitability for real-world applications.

This study provides a scalable, interpretable, and precise framework for estimating tool life, suitable for integration into contemporary predictive maintenance systems. The framework helps reduce unplanned downtime, improve machining accuracy, and make manufacturing operations more cost-effective by making it easier to predict tool wear.

Limitations and Future Research Directions: The developed MCF-DD-BiGRU framework shows great promise for predicting tool wear and tool life, but there are still some challenges that need to be explored further. One major problem with the model is that it is very sensitive to hyperparameter tuning. To obtain the best performance right now, one has to make manual adjustments, which can take a long time and may not always provide the optimal settings for all operating conditions. This shows that future work needs automated tuning strategies, such as advanced optimization techniques or reinforcement learning-based methods that can search for the best hyperparameters on their own.

Another limitation concerns the dependency on a specific set of publicly available datasets. While these datasets are well-established benchmarks, they may not fully capture the complexity of real industrial machining environments, where factors such as non-stationary noise, varying tool geometries, and mixed-material machining are often present. Therefore, future studies should focus on validating and enhancing the model’s robustness through transfer learning and domain adaptation strategies that allow better generalization across diverse operational settings.

Funding

This research was funded by the Ongoing Research Funding Program (ORF-2025-499), King Saud University, Riyadh, Saudi Arabia.

Data Availability Statement

The data presented in this study are openly available at https://www.kaggle.com/datasets/programmer3/milling-tool-wear-and-rul-dataset (accessed on 25 August 2025) and https://doi.org/10.1038/s41597-025-04923-y (accessed on 25 August 2025).

Acknowledgments

During the preparation of this manuscript, the author used ChatGPT 5 for language corrections in some sections only. The author has reviewed and edited the output and takes full responsibility for the content of this publication.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

t-SNE	t-Distributed Stochastic Neighbor Embedding
MCF-DD-BiGRU	Multi-head cross-covariance attention fusion-based dilated dense bi-directional gated recurrent unit
FAE	Fuzzy autoencoder
RUL	Remaining useful life
AI	Artificial Intelligence
ML	Machine learning
DBN	Dynamic Bayesian Network
MCF	Multi-head cross-covariance attention fusion
Bi-GRU	Bi-directional gated recurrent unit
DD	Dilated dense
SVM	Support Vector Machine
KNN	K-Nearest Neighbor
RNN	Recurrent Neural Network
CNN	Convolutional Neural Network
CCA	Cross-Covariance Attention
MAE	Mean absolute error
MAPE	Mean absolute percentage error
MSE	Mean squared error
RMSE	Root mean square error
MASE	Mean absolute scaled error
MPE	Mean percentage error
SMAPE	Symmetric mean absolute percentage error
KL	Kullback–Leibler divergence

References

1. Wang, X.; Yan, J. Deep learning based multi-source heterogeneous information fusion framework for online monitoring of surface quality in milling process. Eng. Appl. Artif. Intell. 2024, 133, 108043.
2. Liu, R.; Tian, W. A novel simultaneous monitoring method for surface roughness and tool wear in milling process. Sci. Rep. 2025, 15, 8079.
3. Hojati, F.; Azarhoushang, B.; Daneshi, A.; Hajyaghaee Khiabani, R. Prediction of Machining Condition Using Time Series Imaging and Deep Learning in Slot Milling of Titanium Alloy. J. Manuf. Mater. Process. 2022, 6, 145.
4. Ahmed, M.; Kamal, K.; Ratlamwala, T.A.H.; Hussain, G.; Alqahtani, M.; Alkahtani, M.; Alatefi, M.; Alzabidi, A. Tool Health Monitoring of a Milling Process Using Acoustic Emissions and a ResNet Deep Learning Model. Sensors 2023, 23, 3084.
5. Bhandari, B.; Park, G. Non-contact surface roughness evaluation of milling surface using CNN-deep learning models. Int. J. Comput. Integr. Manuf. 2024, 37, 423–437.
6. Umar, M.; Siddique, M.F.; Ullah, N.; Kim, J.-M. Milling Machine Fault Diagnosis Using Acoustic Emission and Hybrid Deep Learning with Feature Optimization. Appl. Sci. 2024, 14, 10404.
7. Karabacak, Y.E. Deep learning-based CNC milling tool wear stage estimation with multi-signal analysis. Eksploat. I Niezawodn. Maint. Reliab. 2023, 25, 168082.
8. Farhani, G.; Kurukuri, S.; Myers, R.; Santos, N.; Tauhiduzzaman, M. Unlocking Dual Utility: 1D-CNN for Milling Tool Health Assessment and Experimental Optimization. IEEE Access 2024, 12, 105096–105107.
9. Hu, N.; Liu, Z.; Jiang, S.; Li, Q.; Zhong, S.; Chen, B. Remaining Useful Life Prediction of Milling Tool Based on Pyramid CNN. Shock Vib. 2023, 2023, 1830694.
10. Sayyad, S.; Kumar, S.; Bongale, A.; Kotecha, K.; Abraham, A. Remaining Useful-Life Prediction of the Milling Cutting Tool Using Time–Frequency-Based Features and Deep Learning Models. Sensors 2023, 23, 5659.
11. Zhu, M.; Zhang, J.; Bu, L.; Nie, S.; Bai, Y.; Zhao, Y.; Mei, N. Methodology and Experimental Verification for Predicting the Remaining Useful Life of Milling Cutters Based on Hybrid CNN-LSTM-Attention-PSA. Machines 2024, 12, 752.
12. Abidi, M.H.; Alkhalefah, H.; Umer, U. Fuzzy harmony search based optimal control strategy for wireless cyber physical system with industry 4.0. J. Intell. Manuf. 2022, 33, 1795–1812.
13. Abidi, M.H.; Alkhalefah, H.; Umer, U.; Mohammed, M.K. Blockchain-based secure information sharing for supply chain management: Optimization assisted data sanitization process. Int. J. Intell. Syst. 2021, 36, 260–290.
14. Abidi, M.H. Multimodal data-based human motion intention prediction using adaptive hybrid deep learning network for movement challenged person. Sci. Rep. 2024, 14, 30633.
15. Cen, Z.; Hu, S.; Hou, Y.; Chen, Z.; Ke, Y. Remaining useful life prediction of machinery based on improved Sample Convolution and Interaction Network. Eng. Appl. Artif. Intell. 2024, 135, 108813.
16. Abidi, M.H.; Mohammed, M.K.; Alkhalefah, H. Predictive Maintenance Planning for Industry 4.0 Using Machine Learning for Sustainable Manufacturing. Sustainability 2022, 14, 3387.
17. Danish, M.; Gupta, M.K.; Irfan, S.A.; Ghazali, S.M.; Rathore, M.F.; Krolczyk, G.M.; Alsaady, A. Machine learning models for prediction and classification of tool wear in sustainable milling of additively manufactured 316 stainless steel. Results Eng. 2024, 22, 102015.
18. Omole, S.; Dogan, H.; Lunt, A.J.G.; Kirk, S.; Shokrani, A. Using machine learning for cutting tool condition monitoring and prediction during machining of tungsten. Int. J. Comput. Integr. Manuf. 2024, 37, 747–771.
19. Khan, F.; Kamal, K.; Ratlamwala, T.A.H.; Alkahtani, M.; Almatani, M.; Mathavan, S. Tool Health Classification in Metallic Milling Process Using Acoustic Emission and Long Short-Term Memory Networks: A Deep Learning Approach. IEEE Access 2023, 11, 126611–126633.
20. Elminir, H.K.; El-Brawany, M.A.; Ibrahim, D.A.; Elattar, H.M.; Ramadan, E.A. An efficient deep learning prognostic model for remaining useful life estimation of high speed CNC milling machine cutters. Results Eng. 2024, 24, 103420.
21. Che, Z.; Peng, C.; Liao, T.W.; Wang, J. Improving milling tool wear prediction through a hybrid NCA-SMA-GRU deep learning model. Expert Syst. Appl. 2024, 255, 124556.
22. Shah, M.; Vakharia, V.; Chaudhari, R.; Vora, J.; Pimenov, D.Y.; Giasin, K. Tool wear prediction in face milling of stainless steel using singular generative adversarial network and LSTM deep learning models. Int. J. Adv. Manuf. Technol. 2022, 121, 723–736.
Wang, S.; Yu, Z.; Xu, G.; Zhao, F. Research on Tool Remaining Life Prediction Method Based on CNN-LSTM-PSO. IEEE Access 2023, 11, 80448–80464. [Google Scholar] [CrossRef]
Kamat, P.; Kumar, S.; Kotecha, K. DeepTool: A deep learning framework for tool wear onset detection and remaining useful life prediction. MethodsX 2024, 13, 102965. [Google Scholar] [CrossRef] [PubMed]
Li, X.; Liu, X.; Yue, C.; Wang, L.; Liang, S.Y. Data-model linkage prediction of tool remaining useful life based on deep feature fusion and Wiener process. J. Manuf. Syst. 2024, 73, 19–38. [Google Scholar] [CrossRef]
Kaliyannan, D.; Thangamuthu, M.; Pradeep, P.; Gnansekaran, S.; Rakkiyannan, J.; Pramanik, A. Tool Condition Monitoring in the Milling Process Using Deep Learning and Reinforcement Learning. J. Sens. Actuator Netw. 2024, 13, 42. [Google Scholar] [CrossRef]
Milling Tool Wear and RUL Dataset; Kaggle: San Francisco, CA, USA, 2025. Available online: https://www.kaggle.com/datasets/programmer3/milling-tool-wear-and-rul-dataset (accessed on 21 November 2025).
Piecuch, G.; Żabiński, T. A new open dataset from a milling process—Data for classification and estimation of tool life. Sci. Data 2025, 12, 650. [Google Scholar] [CrossRef]
Kanimozhi, M.; Roselin, R. Statistical Feature Extraction and Classification using Machine Learning Techniques in Brain-Computer Interface. Int. J. Innov. Technol. Explor. Eng. 2020, 9, 1754–1758. [Google Scholar] [CrossRef]
Alalayah, K.M.; Senan, E.M.; Atlam, H.F.; Ahmed, I.A.; Shatnawi, H.S.A. Effective Early Detection of Epileptic Seizures through EEG Signals Using Classification Algorithms Based on t-Distributed Stochastic Neighbor Embedding and K-Means. Diagnostics 2023, 13, 1957. [Google Scholar] [CrossRef]
Yang, W.; Wang, H.; Zhang, Y.; Liu, Z.; Li, T. Self-supervised Discriminative Representation Learning by Fuzzy Autoencoder. ACM Trans. Intell. Syst. Technol. 2022, 14, 11. [Google Scholar] [CrossRef]
Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 5797–5808. [Google Scholar]
Sarker, M.M.K.; Singh, V.K.; Alsharid, M.; Hernandez-Cruz, N.; Papageorghiou, A.T.; Noble, J.A. COMFormer: Classification of Maternal-Fetal and Brain Anatomy Using a Residual Cross-Covariance Attention Guided Transformer in Ultrasound. IEEE Trans. Ultrason. Ferroelectr. Freq. Control 2023, 70, 1417–1427. [Google Scholar] [CrossRef]
Xu, H.; Zhang, A.; Xu, X.; Li, P.; Ji, Y. Prediction of Particulate Concentration Based on Correlation Analysis and a Bi-GRU Model. Int. J. Environ. Res. Public Health 2022, 19, 13266. [Google Scholar] [CrossRef]
Srinivasulu, M.; Maiti, S. RNDDNet: A residual nested dilated DenseNet based deep-learning model for chilli plant disease classification. Eng. Res. Express 2024, 6, 035204. [Google Scholar] [CrossRef]

Figure 1. Architectural view of the implemented tool-life prediction model in the milling process.

Figure 2. Diagram structure of multi-head cross-covariance attention fusion.

Figure 3. Illustration of Bi-GRU.

Figure 4. Pictorial representation of MCF-DD-BiGRU for tool-life estimation.

Figure 5. K-fold-based performance analysis of MCF-DD-BiGRU for tool wear prediction compared with other methods in terms of (a,f) accuracy, (b,g) MAE, (c,h) MAPE, (d,i) MSE, and (e,j) RMSE.

Figure 6. Performance analysis of the MCF-DD-BiGRU model for tool wear prediction compared with other networks in terms of (a,j) accuracy, (b,k) MAE, (c,l) MAPE, (d,m) MASE, (e,n) MPE, (f,o) MSE, (g,p) NMSE, (h,q) RMSE, and (i,r) SMAPE.

Figure 7. Contributions of features in the designed MCF-DD-BiGRU-based tool wear prediction compared with other networks in terms of accuracy for (a) Dataset 1 and (b) Dataset 2.

Table 1. Features and challenges of the existing tool-life prediction models using deep learning.

Author [Citation]	Methodology	Features	Challenges
Khan et al. [19]	LSTM	It efficiently preserves important details for a long time. Variable time series are effectively handled by this approach.	Memory consumption is high.
Elminir et al. [20]	LSTM-AE	It efficiently retrieves dynamic features by handling non-linear time series data.	It considers redundant data, which leads to high processing time. The computational complexity of the model is high when dealing with long data sequences.
Che et al. [21]	NCA-SMA-GRU	It retains and filters the most relevant features. Interpretability of the model is high.	The modeling time is high.
Shah et al. [22]	GAN and LSTM	It precisely identifies wavelet functions to generate feature vectors for precise tool-life estimation.	Computationally expensive, and it has training instability issues.
Wang et al. [23]	CNN-LSTM-PSO	It uses a multi-channel feature fusion mechanism to improve the accuracy of tool wear prediction. It helps in managing the spatial continuity of features.	The training time of the model is high.
Kamat et al. [24]	DeepTool	It extracts useful features from the sensor signals to accurately predict the lifetime of tools.	It suffers from overfitting issues.
Li et al. [25]	CSBLSTM-TSAM	It is capable of mining the temporal dependence of signal features. This technique solves particle degradation issues.	Predicting the lifetime of machines with curved parts is complex. The robustness of the model is affected by changing the parameters of the milling machine.
Kaliyannan et al. [26]	LSTM and FFNN	The learning process of the model is highly consistent, and it is capable of overcoming premature convergence issues.	It is ineffective in capturing time-based patterns.

Table 2. Hyperparameter search ranges and final selections.

Hyperparameter	Searched Range/Defaults	Final Selection
Learning Rate (LR)	[0.0001, 0.001, 0.01, 0.1]	0.01
Hidden Size (HN)	[64, 128, 256]	128
Number of Attention Heads	[4, 8, 16]	4
Dilation Rates	[1, 2, 3, 4]	4
Depth of Dense Blocks	[1, 2, 3]	1
Dropout Rate	[0.1, 0.2, 0.3, 0.4]	0.4
Early Stopping	Patience: 10, Monitor: ‘val_loss’	-
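
The search space in Table 2 can be written down as a small configuration sketch. The listing below is illustrative only: the dictionary keys are hypothetical names, the dilation entry is treated as a single rate (an assumption, since the table does not say whether it is one rate or a count), and the exhaustive grid enumeration is assumed because the table does not state which search strategy was used.

```python
from itertools import product

# Candidate values transcribed from Table 2; key names are illustrative.
SEARCH_SPACE = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "hidden_size": [64, 128, 256],
    "num_heads": [4, 8, 16],
    "dilation_rate": [1, 2, 3, 4],   # assumed to be a single rate per trial
    "dense_depth": [1, 2, 3],
    "dropout": [0.1, 0.2, 0.3, 0.4],
}

# Final values reported in the last column of Table 2.
FINAL_PICK = {
    "learning_rate": 0.01,
    "hidden_size": 128,
    "num_heads": 4,
    "dilation_rate": 4,
    "dense_depth": 1,
    "dropout": 0.4,
}

# Early-stopping settings from the last row of Table 2.
EARLY_STOPPING = {"monitor": "val_loss", "patience": 10}


def grid(space):
    """Yield every combination of the candidate values (exhaustive grid)."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(grid(SEARCH_SPACE))
n_configs = len(configs)
print(f"{n_configs} candidate configurations")
```

Under the exhaustive-grid assumption the space holds 4 × 3 × 3 × 4 × 3 × 4 = 1728 configurations, which is one reason a random or staged search might be preferred in practice.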

Table 3. Epoch-based comparative analysis of MCF-DD-BiGRU with other models for tool-life prediction.

Epoch	LSTM [19]	LSTM-AE [20]	GAN-LSTM [22]	CNN-LSTM [23]	MCF-DD-BiGRU
Dataset 1
MPE
10	6.125	4.964286	5.535714	4.446429	3.553571
20	5.946429	4.910714	5.375	4.25	3.267857
30	5.767857	4.625	5.089286	4.142857	3.196429
40	5.660714	4.5	5.053571	4.071429	3.160714
50	5.339286	4.482143	4.785714	3.964286	3.053571
60	5.267857	4.410714	4.821429	3.928571	3.107143
SMAPE
10	7	5.673469	6.326531	5.081633	4.061224
20	6.795918	5.612245	6.142857	4.857143	3.734694
30	6.591837	5.285714	5.816327	4.734694	3.653061
40	6.469388	5.142857	5.77551	4.653061	3.612245
50	6.102041	5.122449	5.469388	4.530612	3.489796
60	6.020408	5.040816	5.510204	4.489796	3.55102
RMSE
10	10.15868	9.336447	9.20071	8.728049	7.445795
20	9.839033	8.517013	9.440956	8.50815	7.509238
30	9.956536	8.595302	9.726698	8.399718	6.96534
40	9.622079	8.89281	9.053666	8.350308	7.148677
50	9.311087	8.314425	8.78561	7.559584	7.398947
60	9.612436	8.692587	9.057767	8.195201	7.015632
MASE
10	946.7651	855.762	862.3747	799.7338	777.4966
20	896.9576	816.0386	900.5572	806.3521	764.6227
30	862.6906	851.9967	862.5689	821.9569	773.2036
40	926.3732	823.3261	833.5765	794.319	784.6413
50	847.9551	793.5111	839.9428	753.1057	761.0283
60	905.8937	813.4717	848.5818	817.0405	728.18
MAE
10	4.335371	3.630108	3.732796	3.218234	2.466241
20	4.143531	3.278485	3.78853	3.09333	2.388016
30	4.224391	3.211893	3.803556	2.983011	2.105449
40	3.981789	3.275068	3.524696	2.942779	2.210008
50	3.714829	2.990979	3.312419	2.542821	2.249609
60	3.80338	3.174758	3.440739	2.809722	2.114432
MSE
10	10.31987	8.716924	8.465306	7.617883	5.543986
20	9.680657	7.253952	8.913166	7.238862	5.638866
30	9.913261	7.387922	9.460865	7.055527	4.851596
40	9.258441	7.908207	8.196887	6.972764	5.110358
50	8.669635	6.912967	7.718695	5.71473	5.474442
60	9.239892	7.556106	8.204314	6.716132	4.921909
NMSE
10	1.581359	1.335732	1.297176	1.167322	0.849529
20	1.48341	1.111555	1.365803	1.109243	0.864068
30	1.519053	1.132084	1.44973	1.081149	0.743431
40	1.418712	1.211809	1.256045	1.068467	0.783083
50	1.328486	1.059304	1.182769	0.875693	0.838873
60	1.415869	1.157855	1.257183	1.029142	0.754206
Accuracy
10	93.80197	94.81018	94.66337	95.3991	96.47411
20	94.07615	95.31295	94.58369	95.57759	96.58595
30	93.96064	95.40815	94.5622	95.7353	96.98992
40	94.30746	95.31776	94.96095	95.79282	96.84044
50	94.68905	95.72391	95.26436	96.36463	96.78382
60	94.56253	95.46126	95.08091	95.98305	96.97708
Dataset 2
MPE
10	6.374607	4.80063	5.69255	4.223505	3.462749
20	5.954879	4.80063	5.403987	4.354669	3.331584
30	5.849948	4.748164	5.062959	4.013641	3.147954
40	5.797482	4.643232	4.879328	3.83001	3.279119
50	5.377754	4.459601	4.669465	3.934942	3.043022
60	5.272823	4.302204	4.80063	3.856243	3.016789
SMAPE
10	7.285265	5.486434	6.505771	4.826863	3.957428
20	6.805576	5.486434	6.175986	4.976765	3.807525
30	6.685654	5.426473	5.786239	4.587018	3.597662
40	6.625693	5.306551	5.576375	4.377155	3.747564
50	6.146005	5.096687	5.336531	4.497077	3.477739
60	6.026083	4.916804	5.486434	4.407135	3.447759
RMSE
10	7.431344	6.643252	6.885293	5.778184	5.069942
20	6.888637	6.482618	6.799102	6.142759	5.406138
30	7.038907	6.496948	6.359537	5.739587	5.232892
40	7.207774	6.395633	6.332227	5.382166	5.42488
50	7.078538	6.345149	6.112087	5.724741	5.02194
60	6.733888	6.119597	6.361739	5.760822	5.054418
MASE
10	449.2075	382.5456	424.1658	330.4695	291.034
20	420.022	380.3023	395.9337	359.679	301.1804
30	410.6396	382.0612	381.0258	327.0288	292.1369
40	421.7488	380.1668	378.3147	301.1708	311.6336
50	410.4199	364.2985	370.3846	325.5384	276.8786
60	398.2375	358.1721	375.7199	333.3341	282.7617
MAE
10	3.277164	2.535518	2.88323	2.055218	1.606044
20	2.919401	2.45275	2.749124	2.208157	1.709625
30	2.973501	2.487324	2.463018	1.97411	1.636935
40	3.065401	2.422543	2.407664	1.785126	1.687677
50	2.879411	2.341971	2.300087	1.987604	1.546472
60	2.69769	2.230625	2.390433	1.978993	1.538482
MSE
10	0.552249	0.441328	0.474073	0.333874	0.257043
20	0.474533	0.420243	0.462278	0.377335	0.292263
30	0.495462	0.422103	0.404437	0.329429	0.273832
40	0.51952	0.409041	0.400971	0.289677	0.294293
50	0.501057	0.402609	0.373576	0.327727	0.252199
60	0.453452	0.374495	0.404717	0.331871	0.255471
NMSE
10	1.632492	1.304601	1.401397	0.986959	0.75984
20	1.402759	1.242274	1.366531	1.115433	0.863954
30	1.464627	1.247772	1.195549	0.973818	0.809468
40	1.535743	1.209159	1.185303	0.856309	0.869955
50	1.481165	1.190146	1.104321	0.968786	0.74552
60	1.340442	1.107037	1.196377	0.981037	0.755194
Accuracy
10	93.52122	94.98741	94.3	95.93694	96.82494
20	94.22849	95.15104	94.56513	95.63459	96.62016
30	94.12154	95.08269	95.13074	96.09729	96.76386
40	93.93986	95.21076	95.24017	96.4709	96.66355
50	94.30755	95.37004	95.45285	96.07061	96.94271
60	94.66681	95.59017	95.27424	96.08763	96.9585
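
For readers who want to reproduce metrics of the kind listed in Table 3, the following is a minimal sketch using common textbook definitions. These are assumptions rather than the article's own formulas: the tabulated RMSE values are not the square roots of the tabulated MSE values, so the article presumably uses a different scaling or definition for at least some metrics. The normalizations chosen here for NMSE (population variance of the targets) and MASE (one-step naive forecast) are one convention among several, and the sample values are hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


def mpe(y_true, y_pred):
    """Signed mean percentage error in percent; targets must be non-zero."""
    return 100.0 * sum((t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error in percent."""
    terms = (2.0 * abs(t - p) / (abs(t) + abs(p)) for t, p in zip(y_true, y_pred))
    return 100.0 * sum(terms) / len(y_true)


def nmse(y_true, y_pred):
    """MSE divided by the population variance of the targets (assumed convention)."""
    mean_t = sum(y_true) / len(y_true)
    var_t = sum((t - mean_t) ** 2 for t in y_true) / len(y_true)
    return mse(y_true, y_pred) / var_t


def mase(y_true, y_pred):
    """MAE scaled by the MAE of a one-step naive forecast (assumed convention)."""
    naive = sum(abs(y_true[i] - y_true[i - 1]) for i in range(1, len(y_true)))
    return mae(y_true, y_pred) / (naive / (len(y_true) - 1))


# Hypothetical tool-wear targets and predictions, for illustration only.
wear_true = [10.0, 12.0, 15.0, 20.0]
wear_pred = [11.0, 11.0, 16.0, 18.0]

METRICS = {"MAE": mae, "MSE": mse, "RMSE": rmse, "MPE": mpe,
           "SMAPE": smape, "NMSE": nmse, "MASE": mase}
scores = {name: fn(wear_true, wear_pred) for name, fn in METRICS.items()}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```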

Table 4. Ablation study of the designed MCF-DD-BiGRU.

Terms	GRU	Bi-GRU	Dilated Bi-GRU	Dense Bi-GRU	Dilated Dense Bi-GRU	MCF-DD-BiGRU
Dataset 1
Accuracy (%)	95.83	96.06	96.23	96.11	96.19	96.52
Dataset 2
Accuracy (%)	95.79	95.96	95.8	96.33	96.43	96.49

Table 5. Computational complexity analysis of the designed MCF-DD-BiGRU.

Terms	LSTM-AE [20]	NCA-SMA-GRU [21]	CNN-LSTM-PSO [23]	CSBLSTM-TSAM [25]	MCF-DD-BiGRU
Dataset 1
Training Time	43.64649961	42.29822822	41.47202515	40.18363694	35.28433801
Testing Time	11.86451866	11.91198138	11.49771959	11.41539469	10.5320641
Computational Time	55.51101827	54.2102096	52.96974474	51.59903163	45.81640211
Computational Space	224	226	216	205	205
Dataset 2
Training Time	41.36714968	42.06272651	39.28451891	38.60131932	37.6526821
Testing Time	11.51236742	11.50162516	11.44002281	10.90986397	10.0094979
Computational Time	52.87951711	53.56435167	50.72454172	49.51118329	47.66218
Computational Space	209	211	205	202	200

All the time parameters are in minutes, and the computational space is in kilobytes.
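
The "Computational Time" row in Table 5 appears to be the sum of the training and testing times. The short check below, using the Dataset 1 values transcribed from the table, confirms this and derives the relative time saving of MCF-DD-BiGRU against each baseline; the variable and function names are illustrative.

```python
# (training, testing, reported total) in minutes, transcribed from Table 5, Dataset 1.
TABLE5_DATASET1 = {
    "LSTM-AE": (43.64649961, 11.86451866, 55.51101827),
    "NCA-SMA-GRU": (42.29822822, 11.91198138, 54.2102096),
    "CNN-LSTM-PSO": (41.47202515, 11.49771959, 52.96974474),
    "CSBLSTM-TSAM": (40.18363694, 11.41539469, 51.59903163),
    "MCF-DD-BiGRU": (35.28433801, 10.5320641, 45.81640211),
}


def total_matches(train, test, total, tol=1e-6):
    """True if the reported total equals training plus testing time."""
    return abs(train + test - total) < tol


def relative_saving(baseline_total, proposed_total):
    """Fraction of the baseline's total time saved by the proposed model."""
    return (baseline_total - proposed_total) / baseline_total


consistent = all(total_matches(*row) for row in TABLE5_DATASET1.values())
proposed_total = TABLE5_DATASET1["MCF-DD-BiGRU"][2]
savings = {
    name: relative_saving(row[2], proposed_total)
    for name, row in TABLE5_DATASET1.items()
    if name != "MCF-DD-BiGRU"
}

print("totals consistent:", consistent)
for name, saving in savings.items():
    print(f"saving vs {name}: {100 * saving:.1f}%")
```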

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Alkhalefah, H. Tool-Life Estimation Model in Milling Processes Using Multi-Head Cross-Covariance Attention Fusion-Based Dilated Dense Bi-Directional Gated Recurrent Unit. Mathematics 2025, 13, 3798. https://doi.org/10.3390/math13233798

