1. Introduction
Milling machines represent the cornerstone of modern manufacturing infrastructure, serving as critical enablers in precision-dependent industries including aerospace, automotive, defense, and medical device manufacturing [
1]. These sophisticated machine tools facilitate the complex shaping of metallic and composite materials through material removal processes, where their operational efficiency directly correlates with product quality, production throughput, and overall manufacturing competitiveness [
2].
The operational environment of milling machine cutting tools constitutes one of the most challenging scenarios in manufacturing engineering [
3]. Cutting tools are subjected to extreme thermo-mechanical stresses during metal removal operations, experiencing complex interactions of cutting forces, high temperatures at the tool–work-piece interface, and cyclic mechanical loading that induces progressive wear mechanisms [
4]. These harsh operating conditions trigger multiple wear phenomena including abrasive wear, adhesive wear, and thermal cracking, collectively contributing to the gradual degradation of cutting tool integrity and performance [
5].
The economic implications of cutting tool performance extend far beyond simple replacement costs [
6]. Modern cutting tools and the work-piece materials being machined, often expensive aerospace alloys or sophisticated composites, carry substantial value. As shown in
Figure 1, industry studies indicate that tool wear-related issues account for approximately 15–20% of total machine tool downtime in manufacturing facilities. In addition, unexpected tool changes contribute to 3–12% of total manufacturing costs in precision machining operations [
7]. This suggests that nearly 1/5 of all downtime could potentially be mitigated with better tool wear monitoring, predictive maintenance, or improved cutting tool materials/coatings. Also, it presents a major cost driver, especially in high-precision environments where tolerances are tight, and downtime or scrap from sudden tool failure can be expensive.
Beyond direct economic impacts, tool degradation manifests in multiple detrimental effects on manufacturing outcomes. Progressive tool wear directly influences surface finish quality, dimensional accuracy, and geometrical tolerances of machined components [
8]. Furthermore, worn tools significantly increase energy consumption due to increased cutting forces and reduced machining efficiency [
9]. In severe cases, undetected tool wear can progress to catastrophic tool failure, potentially causing collateral destruction to the machine tool components and consequently to the work-piece [
10].
The emergence of Prognostics and Health Management (PHM) as a systematic engineering discipline has transformed approaches to manufacturing equipment maintenance and reliability [
11]. PHM represents a paradigm shift from traditional maintenance strategies toward predictive methodologies that enable condition-based maintenance decisions. In milling operations, PHM systems aim to predict accurately the Remaining Useful Life (RUL) of cutting tools, thereby optimizing tool utilization and minimizing unplanned downtime [
12].
Considering the open literature, two principal methodological frameworks dominate PHM implementation for tool condition monitoring. Model-based approaches leverage physical principles and mathematical models to simulate tool degradation processes [
13]. While offering valuable insights into fundamental wear mechanisms, these methods often struggle with the complex, nonlinear nature of tool wear progression under varying operational conditions [
14]. Data-driven approaches utilize operative data recorded through sensor systems to model tool health without explicit physical modeling [
15]. The proliferation of advanced sensor technologies has enabled comprehensive data collection during machining operations. Data-driven methods excel at capturing complex patterns from multivariate sensor data and adapting to specific operational contexts [
16]. Within the data-driven paradigm, machine learning and deep learning techniques have demonstrated significant abilities in tool wear prediction [
17]. As shown in
Table 1, traditional algorithms of machine learning such as Support Vector Machines (SVM) and Artificial Neural Networks (ANN) have been extensively applied, but these approaches typically require significant manual feature engineering and may lack temporal modeling capabilities [
18]. This table summarizes key data-driven approaches for tool wear prediction, including model types, key advantages, limitations, and reported performance metrics (e.g., RMSE, MAE). It highlights the strengths and limitations of each method, providing a concise overview of existing techniques and their applicability to industrial CNC machining scenarios. The introduction of deep learning has presented more sophisticated architectures specifically designed for sequential data processing. Among these, Long Short-Term Memory (LSTM) networks have developed as particularly suitable for tool wear prediction tanks to their inherent capacity to capture long-term dependences in time-series data [
19]. The evolution of LSTM architectures has produced several variants with distinct characteristics, including Bidirectional LSTM (BiLSTM), Stacked LSTM, and Gated Recurrent Units (GRU) [
20].
Despite the growing adoption of LSTM-based architectures in manufacturing prognostics, the research landscape lacks comprehensive comparative studies that systematically evaluate different LSTM variants under consistent experimental conditions [
31]. This research gap is particularly significant, given the computational resource constraints and accuracy demands in industrial implementations [
32].
This study addresses, firstly, this critical research gap by conducting an exhaustive comparative analysis of three fundamental LSTM architectures (single-layer LSTM, Stacked LSTM, and Bidirectional LSTM) for tool wear degradation estimation and RUL prediction. The research employs comprehensive experimental data acquired from industrial-grade CNC milling machines, ensuring practical relevance and industrial applicability. Hence, experimental results were performed in real industrial conditions such as high-speed milling operations and variable loads. Secondly, in order to enhance predictions, we propose a new strategy using the ensemble-LSTM model. Experimental results approve that the proposed approach is accurate and flexible to be implemented in real industrial PHM applications.
The rest of this paper is prearranged as follows:
Section 2 presents the theoretical foundations of LSTM architectures and describes the implemented models.
Section 3 details the experimental setup, data acquisition methodology, and pre-processing techniques.
Section 4 deliberates the experimental results and comparative analysis, while
Section 5 concludes with key conclusions and future research guidelines.
3. Methods
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network designed to capture long-term dependencies in sequential data through internal memory mechanisms. Unlike traditional machine learning methods that rely on handcrafted features and struggle with temporal patterns, LSTMs learn these relationships directly from raw sequences by maintaining information over time [
35]. In time-series forecasting, LSTM networks are widely used because they can model complex temporal dependencies more effectively than other neural networks such as Feedforward Neural Networks (FNNs) and Convolutional Neural Networks (CNNs), which mainly capture static or spatial features. LSTMs use recurrent connections and gated memory units to selectively retain or discard information, allowing them to handle long-term patterns and avoid issues like vanishing gradients [
36]. Although other architectures such as standard Gated Recurrent Unit (GRUs) and Temporal Convolutional Networks (TCNs) offer efficiency advantages, LSTMs often perform better on datasets with long-term dependencies, irregular intervals, or noisy behavior [
37].
An LSTM unit is the basic component of an LSTM network and controls how information flows from one time step to the next. As shown in
Figure 3, each unit includes a cell state with three gates: the input gate, the forget gate, and the output gate. The cell state serves as a memory pathway that preserves long-term information, while the gates regulate what to add, what to forget, and what to pass on to the next hidden state. Using sigmoid activations, these gates filter information between 0 and 1, allowing the LSTM to selectively remember or discard past data and reducing the vanishing gradient problem found in standard recurrent models.
More specifically, the input gate (
) controls the total information to be integrated from the present input of the sequence, as described in Equation (1). Similarly, the cell information (
) is updated using the hyperbolic tangent activation function defined by Equation (2). The forget gate (
ft) fixes the amount of information to forget from the earlier state, as specified in Equation (3). The update gate (
) associates a fraction of the old cell information, updated by the input gate according to Equation (4), with the information unable to be remembered by the forget gate. Finally, the output gate (
) produces the new long-term memory (
) and computes the output of the present state (
), as expressed in Equations (5) and (6) [
38].
Recent research has proposed several architectural variants of the LSTM network to enhance its ability to model complex temporal dependencies in time-series data. While an LSTM network uses a single recurrent layer of LSTM cells in one (forward) time direction, modern applications often employ deeper or more complex topologies to better model temporal dynamics. The Bidirectional LSTM (BiLSTM) extends the conventional unidirectional LSTM by processing input sequences in both forward and backward directions, thereby capturing contextual information from both past and future time steps within a given window. This bidirectional structure enables the network to utilize both past and future contextual information and it has been shown to improve forecasting accuracy in domains such as financial and energy load prediction, where temporal dependencies are not strictly causal [
39,
40]. The Stacked LSTM, on the other hand, consists of multiple LSTM layers stacked hierarchically, allowing the model to learn representations at different temporal scales. Deeper LSTM structures have demonstrated superior performance in capturing both short- and long-term dynamics in complex sequences [
41,
42]. More recently, the Stacked Bidirectional LSTM (Stacked BiLSTM) integrates both depth and bidirectionality by stacking multiple BiLSTM layers, enabling the extraction of high-level, bidirectional temporal features. This architecture has been successfully applied in short-term load and traffic forecasting, achieving enhanced predictive accuracy compared to single-layer or unidirectional models [
40,
42]. Collectively, these LSTM variants provide flexible and powerful frameworks for time-series forecasting, offering varying trade-offs between modeling complexity, interpretability, and computational efficiency.
Overall, the aforementioned LSTM variants, namely the Bidirectional LSTM (BiLSTM), Stacked LSTM, and Stacked Bidirectional LSTM (Stacked BiLSTM), have been widely recommended in the literature for their ability to capture complex temporal dynamics in time-series forecasting tasks [
39,
42]. Each variant offers distinct advantages: BiLSTM effectively leverages bidirectional temporal dependencies, Stacked LSTM enhances feature abstraction through deeper layers, and Stacked BiLSTM combines both mechanisms to achieve richer temporal representations. However, no single configuration consistently outperforms others across all datasets and forecasting horizons, as performance often depends on the specific characteristics of the time series, such as nonlinearity, periodicity, and noise level [
43,
44]. Therefore, to leverage the complementary strengths of these architectures and improve predictive robustness, we propose a novel combination of ensemble LSTM models. This ensemble framework integrates multiple LSTM-based learners, allowing their outputs to be adaptively combined to minimize generalization error and enhance forecasting accuracy. By fusing the predictive capabilities of different LSTM architectures, the proposed model aims to achieve superior performance and stability across diverse time-series forecasting scenarios.
In this paper, we propose to use four different methods to identify, each time, the accurate prediction between LSTM, BiLSTM, Stacked LSTM, and Stacked BiLSTM. These methods are summarized in
Table 5. To fully leverage the complementary strengths of different LSTM architectures, we implement an ensemble framework combining LSTM, BiLSTM, Stacked LSTM, and Stacked BiLSTM models. Individually, each architecture captures distinct temporal patterns: LSTM models long-term dependencies; BiLSTM incorporates both past and future context; Stacked LSTM extracts hierarchical features across multiple layers; and Stacked BiLSTM combines depth and bidirectionality for multi-scale temporal representations. By integrating these models, the ensemble adaptively selects the most reliable prediction at each time step, reducing the impact of inaccurate forecasts from any single model. This approach enhances predictive accuracy, stability, and robustness, making it particularly effective for complex and highly variable noisy time-series datasets.
Table 5 summarizes four established methods used for evaluating and selecting LSTM model predictions. These benchmark methods serve as reference approaches for combining and/or selecting predictions from multiple models. While these methods are not part of the proposed framework, they provide a useful basis for comparison and highlight the advantages of our ensemble-LSTM approach. Specifically, the proposed ensemble-LSTM model leverages a GRU-based meta-learner to integrate complementary outputs from multiple LSTM architectures, offering improved robustness, predictive accuracy, and adaptability compared to conventional selection or aggregation techniques.
In situations where multiple forecasting models produce predictions for the same time instant, and no single model consistently outperforms the others, instantaneous prediction selection or aggregation methods presented in
Table 5 can be applied using only the predicted values. Let us consider four models producing predictions y1(t), y2(t), y3(t), y4(t), at a given time step. A simple and robust approach is the median rule, which selects the median of the four predictions as the final forecast. For instance, if y1(t) = 100, y2(t) = 105, y3(t) = 98, y4(t) = 150, the median is 102.5, effectively mitigating the impact of the outlier (y4(t)). A related approach is the closest-to-consensus method, where the model whose prediction is nearest to the mean or median of the predictions is selected. Using the same example, the mean is (100 + 105 + 98 + 150)/4 = 113.25 and the prediction closest to the mean is y2(t) = 105. The winner-take-all strategy with short-term reference selects the model with minimal recent forecasting error; for example, if the last observed value was 102, the prediction closest to 102 is y1(t) = 100. Finally, the instantaneous weighted voting (prediction pooling) approach assigns dynamic weights based on the relative agreement among predictions. Predictions near the cluster center receive higher weight, while outliers are down-weighted. In our example, ignoring the outlier y4(t) = 150 and averaging the remaining three predictions gives a pooled forecast of (100 + 105 + 98)/3 ≈ 101. For better clarification and visualization, this numerical example is well illustrated in
Figure 4. We propose, in this work, to use the mean of the previous four selections (median method result = 102.5, closest-to-mean result = 105, winner-take-all result = 100, and instantaneous weighted pooling result = 101) that is equal to 102.125. These methods have been widely discussed in the literature and provide a systematic way to improve forecast accuracy when multiple models are available without relying on historical performance or external meta-features [
45,
46,
47,
48]. In the next section, we will present and discuss the accuracy of the proposed ensemble-LSTM predictions’ selection strategy.
4. Experimental Results and Discussion
Experimental data were collected from a high-speed CNC machine under dry milling conditions using a three-flute ball nose tungsten carbide cutter on stainless steel (HRC 52). The cutting parameters were as follows: spindle speed 10,400 rpm, feed rate 1555 mm/min (X direction), radial depth of cut 0.125 mm (Y direction), and axial depth of cut 0.2 mm (Z direction). The PHM’2010 dataset was acquired at 50 kHz/channel. Despite that it consists of six individual cutters, the corresponding tool wear measurements are available only for three of them. Consequently, only these three cutter records, namely C1, C4, and C6, can be used to evaluate the proposed PHM approach. For each run, time-series sensor signals with lengths exceeding 200,000 time steps were recorded, resulting in a total of 315 runs per cutter. Such long sequences are adequate to evaluate predictive models.
Considering the mechanical product process, each pass contains approximately 200,000 measurements from various sensors: cutting force components (Fx, Fy, Fz), vibration velocities (Vx, Vy, Vz), and the Acoustic Emission Root Mean Square (AERMS) signal. A total of 315 passes per cutter is realized. For each pass we compute the minimum, the average, and the maximum values, considering vibrations and the acoustic emissions. Hence, for each sensor measurement (Fx, Fy, Fz, Vx, Vy, Vz, AERMS), three statistical metrics are computed, considering that each milling cutter has three flutes and wear is recorded after every cut for each one. The cutter is deemed end-of-life when the wear on any single flute exceeds the specified limit. For this reason, the maximum wear among the three flutes is retained as the representative measure of tool degradation.
Figure 5 shows the wear evolution of the cutter C1 in the three flutes and the maximum wear of them. The maximal accepted tool wear is 0.165 mm to standardize the condition monitoring of milling tools and to facilitate the development and evaluation of predictive models for tool wear and RUL. This threshold is intentionally set lower than the typical industrial standards that suggest for any CNC tool to be replaced when flank wear (VB) reaches approximately 0.3–0.6 mm.
In order to evaluate the prediction capabilities of the different proposed LSTM models, two scenarios are proposed in this paper based on cutter C6:
- ✓
Without data pre-processing original recoded data;
- ✓
With normalized data.
Figure 6 presents the prediction degradation results of milling tool wear using LSTM, BiLSTM, Stacked LSTM, and Stacked BiLSTM models. The cutters C1 and C4 are used for training the models using original recorder signals and the cutter C6 was used for testing. All models closely follow the trend of the real wear, demonstrating their capability to capture the tool degradation pattern, even if they do not fully converge to the exact measured values. In many previously published studies, an 80/20 random split of data from the same cutting tool run-to-failure history is commonly adopted, with 80% of the data used for training and the remaining 20% for testing. However, for time-series forecasting problems, such a strategy may introduce data leakage as degradation patterns from the same cutting tool can appear in both subsets. To provide a more realistic evaluation of model generalizability, a cutting-tool-level split (i.e., a leave-one-case-out assessment) was employed in this study. Specifically, the complete run-to-failure histories of cases C1 and C4 were used for model training, while the full run-to-failure history of case C6 was reserved exclusively for testing.
Graphical analysis reveals that the Stacked BiLSTM model delivers the best performance in both the overall curve trend and prediction accuracy. The Stacked LSTM model also demonstrates strong trend-capturing ability, closely following the actual data pattern but with slightly reduced accuracy compared to the Stacked BiLSTM. The BiLSTM model performs moderately well, while the standard LSTM exhibits the lowest accuracy among the tested models.
Training LSTM-based architectures on raw (unprocessed data) negatively affects model performance. Variations in the scales of input features (Fx, Fy, Fz, Vx, Vy, Vz, AERMS) and the presence of outliers can slow down the learning process, leading to unstable convergence. As a result, models trained on unprocessed data often exhibit slower convergence rates, limited generalization ability, and a higher risk of overfitting. These findings highlight the importance of data pre-processing, including normalization and outlier mitigation, to enhance the robustness and accuracy of deep learning models in time-series prediction tasks. Thereby, we propose to use the Hamming filtering before the computation of the normalized features (minimum, average, and the maximum) based on vibrations and acoustic emissions. Without normalization or standardization, the model underfits smaller-scale variables. This imbalance leads to slower convergence, reduced generalization, and an increased risk of overfitting. Conversely, applying pre-processing techniques such as min-max normalization and outlier removal improves feature consistency, resulting in faster training and more accurate predictions. As shown in
Figure 7, experimental results indicate that pre-processing plays a crucial role in improving the quality and learnability of vibration and acoustic data. By normalizing feature scales, filtering noise, and removing outliers, pre-processing enhances the consistency and reliability of the input signals. These steps help LSTM models to converge more efficiently and produce more stable and accurate predictions. These steps guarantee that features with different physical units, such as forces (Fx, Fy, Fz), velocities (Vx, Vy, Vz), and acoustic emission energy (AERMS) contribute equally to the learning process. Additionally, Hamming filtering noises suppress irrelevant fluctuations while preserving critical features, improving signal-to-noise ratio and model generalization. Overall, pre-processing reduces computational complexity, minimizes the risk of overfitting, and enhances the interpretability of model outputs. As shown in
Figure 7 and
Table 6, these pre-processing steps enhance the signal-to-noise ratio and contribute to a more robust and accurate predictive LSTM model.
Data pre-processing further enhances model robustness, emphasizing the importance of input standardization and outlier control in deep learning applications involving sequential data. Overall, the Stacked BiLSTM model demonstrates the most reliable and accurate performance for time-series prediction compared to other models. However, predictions for the same time instant show that Stacked BiLSTM predictions cannot outperform other models each time instant. For example, as shown in
Table 7, Stacked outperforms other models at the cutting number 47 and BiLSTM outperforms other models at the cutting number 250. Thereby, instantaneous prediction selection or aggregation methods can be applied using only the predicted values.
More truthful results can be realized through the use of the proposed ensemble framework, which combines the result of each single model to enhance accuracy. The prediction result of the ensemble framework is usually superior to the prediction of a single model [
49]. For this, we propose in this study a Gated Recurrent Unit (GRU)-based ensemble framework designed to enhance predictive accuracy by integrating the outputs of multiple LSTM-based models. Specifically, predictions generated by four distinct LSTM algorithms with the five selected forecasts (see
Figure 4) are used as sequential input features to the GRU network, which functions as a meta-learner. The GRU architecture is particularly well-suited for this task due to its ability to capture temporal dependencies and nonlinear interactions among input signals while mitigating issues related to vanishing gradients. By learning complex dynamic relationships among the individual model predictions, the proposed GRU ensemble aims to produce a refined meta-prediction that more closely approximates the true target values than any single constituent model. When using a recurrent neural network such as a GRU for forecasting, the raw predictions may fluctuate due to noise or overfitting, sometimes producing a non-monotonic sequence. Hence, the GRU is adjusted by selecting the maximum value between two successive predictions. This is can be explicated physically by the monotonic behavior of the wear tool degradation. As shown in
Figure 8 and
Table 8, this approach leverages the complementary strengths and compensates for the weaknesses of the individual predictors, thereby achieving improved robustness, stability, and overall predictive performance.
The proposed ensemble-LSTM models’ combination exhibits notable superiority over conventional single-model and traditional ensemble forecasting approaches. By employing the GRU network as a meta-learner to integrate the predictions of multiple LSTM-based models, the framework effectively captures both nonlinear and temporal dependencies among the constituent predictors. This hierarchical learning structure allows the ensemble to exploit complementary features while mitigating the individual weaknesses of each base model, resulting in enhanced generalization capability and reduced predictive uncertainty. Moreover, the incorporation of a monotonic adjustment mechanism ensures that the final forecasts adhere to the physical characteristics of the tool wear process, thereby maintaining consistency with the underlying degradation dynamics. Comparative experimental analyses with some previous works (using the same dataset and the same C6 experiment for an expressive comparison) demonstrate that the proposed ensemble consistently achieves lower prediction errors, smoother degradation trajectories, and higher robustness against noise and outliers compared to the single LSTM predictive model. These results given in
Table 9 confirm the efficacy and reliability of the proposed method as a superior predictive modeling approach for complex, temporally evolving systems.
Based on the experimental plan and results presented in this study, several important conclusions can be drawn from both the experimental evaluation and the predictive performance of the proposed framework. From an experimental perspective, the comparative analysis of individual LSTM-based architectures demonstrates that no single recurrent model is universally optimal under all machining conditions. Variations in cutting parameters, tool work-piece interactions, and sensor noise significantly affect model behavior, highlighting the necessity of systematic benchmarking under realistic industrial conditions. The tool wear behavior in CNC milling is highly nonlinear and sensitive to variations in cutting conditions, sensor noise, and operational dynamics. The experimental results confirm that deeper and bidirectional architectures improve temporal feature extraction but may suffer from overfitting or reduced robustness when operating conditions deviate from the training distribution. From a predictive standpoint, the ensemble-LSTM framework consistently delivers superior accuracy and stability by integrating complementary temporal representations through a GRU-based meta-learner. This predictive improvement is particularly evident in its ability to track progressive tool wear trends and maintain reliable RUL estimates under unseen operating scenarios, which is critical for deployment in real-world CNC machining environments.
Despite these promising results, several open challenges remain for CNC cutting tool RUL prediction in the context of Industry 4.0 (automation and digitalization of factories (IoT, Big Data) for efficiency and competitiveness) and the emerging paradigm of Industry 5.0 (human-centered dimension, promoting human–machine collaboration (cobots) for sustainable mass personalization, while integrating resilience and environmental and social sustainability). Modern smart manufacturing systems are characterized by highly heterogeneous data sources, including multi-modal sensor streams, cyber–physical systems, digital twins, and human-in-the-loop decision-making. Accurately modeling tool degradation in such environments requires not only higher predictive accuracy but also improved model interpretability, adaptability, and real-time responsiveness. Furthermore, frequent changes in machining tasks, materials, and tool geometries introduce domain shifts that challenge the generalization capability of data-driven models. Future work should therefore focus on incorporating transfer learning, online learning, and domain adaptation techniques to enable continuous model updating without costly retraining. In addition, integrating physics-informed constraints, uncertainty quantification, and explainable Artificial Intelligence (AI) mechanisms will be essential to enhance trustworthiness and decision support for human operators, aligning with the human-centric goals of Industry 5.0. Ultimately, addressing these challenges will enable the development of intelligent, resilient, and scalable PHM systems capable of supporting autonomous and collaborative CNC machining operations in next-generation smart factories.