1. Introduction
Cutting tools are core components of CNC machine tools, but they are highly susceptible to wear during actual production processes. Statistics indicate that strategic tool replacement can reduce downtime by up to 75%, enhance production efficiency by 10–60%, and reduce production costs by 10–40% [1]. Tool wear severely impacts machining accuracy and production efficiency and can even pose safety risks [2,3]. Therefore, tool remaining useful life (RUL) prediction has emerged as a critical imperative in the intelligent manufacturing industry [4]. As a new intelligent diagnostic method that integrates computer technology with the concept of intelligent manufacturing, tool RUL prediction technology not only advances traditional tool management models but also provides key technological support for the manufacturing sector's transition towards Industry 4.0 [5]. It can assist in making accurate tool replacement decisions, avoiding resource waste and downtime losses [6].
Driven by their robust capacity to map complex, highly nonlinear relationships between raw sensory signals and physical wear states, data-driven machine learning (ML) techniques have been extensively leveraged for tool RUL prediction. Frequently employed algorithms include support vector machines and their variants [7,8], probabilistic models [9], decision trees [10], shallow neural networks [2,11], and Gaussian process regression (GPR) [12], among others. Owing to their minimal data dependency and low computational overhead, these shallow architectures retain considerable research significance in resource-constrained scenarios. However, with the increasing volume of industrial data and the advancement of computational technologies, shallow models struggle to meet current accuracy demands, often reaching performance saturation once data dimensionality surpasses a certain threshold because of limitations in their expressive power [13]. Furthermore, their inherent reliance on manual feature engineering becomes a scalability bottleneck: extracting hand-crafted features from exponentially growing monitoring data introduces substantial industrial costs.
Deep learning techniques have opened new pathways for addressing the aforementioned bottlenecks. This paradigm shift from traditional methods to data-driven deep learning architectures is a prevalent trend across the entire field of prognostics and health management [14]. Currently, deep learning-based industrial tool life prediction focuses on single convolutional neural networks (CNNs), traditional algorithmic models, recurrent neural networks (RNNs) and their variants, or combinations of these models [15,16]. Such wear prediction models provide a foundational basis for estimating RUL by mapping sensor signals to wear degradation states. CNNs are widely employed in tool wear monitoring and prediction owing to their remarkable efficacy in extracting spatial features from diverse data sources, including sensor signals and wear images [17,18]. For example, to address the complex demands of both short-term monitoring and long-term prediction, the MSFnet designed by Quan et al. integrates multi-scale residual modules and parallel spatiotemporal fusion modules [19]. Furthermore, when processing multi-view images under challenging conditions such as imbalanced datasets, the CNN-based binary classification model developed by García-Pérez et al. effectively combines data augmentation and class-weighting techniques, achieving an accuracy of 97.8% across various insert types [20]. However, restricted by local receptive fields, CNNs struggle to capture the long-term temporal dependencies essential for tracking wear development [21,22]. To overcome this, RNNs and their variants (e.g., LSTM, GRU) are widely adopted to model long-term dynamics in sequential signals, providing a foundation for RUL prediction [23]. For instance, to extract temporal dependencies from cutting forces, the SVD-BiLSTM model developed by Wu et al. integrates Hankel matrix reconstruction with singular value decomposition [24]. Similarly, to enhance wear-state monitoring as a precursor to life-cycle forecasting, the hybrid stacked-LSTM model introduced by Cai et al. fuses multi-sensor features with process information [25]. However, the recursive nature of RNNs precludes parallel computation, incurring high training latency. Moreover, their susceptibility to vanishing gradients over extended sequences hampers the capture of cross-cycle evolution and weak-cycle wear features essential for precise RUL estimation [26].
Beyond the architectural limitations of individual models, most existing deep learning approaches fuse multi-sensor signals, such as cutting force, vibration, and acoustic emission (AE), through rudimentary concatenation or linear weighting at the input or feature level [27]. From a tribological perspective, tool degradation is a multi-scale process in which steady mechanical loading coexists with stochastic microscopic transients. Assigning uniform significance to different modalities neglects their distinct physical characteristics and temporal dynamics, producing a scale mismatch that limits the model's ability to discriminate stage-specific degradation signatures.
To transcend the aforementioned limitations of CNNs and RNNs, Transformer models have emerged, demonstrating enhanced sequence modeling capability: their self-attention mechanism enables global parallel computation and explicit positional encoding, rendering them highly suitable for large-scale industrial datasets [28]. Their effectiveness has been validated in medical imaging [29], bioinformatics [30], and wind speed forecasting [31]. However, in tool RUL prediction, the global attention paradigm of Transformers presents inherent limitations. Tool wear signals exhibit local burstiness, strong non-stationarity, and substantial noise interference, whereas standard Transformers primarily emphasize long-range dependencies and lack mechanisms to effectively capture localized transient anomalies. Moreover, conventional Transformer architectures do not provide adaptive cross-modal fusion strategies, limiting their ability to dynamically adjust attention weights under varying machining conditions. Consequently, a unified modeling framework is required to simultaneously disentangle frequency-specific wear signatures and achieve adaptive multi-scale alignment across heterogeneous modalities.
Motivated by these challenges, a novel tool RUL prediction method combining a multi-channel CNN with a Transformer model for cross-modal feature fusion is introduced. The primary contributions of this work include: (i) Proposing a unified architecture that bridges short-term impulsive features and long-term degradation trends by dynamically integrating transient physical shocks into global degradation representations. This enables effective modeling of non-stationary tool wear evolution and provides a reliable foundation for accurate RUL estimation. (ii) Designing a multi-scale convolution scheme to construct a denoising mechanism based on the capture of frequency-specific wear characteristics, providing a feature foundation for the next layer of the network architecture. (iii) Developing an adaptive cross-modal fusion mechanism to dynamically weigh heterogeneous sensor inputs based on their distinct signal characteristics. Physically robust force signals serve as the Query to anchor the steady-state degradation trend, while sensitive vibration and AE features act as Key and Value to refine predictions with local anomaly information.
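The Query/Key/Value roles described in contribution (iii) can be sketched as single-head scaled dot-product cross-attention, with force features acting as the Query and vibration/AE features as Key and Value. This is a minimal illustration only, not the paper's actual layer: the feature dimensions, the omission of learned projections and multiple heads, and all variable names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(q_force, kv_vib_ae):
    """Single-head scaled dot-product cross-attention: physically robust
    force features (Query) attend to vibration/AE features (Key = Value)."""
    d = q_force.shape[-1]
    scores = q_force @ kv_vib_ae.T / np.sqrt(d)   # (T_q, T_k) similarity
    weights = softmax(scores, axis=-1)            # each row sums to 1
    fused = weights @ kv_vib_ae                   # (T_q, d) fused features
    return fused, weights

rng = np.random.default_rng(0)
force = rng.standard_normal((50, 64))     # 50 time steps, 64-dim features
vib_ae = rng.standard_normal((50, 64))
fused, w = cross_modal_attention(force, vib_ae)
```

In a full model, learned linear projections would map each modality into the shared attention space, and the attention weights themselves provide the interpretability analyzed later via stage-wise visualization.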
Building upon the research background and proposed solutions, a mathematical modeling framework is established, and experiments on the PHM2010 tool wear dataset are conducted to validate the effectiveness of the model in estimating RUL and to obtain key performance indicators that reflect prediction accuracy. The remainder of this article is organized as follows: Section 2 outlines the preliminary knowledge; Section 3 details the proposed hybrid modeling framework; Section 4 presents the experimental validations; and Section 5 draws the conclusions. The overall flowchart of this study is depicted in Figure 1.
4. Experiment and Discussion
4.1. Dataset Introduction
Considering industrial applicability and data reliability, the PHM2010 tool wear dataset was employed in this study. As a widely recognized public benchmark in the field of tool condition monitoring and prediction, the PHM2010 dataset was collected under strictly standardized experimental protocols, providing a reliable and reproducible foundation for research. The core machining parameters include a spindle speed of 10,400 r/min, a feed rate of 1555 mm/min, a radial cutting depth of 0.125 mm, and an axial cutting depth of 0.2 mm. These parameters jointly establish a stable cutting environment, ensuring data consistency and comparability across experiments.
During the experiments, six cemented carbide ball-end milling tools (C1–C6) were employed to machine Inconel 718 workpieces, each performing 315 cutting cycles. The mechanical properties of Inconel 718 provided typical loading conditions for investigating the tool wear process. Among the six tools, C1, C4, and C6 were selected for analysis. According to the measured wear depth, the degradation of each tool was categorized into four stages: initial wear (0–30 μm), steady wear (30–90 μm), severe wear (90–150 μm), and failure (≥150 μm). The experimental conditions are summarized in Table 1, and the experimental setup is illustrated in Figure 5.
4.2. Signal Pretreatment
4.2.1. Sensor Data Standardization
To mitigate differences among heterogeneous sensor modalities (cutting force, vibration, and acoustic emission) and ensure feature scale parity, a consolidated standardization strategy was implemented:

X′ = (X − μ) / σ

where X is the raw sensor signal, and μ and σ represent the mean and standard deviation calculated exclusively on the training set. These statistics are then applied to standardize the training, validation, and test sets, ensuring that no information from the test set influences the preprocessing parameters and thus preventing data leakage.
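The leakage-free standardization described above can be sketched as follows; function names, array shapes, and the small epsilon guard are illustrative assumptions, not the paper's code.

```python
import numpy as np

def fit_standardizer(train):
    """Compute per-channel mean and std on the TRAINING set only."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    return mu, sigma

def apply_standardizer(x, mu, sigma, eps=1e-8):
    """Standardize any split with training statistics (eps avoids /0)."""
    return (x - mu) / (sigma + eps)

rng = np.random.default_rng(1)
train = rng.normal(5.0, 2.0, size=(1000, 7))   # 7 sensor channels
test = rng.normal(5.0, 2.0, size=(200, 7))

mu, sigma = fit_standardizer(train)
train_z = apply_standardizer(train, mu, sigma)
test_z = apply_standardizer(test, mu, sigma)   # train stats reused: no leakage
```

Reusing `mu` and `sigma` on the validation and test sets is what prevents test-set information from influencing the preprocessing parameters.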
4.2.2. Tensor Structure
To ensure strict data independence, the final T data points of the steady-state cutting region of each cutting cycle were extracted, yielding a unique, one-to-one mapping between each physical cutting pass and an independent feature tensor. Consequently, the raw time-series sensor data are transformed into a three-dimensional feature tensor X, defined as follows:

X ∈ R^(N × T × S)

where N is the total number of collected samples, T is the number of time steps per sample (T = 20,000 in this work), and S denotes the 7 sensor channels.
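The tensor construction can be sketched as below; a reduced T is used for illustration (T = 20,000 in the paper), and the variable-length cycle lists are synthetic.

```python
import numpy as np

def build_tensor(cycles, T):
    """Stack the final T points of each cutting cycle into an (N, T, S)
    tensor. `cycles` is a list of arrays of shape (len_i, S), len_i >= T,
    one per physical cutting pass (one-to-one mapping)."""
    return np.stack([c[-T:] for c in cycles], axis=0)

rng = np.random.default_rng(2)
T, S = 200, 7   # reduced T for illustration; S = 7 sensor channels
cycles = [rng.standard_normal((T + int(rng.integers(0, 50)), S))
          for _ in range(315)]          # 315 cycles of varying length
X = build_tensor(cycles, T)             # shape (315, 200, 7)
```

Taking only the tail of each cycle keeps the steady-state cutting region and guarantees one independent sample per pass.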
4.2.3. Robust Treatment of Wear Labels
The wear label is computed by averaging multiple measurements within a single tool cycle:

w̄_i = (1/K) Σ_{j=1}^{K} w_j

where w̄_i is the averaged wear label of the i-th cutting cycle, w_j is the j-th wear measurement within this cycle, and K is the total number of repeated wear measurements in a single tool cycle.
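As a small worked example of the label averaging above (the measurement values are hypothetical, not taken from the dataset):

```python
import numpy as np

def average_wear_label(measurements):
    """Average K repeated wear measurements within one cutting cycle
    into a single robust wear label (in micrometers)."""
    return float(np.mean(measurements))

# Three hypothetical repeated readings for one cycle, in um:
label = average_wear_label([41.8, 42.3, 42.1])
```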
4.3. RUL Estimation Logic
To bridge the gap between wear regression and life prediction, we establish a mapping from the estimated wear value to the remaining life. Tool life is defined herein by a predetermined failure threshold W_th (e.g., 150 μm for the PHM2010 dataset). The RUL at any given cutting cycle t is calculated as follows:

RUL(t) = t_f − t

where t_f represents the specific cutting cycle at which the predicted wear curve W(t) intersects the threshold line W_th. Under this formulation, the accuracy of the RUL estimate is intrinsically tied to the fidelity of the wear regression. By minimizing the regression error across the entire degradation trajectory, the proposed model keeps the predicted intersection point t_f closely aligned with the actual tool failure time, providing a reliable buffer for preventive maintenance.
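The threshold-crossing logic can be sketched as follows; the synthetic linear wear curve and function name are illustrative assumptions.

```python
import numpy as np

def estimate_rul(pred_wear, t, w_th=150.0):
    """RUL(t) = t_f - t, where t_f is the first cycle at which the
    predicted wear curve reaches the failure threshold w_th."""
    over = np.nonzero(pred_wear >= w_th)[0]
    if over.size == 0:
        return None               # threshold never reached in the horizon
    t_f = int(over[0])
    return max(t_f - t, 0)        # clamp: no negative remaining life

# Synthetic monotone wear trajectory over 315 cycles (0 to 200 um):
wear = np.linspace(0.0, 200.0, 315)
rul = estimate_rul(wear, t=100)   # cycles remaining at cycle 100
```

Returning `None` when the curve never crosses W_th makes the undefined case explicit rather than extrapolating beyond the prediction horizon.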
4.4. Experimental Results and Performance Analysis
To quantify the prediction accuracy of the proposed model, three performance metrics were employed: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the Coefficient of Determination (R²). RMSE measures the average magnitude of prediction errors and penalizes large deviations more severely, while MAE reflects the average absolute difference between the predicted and actual values; smaller values of both indicate higher prediction accuracy. R² evaluates the agreement between the predicted and actual wear curves, with values closer to 1 representing a better model fit. The mathematical definitions of these evaluation metrics are summarized in Table 2.
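The three metrics follow their standard definitions and can be computed directly; the small example arrays are illustrative only.

```python
import numpy as np

def rmse(y, yhat):
    """Root Mean Square Error: penalizes large deviations quadratically."""
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    """Mean Absolute Error: average absolute deviation."""
    return float(np.mean(np.abs(y - yhat)))

def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

y = np.array([10.0, 20.0, 30.0, 40.0])      # illustrative true wear values
yhat = np.array([12.0, 18.0, 33.0, 39.0])   # illustrative predictions
```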
To evaluate the proposed method, two experimental protocols were implemented. Protocol A adopts a randomized shuffle–split paradigm to rigorously assess the model’s interpolation fidelity and feature extraction efficiency. By aggregating multi-sensory data into a composite observation space with a randomized 9:1 partitioning strategy, this protocol ensures the training distribution comprehensively spans the entire degradation lifecycle. This setup validates the model’s ability to capture the underlying degradation dynamics under consistent data distributions, thereby confirming its predictive reliability and sequence modeling efficacy for seen tool instances.
In contrast, Protocol B adopts a cross-cutter validation strategy to evaluate generalization robustness against inter-tool domain shifts. Under this configuration, the source domain comprises two cutter instances (C1 and C6), while Tool C4 is strictly isolated as an unseen target domain. This separation scrutinizes the model’s ability to extrapolate learned degradation mechanisms to a novel tool instance, thereby simulating realistic industrial deployment scenarios.
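The two protocols can be summarized as data-partitioning sketches; the cutter identifiers follow the paper, while function names and the pooled sample count (three cutters × 315 cycles) are illustrative.

```python
import numpy as np

def protocol_a_split(n_samples, train_frac=0.9, seed=0):
    """Protocol A: randomized 9:1 shuffle-split over pooled samples,
    so training spans the entire degradation lifecycle."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(train_frac * n_samples)
    return idx[:n_train], idx[n_train:]

def protocol_b_split(data_by_cutter):
    """Protocol B: train on cutters C1 and C6, hold out C4 entirely
    as an unseen target domain (cross-cutter validation)."""
    train = {c: data_by_cutter[c] for c in ("C1", "C6")}
    test = {"C4": data_by_cutter["C4"]}
    return train, test

tr, te = protocol_a_split(3 * 315)          # 945 pooled samples
train_b, test_b = protocol_b_split({"C1": [0], "C4": [1], "C6": [2]})
```

Protocol A tests interpolation under a shared distribution; Protocol B tests extrapolation to a tool instance never seen during training.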
All models were implemented in PyTorch 2.0.0 with CUDA 11.8 and executed on an NVIDIA GeForce RTX 3090 GPU. To ensure statistical significance, all performance metrics were derived from three independent runs adopting distinct random seeds for initialization and data partitioning. Furthermore, normalization statistics were strictly computed solely on the training set to thoroughly eliminate the risk of data leakage.
4.4.1. Protocol A: Interpolation Experiment
The network was optimized using the AdamW optimizer with a weight decay of 1 × 10⁻⁴ to mitigate overfitting. The initial learning rate was set to 1 × 10⁻³, regulated by a CosineAnnealingWarmRestarts scheduler to facilitate escape from local optima and ensure stable convergence. The batch size was fixed at 64, and the model was trained for 100 epochs with a total training duration of approximately 400 s. MSE was employed as the objective function, and Automatic Mixed Precision training was utilized to enhance computational efficiency.
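The learning-rate schedule used above follows the standard warm-restart cosine rule, η_t = η_min + ½(η_max − η_min)(1 + cos(π·T_cur/T_i)). The sketch below reproduces that formula in plain Python; the restart period T_0 = 10, the multiplier T_mult = 2, and η_min = 0 are assumed values, not reported by the paper.

```python
import math

def cosine_warm_restart_lr(epoch, eta_max=1e-3, eta_min=0.0, T_0=10, T_mult=2):
    """Learning rate of a cosine-annealing warm-restart schedule:
    eta = eta_min + 0.5*(eta_max - eta_min)*(1 + cos(pi * T_cur / T_i)).
    The period T_i is multiplied by T_mult after every restart."""
    T_i, t = T_0, epoch
    while t >= T_i:          # find position t within the current period
        t -= T_i
        T_i *= T_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T_i))
```

At each restart the rate jumps back to η_max, which is the mechanism that helps the optimizer escape local optima.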
The model fidelity and interpolation capability are validated in Figure 6, while the quantitative results in Table 3 and the visual evidence in Figure 7 collectively reveal performance disparities among the evaluated architectures. The proposed Multi-Channel CNN and Cross-Modal Transformer framework achieves the minimum RMSE and MAE across all tool subsets. Taking the C1 subset as a representative case, the model yields an RMSE of 2.51 and an MAE of 1.98, an error reduction of over 60% compared to the LSTM baseline.

Comparative analysis further shows that the SVR and CNN baselines struggle with stochastic signal fluctuations, whereas the LSTM model exhibits delayed responses during tool wear stage shifts. In contrast, the proposed method achieves a substantial precision gain by extracting localized high-frequency transients while concurrently tracking global degradation trends. The fitting curves in Figure 6 directly reflect the framework's robust nonlinear mapping capability in noise-intensive environments. This noise resilience and stable estimation under stochastic interference corroborate the model's effectiveness and reliability for RUL prediction of seen, homogeneous cutting tools under consistent distributions, providing a solid baseline for industrial cross-tool generalization studies.
4.4.2. Protocol B: Model Generalization Test
For Protocol B, aimed at cross-domain generalization, the batch size was adjusted to 32 and the learning rate set to 5 × 10⁻⁴. The model was trained for 100 epochs with a total training duration of approximately 350 s. Furthermore, a monotonicity constraint was integrated into the conventional MSE loss to ensure physical consistency. All other experimental conditions and hyperparameters remained consistent with Protocol A.
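One common way to impose such a monotonicity constraint, sketched below, penalizes decreases in the predicted wear sequence, since tool wear is physically non-decreasing. The penalty form and the weight `lam` are assumptions for illustration; the paper does not specify its exact constraint formulation.

```python
import numpy as np

def mono_mse_loss(pred, target, lam=0.1):
    """MSE plus a penalty on decreases in the predicted wear sequence:
    negative first differences of `pred` (i.e., wear going down) are
    penalized via relu(-(w[t+1] - w[t])), weighted by `lam`."""
    mse = np.mean((pred - target) ** 2)
    decreases = np.maximum(-np.diff(pred), 0.0)   # relu of negated diffs
    return float(mse + lam * np.mean(decreases))

target = np.array([10.0, 20.0, 30.0])
mono = np.array([11.0, 19.0, 31.0])     # non-decreasing: no penalty
wiggly = np.array([11.0, 31.0, 19.0])   # dips mid-sequence: penalized
```

In a PyTorch training loop the same term would be added to `MSELoss` using `torch.relu` on the prediction differences, keeping the loss differentiable.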
To evaluate the model's generalization robustness under domain shift, we adopt the cross-cutter protocol, designating Tool C4 as an entirely unseen target domain for validation. The predictive performance and goodness of fit of the four models are illustrated in Figure 8. The data presented in Figure 9 and Table 4 corroborate these observations, revealing that the individual CNN and Transformer models have limited adaptability to the target domain, with R² scores of only 0.649 and 0.743, respectively. These results suggest that single-scale or single-modality models have difficulty capturing universal degradation signatures across different cutters. In contrast, the hybrid CNN-Transformer baseline with simple signal concatenation achieved an R² of 0.905 and an RMSE of 11.04, indicating that the integration of local feature extraction and temporal modeling provides a more robust foundation for domain adaptation.
In the generalization tests (Table 4), the proposed method achieved an RMSE of 6.92, an MAE of 6.09, and an R² of 0.961. Compared to the hybrid CNN-Transformer baseline, these results represent a 37.32% reduction in RMSE and a 6.19% improvement in R², enhancing prediction fidelity under domain shift. Figure 8 illustrates the model's high-fidelity tracking during the noise-intensive late wear stage, where the adaptive cross-modal alignment, analyzed further in Section 4.4.3, effectively mitigates stochastic disturbances. This robust generalization enables accurate failure-threshold prediction for unseen cutters, directly providing reliable RUL outputs. To further assess real-time suitability for industrial deployment, the inference time of the model was evaluated: single-sample inference takes approximately 7.64 ms, and under batch inference the average latency per sample drops to 1.13 ms. Processing the unseen C4 target dataset of 315 samples took only 1218.4 ms. In summary, while maintaining high prediction accuracy, the proposed model meets the real-time requirements for tool RUL prediction in smart manufacturing.
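A latency evaluation of this kind can be sketched with wall-clock timing, as below. The dummy inference function is a stand-in for the trained model's forward pass, and the warm-up/averaging scheme is an assumption; GPU measurements would additionally require device synchronization before reading the clock.

```python
import time

def measure_latency(infer, sample, batch, n_warmup=3, n_runs=10):
    """Rough wall-clock latency: average single-sample latency vs.
    per-sample latency under batch inference (both in milliseconds)."""
    for _ in range(n_warmup):          # warm-up to exclude one-off costs
        infer(sample)
    t0 = time.perf_counter()
    for _ in range(n_runs):
        infer(sample)
    single_ms = (time.perf_counter() - t0) / n_runs * 1e3
    t0 = time.perf_counter()
    infer(batch)                       # one batched forward pass
    per_sample_ms = (time.perf_counter() - t0) / len(batch) * 1e3
    return single_ms, per_sample_ms

# Stand-in "model": doubles a scalar, or maps over a batch (a list):
dummy = lambda x: [v * 2 for v in x] if isinstance(x, list) else x * 2
s_ms, b_ms = measure_latency(dummy, 1.0, [1.0] * 315)
```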
4.4.3. Physical Interpretation of Attention Weights
To validate this physics-driven design and demonstrate its industrial relevance, this section extracts and visualizes the dynamic distribution of cross-modal attention weights across the four wear stages using the data from Protocol B (Figure 10).
During the transition from the initial wear stage to the steady wear stage, the tool manifests consistent macroscopic volumetric loss, establishing a dynamic mechanical equilibrium. This process generates relatively stable, periodic vibrations, typically dominated by low-frequency components. As these signals serve as effective macroscopic indicators of the progressive wear volume, the model adaptively assigns higher significance to the vibration features during this stage.
When the tool enters the severe wear stage, the accumulation of localized microscopic defects disrupts the stable mechanical rhythm, triggering a gradual decline in the vibration weight. Upon reaching terminal failure, the acoustic emission (AE) weight surges to 0.69, while the vibration weight drops to 0.31. This phenomenon is ascribed to the high-speed, light-load milling conditions inherent to the dataset. Specifically, the shallow depth of cut inhibits the excitation of significant structural chatter, thereby rendering vibration signals less representative of the actual wear progression during the final degradation stages. In contrast, terminal machining transitions toward a state dominated by coating delamination, micro-crack propagation, and plowing effects stemming from severe edge blunting. These discrete microscopic events liberate high-frequency transient strain energy in the form of elastic stress waves. Captured by the AE sensor, these signals naturally become the dominant indicator for representing the terminal wear stage.
As shown in Figure 8, compared to the CNN-Transformer baseline relying on simple signal concatenation, the proposed model achieves lower error bars in both the steady and failure wear stages. This improved trajectory tracking is primarily attributed to its dynamic weight allocation mechanism.