1. Introduction
Cutting tools are core components of CNC machine tools, but they are highly susceptible to wear during actual production processes. Statistics indicate that strategic tool replacement can reduce downtime by up to 75%, enhance production efficiency by 10–60%, and reduce production costs by 10–40% [1]. Tool wear severely impacts machining accuracy and production efficiency and can even pose safety risks [2,3]. Therefore, tool remaining useful life (RUL) prediction has emerged as a critical imperative in the intelligent manufacturing industry [4]. As a new intelligent diagnostic method that integrates computer technology with the concept of intelligent manufacturing, tool RUL prediction technology not only advances traditional tool management models but also provides key technological support for the manufacturing sector's transition towards Industry 4.0 [5]. It can assist in making accurate tool replacement decisions, avoiding resource waste and downtime losses [6].
Driven by their robust capacity to map complex, highly nonlinear relationships between raw sensory signals and physical wear states, data-driven machine learning (ML) techniques have been extensively leveraged for tool RUL prediction. Frequently employed algorithms include support vector machines and their variants [7,8], probabilistic models [9], decision trees [10], shallow neural networks [2,11], and Gaussian process regression (GPR) [12], among others. Owing to their minimal data dependency and low computational overhead, these shallow architectures retain considerable research significance in resource-constrained scenarios. However, with the increasing volume of industrial data and the advancement of computational technologies, shallow models struggle to meet current accuracy demands, often reaching performance saturation once data dimensionality surpasses a certain threshold because of limitations in their expressive power [13]. Furthermore, their inherent reliance on manual feature engineering becomes a scalability bottleneck: extracting hand-crafted features from exponentially growing monitoring data introduces substantial industrial costs.
Deep learning techniques have opened new pathways for addressing the aforementioned bottlenecks. This paradigm shift from traditional methods to data-driven deep learning architectures is a prevalent trend across the entire field of prognostics and health management [14]. Currently, deep learning-based industrial tool life prediction focuses on single convolutional neural networks (CNNs), traditional algorithmic models, recurrent neural networks (RNNs) and their variants, or combinations of these models [15,16]. Such wear prediction models provide a foundational basis for estimating RUL by mapping sensor signals to wear degradation states. CNNs are widely employed in tool wear monitoring and prediction owing to their remarkable efficacy in extracting spatial features from diverse data sources, including sensor signals and wear images [17,18]. For example, to address the complex demands of both short-term monitoring and long-term prediction, the MSFnet designed by Quan et al. integrates multi-scale residual modules and parallel spatiotemporal fusion modules [19]. Furthermore, when processing multi-view images under challenging conditions such as imbalanced datasets, the CNN-based binary classification model developed by García-Pérez et al. effectively combines data augmentation and class-weighting techniques, achieving an accuracy of 97.8% across various insert types [20]. However, restricted by local receptive fields, CNNs struggle to capture the long-term temporal dependencies essential for tracking wear development [21,22]. To overcome this, RNNs and their variants (e.g., LSTM, GRU) are widely adopted to model long-term dynamics in sequential signals, providing a foundation for RUL prediction [23]. For instance, to extract temporal dependencies from cutting forces, the SVD-BiLSTM model developed by Wu et al. integrates Hankel matrix reconstruction with singular value decomposition [24]. Similarly, to enhance wear-state monitoring as a precursor to life-cycle forecasting, the hybrid stacked-LSTM model introduced by Cai et al. fuses multi-sensor features with process information [25]. However, the recursive nature of RNNs precludes parallel computation, incurring high training latency. Moreover, their susceptibility to vanishing gradients over extended sequences hampers the capture of cross-cycle evolution and weak-cycle wear features essential for precise RUL estimation [26].
Beyond the architectural limitations of individual models, most existing deep learning approaches fuse multi-sensor signals, such as cutting force, vibration, and acoustic emission (AE), through rudimentary concatenation or linear weighting at the input or feature level [27]. From a tribological perspective, tool degradation is a multi-scale process in which steady mechanical loading coexists with stochastic microscopic transients. Assigning uniform significance to different modalities neglects their distinct physical characteristics and temporal dynamics, producing a scale mismatch that limits the model's ability to discriminate stage-specific degradation signatures.
To transcend the aforementioned limitations of CNNs and RNNs, Transformer models have emerged, demonstrating enhanced sequence modeling capability: their self-attention mechanism enables global parallel computation and explicit positional encoding, rendering them highly suitable for large-scale industrial datasets [28]. Their effectiveness has been validated in medical imaging [29], bioinformatics [30], and wind speed forecasting [31]. However, in tool RUL prediction, the global attention paradigm of Transformers presents inherent limitations. Tool wear signals exhibit local burstiness, strong non-stationarity, and substantial noise interference, whereas standard Transformers primarily emphasize long-range dependencies and lack mechanisms to effectively capture localized transient anomalies. Moreover, conventional Transformer architectures do not provide adaptive cross-modal fusion strategies, limiting their ability to dynamically adjust attention weights under varying machining conditions. Consequently, a unified modeling framework is required to simultaneously disentangle frequency-specific wear signatures and achieve adaptive multi-scale alignment across heterogeneous modalities.
Motivated by these challenges, a novel tool RUL prediction method combining a multi-channel CNN with a Transformer model for cross-modal feature fusion is introduced. The primary contributions of this work include: (i) Proposing a unified architecture that bridges short-term impulsive features and long-term degradation trends by dynamically integrating transient physical shocks into global degradation representations. This enables effective modeling of non-stationary tool wear evolution and provides a reliable foundation for accurate RUL estimation. (ii) Designing a multi-scale convolution scheme to construct a denoising mechanism based on the capture of frequency-specific wear characteristics, providing a feature foundation for the next layer of the network architecture. (iii) Developing an adaptive cross-modal fusion mechanism to dynamically weigh heterogeneous sensor inputs based on their distinct signal characteristics. Physically robust force signals serve as the Query to anchor the steady-state degradation trend, while sensitive vibration and AE features act as Key and Value to refine predictions with local anomaly information.
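The Query/Key/Value roles described in contribution (iii) can be sketched as single-head scaled dot-product cross-attention, with force features acting as the Query and vibration/AE features as Key and Value. This is a minimal illustration only, not the paper's actual layer: the feature dimensions, the omission of learned projections and multiple heads, and all variable names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(q_force, kv_vib_ae):
    """Single-head scaled dot-product cross-attention: physically robust
    force features (Query) attend to vibration/AE features (Key = Value)."""
    d = q_force.shape[-1]
    scores = q_force @ kv_vib_ae.T / np.sqrt(d)   # (T_q, T_k) similarity
    weights = softmax(scores, axis=-1)            # each row sums to 1
    fused = weights @ kv_vib_ae                   # (T_q, d) fused features
    return fused, weights

rng = np.random.default_rng(0)
force = rng.standard_normal((50, 64))     # 50 time steps, 64-dim features
vib_ae = rng.standard_normal((50, 64))
fused, w = cross_modal_attention(force, vib_ae)
```

In a full model, learned linear projections would map each modality into the shared attention space, and the attention weights themselves provide the interpretability analyzed later via stage-wise visualization.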
Building upon the research background and proposed solutions, a mathematical modeling framework is established, and experiments on the PHM2010 tool wear dataset are conducted to validate the effectiveness of the model in estimating RUL and to obtain key performance indicators that reflect prediction accuracy. The remainder of this article is organized as follows: Section 2 outlines the preliminary knowledge; Section 3 details the proposed hybrid modeling framework; Section 4 presents the experimental validations; and Section 5 draws the conclusions. The overall flowchart of this study is depicted in Figure 1.
4. Experiment and Discussion
4.1. Dataset Introduction
Considering industrial applicability and data reliability, the PHM2010 tool wear dataset was employed in this study. As a widely recognized public benchmark in the field of tool condition monitoring and prediction, the PHM2010 dataset was collected under strictly standardized experimental protocols, providing a reliable and reproducible foundation for research. The core machining parameters include a spindle speed of 10,400 r/min, a feed rate of 1555 mm/min, a radial cutting depth of 0.125 mm, and an axial cutting depth of 0.2 mm. These parameters jointly establish a stable cutting environment, ensuring data consistency and comparability across experiments.
During the experiments, six cemented carbide ball-end milling tools (C1–C6) were employed to machine Inconel 718 workpieces, each performing 315 cutting cycles. The mechanical properties of Inconel 718 provided typical loading conditions for investigating the tool wear process. Among the six tools, C1, C4, and C6 were selected for analysis. According to the measured wear depth, the degradation of each tool was categorized into four stages: initial wear (0–30 μm), steady wear (30–90 μm), severe wear (90–150 μm), and failure (≥150 μm). The experimental conditions are summarized in Table 1, and the experimental setup is illustrated in Figure 5.
4.2. Signal Pretreatment
4.2.1. Sensor Data Standardization
To mitigate differences among heterogeneous sensor modalities (cutting force, vibration, and acoustic emission) and ensure feature scale parity, a consolidated standardization strategy was implemented:

X′ = (X − μ) / σ

where X is the raw sensor signal, and μ and σ represent the mean and standard deviation calculated exclusively on the training set. These statistics are then applied to standardize the training, validation, and test sets, ensuring that no information from the test set influences the preprocessing parameters and thus preventing data leakage.
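The leakage-free standardization described above can be sketched as follows; function names, array shapes, and the small epsilon guard are illustrative assumptions, not the paper's code.

```python
import numpy as np

def fit_standardizer(train):
    """Compute per-channel mean and std on the TRAINING set only."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    return mu, sigma

def apply_standardizer(x, mu, sigma, eps=1e-8):
    """Standardize any split with training statistics (eps avoids /0)."""
    return (x - mu) / (sigma + eps)

rng = np.random.default_rng(1)
train = rng.normal(5.0, 2.0, size=(1000, 7))   # 7 sensor channels
test = rng.normal(5.0, 2.0, size=(200, 7))

mu, sigma = fit_standardizer(train)
train_z = apply_standardizer(train, mu, sigma)
test_z = apply_standardizer(test, mu, sigma)   # train stats reused: no leakage
```

Reusing `mu` and `sigma` on the validation and test sets is what prevents test-set information from influencing the preprocessing parameters.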
4.2.2. Tensor Structure
To ensure strict data independence, the final T data points of the steady-state cutting region of each cutting cycle were extracted, yielding a unique, one-to-one mapping between each physical cutting pass and an independent feature tensor. Consequently, the raw time-series sensor data are transformed into a three-dimensional feature tensor X, defined as follows:

X ∈ R^(N × T × S)

where N is the total number of collected samples, T is the number of time steps per sample (T = 20,000 in this work), and S denotes the 7 sensor channels.
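The tensor construction can be sketched as below; a reduced T is used for illustration (T = 20,000 in the paper), and the variable-length cycle lists are synthetic.

```python
import numpy as np

def build_tensor(cycles, T):
    """Stack the final T points of each cutting cycle into an (N, T, S)
    tensor. `cycles` is a list of arrays of shape (len_i, S), len_i >= T,
    one per physical cutting pass (one-to-one mapping)."""
    return np.stack([c[-T:] for c in cycles], axis=0)

rng = np.random.default_rng(2)
T, S = 200, 7   # reduced T for illustration; S = 7 sensor channels
cycles = [rng.standard_normal((T + int(rng.integers(0, 50)), S))
          for _ in range(315)]          # 315 cycles of varying length
X = build_tensor(cycles, T)             # shape (315, 200, 7)
```

Taking only the tail of each cycle keeps the steady-state cutting region and guarantees one independent sample per pass.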
4.2.3. Robust Treatment of Wear Labels
The wear label is computed by averaging multiple measurements within a single tool cycle:

w̄_i = (1/K) Σ_{j=1}^{K} w_j

where w̄_i is the averaged wear label of the i-th cutting cycle, w_j is the j-th wear measurement within this cycle, and K is the total number of repeated wear measurements in a single tool cycle.
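As a small worked example of the label averaging above (the measurement values are hypothetical, not taken from the dataset):

```python
import numpy as np

def average_wear_label(measurements):
    """Average K repeated wear measurements within one cutting cycle
    into a single robust wear label (in micrometers)."""
    return float(np.mean(measurements))

# Three hypothetical repeated readings for one cycle, in um:
label = average_wear_label([41.8, 42.3, 42.1])
```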
4.3. RUL Estimation Logic
To bridge the gap between wear regression and life prediction, we establish a mapping from the estimated wear value to the remaining life. Tool life is defined herein by a predetermined failure threshold W_th (e.g., 150 μm for the PHM2010 dataset). The RUL at any given cutting cycle t is calculated as follows:

RUL(t) = t_f − t

where t_f represents the specific cutting cycle at which the predicted wear curve W(t) intersects the threshold line W_th. Under this formulation, the accuracy of the RUL estimate is intrinsically tied to the fidelity of the wear regression. By minimizing the regression error across the entire degradation trajectory, the proposed model keeps the predicted intersection point t_f closely aligned with the actual tool failure time, providing a reliable buffer for preventive maintenance.
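The threshold-crossing logic can be sketched as follows; the synthetic linear wear curve and function name are illustrative assumptions.

```python
import numpy as np

def estimate_rul(pred_wear, t, w_th=150.0):
    """RUL(t) = t_f - t, where t_f is the first cycle at which the
    predicted wear curve reaches the failure threshold w_th."""
    over = np.nonzero(pred_wear >= w_th)[0]
    if over.size == 0:
        return None               # threshold never reached in the horizon
    t_f = int(over[0])
    return max(t_f - t, 0)        # clamp: no negative remaining life

# Synthetic monotone wear trajectory over 315 cycles (0 to 200 um):
wear = np.linspace(0.0, 200.0, 315)
rul = estimate_rul(wear, t=100)   # cycles remaining at cycle 100
```

Returning `None` when the curve never crosses W_th makes the undefined case explicit rather than extrapolating beyond the prediction horizon.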
4.4. Experimental Results and Performance Analysis
To quantify the prediction accuracy of the proposed model, three performance metrics were employed: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the Coefficient of Determination (R²). RMSE measures the average magnitude of prediction errors and penalizes large deviations more severely, while MAE reflects the average absolute difference between the predicted and actual values; smaller values of both indicate higher prediction accuracy. R² evaluates the agreement between the predicted and actual wear curves, with values closer to 1 representing a better model fit. The mathematical definitions of these evaluation metrics are summarized in Table 2.
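The three metrics follow their standard definitions and can be computed directly; the small example arrays are illustrative only.

```python
import numpy as np

def rmse(y, yhat):
    """Root Mean Square Error: penalizes large deviations quadratically."""
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    """Mean Absolute Error: average absolute deviation."""
    return float(np.mean(np.abs(y - yhat)))

def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

y = np.array([10.0, 20.0, 30.0, 40.0])      # illustrative true wear values
yhat = np.array([12.0, 18.0, 33.0, 39.0])   # illustrative predictions
```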
To evaluate the proposed method, two experimental protocols were implemented. Protocol A adopts a randomized shuffle–split paradigm to rigorously assess the model’s interpolation fidelity and feature extraction efficiency. By aggregating multi-sensory data into a composite observation space with a randomized 9:1 partitioning strategy, this protocol ensures the training distribution comprehensively spans the entire degradation lifecycle. This setup validates the model’s ability to capture the underlying degradation dynamics under consistent data distributions, thereby confirming its predictive reliability and sequence modeling efficacy for seen tool instances.
In contrast, Protocol B adopts a cross-cutter validation strategy to evaluate generalization robustness against inter-tool domain shifts. Under this configuration, the source domain comprises two cutter instances (C1 and C6), while Tool C4 is strictly isolated as an unseen target domain. This separation scrutinizes the model’s ability to extrapolate learned degradation mechanisms to a novel tool instance, thereby simulating realistic industrial deployment scenarios.
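The two protocols can be summarized as data-partitioning sketches; the cutter identifiers follow the paper, while function names and the pooled sample count (three cutters × 315 cycles) are illustrative.

```python
import numpy as np

def protocol_a_split(n_samples, train_frac=0.9, seed=0):
    """Protocol A: randomized 9:1 shuffle-split over pooled samples,
    so training spans the entire degradation lifecycle."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(train_frac * n_samples)
    return idx[:n_train], idx[n_train:]

def protocol_b_split(data_by_cutter):
    """Protocol B: train on cutters C1 and C6, hold out C4 entirely
    as an unseen target domain (cross-cutter validation)."""
    train = {c: data_by_cutter[c] for c in ("C1", "C6")}
    test = {"C4": data_by_cutter["C4"]}
    return train, test

tr, te = protocol_a_split(3 * 315)          # 945 pooled samples
train_b, test_b = protocol_b_split({"C1": [0], "C4": [1], "C6": [2]})
```

Protocol A tests interpolation under a shared distribution; Protocol B tests extrapolation to a tool instance never seen during training.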
All models were implemented in PyTorch 2.0.0 with CUDA 11.8 and executed on an NVIDIA GeForce RTX 3090 GPU. To ensure statistical significance, all performance metrics were derived from three independent runs adopting distinct random seeds for initialization and data partitioning. Furthermore, normalization statistics were strictly computed solely on the training set to thoroughly eliminate the risk of data leakage.
4.4.1. Protocol A: Interpolation Experiment
The network was optimized using the AdamW optimizer with a weight decay of 1 × 10⁻⁴ to mitigate overfitting. The initial learning rate was set to 1 × 10⁻³, regulated by a CosineAnnealingWarmRestarts scheduler to facilitate escape from local optima and ensure stable convergence. The batch size was fixed at 64, and the model was trained for 100 epochs with a total training duration of approximately 400 s. MSE was employed as the objective function, and Automatic Mixed Precision training was utilized to enhance computational efficiency.
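The learning-rate schedule used above follows the standard warm-restart cosine rule, η_t = η_min + ½(η_max − η_min)(1 + cos(π·T_cur/T_i)). The sketch below reproduces that formula in plain Python; the restart period T_0 = 10, the multiplier T_mult = 2, and η_min = 0 are assumed values, not reported by the paper.

```python
import math

def cosine_warm_restart_lr(epoch, eta_max=1e-3, eta_min=0.0, T_0=10, T_mult=2):
    """Learning rate of a cosine-annealing warm-restart schedule:
    eta = eta_min + 0.5*(eta_max - eta_min)*(1 + cos(pi * T_cur / T_i)).
    The period T_i is multiplied by T_mult after every restart."""
    T_i, t = T_0, epoch
    while t >= T_i:          # find position t within the current period
        t -= T_i
        T_i *= T_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T_i))
```

At each restart the rate jumps back to η_max, which is the mechanism that helps the optimizer escape local optima.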
The model fidelity and interpolation capability are validated in Figure 6, while the quantitative results in Table 3 and the visual evidence in Figure 7 collectively reveal performance disparities among the evaluated architectures. The proposed Multi-Channel CNN and Cross-Modal Transformer framework achieves the minimum RMSE and MAE across all tool subsets. Taking the C1 subset as a representative case, the model yields an RMSE of 2.51 and an MAE of 1.98, an error reduction of over 60% compared to the LSTM baseline.

Comparative analysis further shows that the SVR and CNN baselines struggle with stochastic signal fluctuations, whereas the LSTM model exhibits delayed responses during tool wear stage shifts. In contrast, the proposed method achieves a substantial precision gain by extracting localized high-frequency transients while concurrently tracking global degradation trends. The fitting curves in Figure 6 directly reflect the framework's robust nonlinear mapping capability in noise-intensive environments. This noise resilience and stable estimation under stochastic interference corroborate the model's effectiveness and reliability for RUL prediction of seen, homogeneous cutting tools under consistent distributions, providing a solid baseline for industrial cross-tool generalization studies.
4.4.2. Protocol B: Model Generalization Test
For Protocol B, aimed at cross-domain generalization, the batch size was adjusted to 32 and the learning rate set to 5 × 10⁻⁴. The model was trained for 100 epochs with a total training duration of approximately 350 s. Furthermore, a monotonicity constraint was integrated into the conventional MSE loss to ensure physical consistency. All other experimental conditions and hyperparameters remained consistent with Protocol A.
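One common way to impose such a monotonicity constraint, sketched below, penalizes decreases in the predicted wear sequence, since tool wear is physically non-decreasing. The penalty form and the weight `lam` are assumptions for illustration; the paper does not specify its exact constraint formulation.

```python
import numpy as np

def mono_mse_loss(pred, target, lam=0.1):
    """MSE plus a penalty on decreases in the predicted wear sequence:
    negative first differences of `pred` (i.e., wear going down) are
    penalized via relu(-(w[t+1] - w[t])), weighted by `lam`."""
    mse = np.mean((pred - target) ** 2)
    decreases = np.maximum(-np.diff(pred), 0.0)   # relu of negated diffs
    return float(mse + lam * np.mean(decreases))

target = np.array([10.0, 20.0, 30.0])
mono = np.array([11.0, 19.0, 31.0])     # non-decreasing: no penalty
wiggly = np.array([11.0, 31.0, 19.0])   # dips mid-sequence: penalized
```

In a PyTorch training loop the same term would be added to `MSELoss` using `torch.relu` on the prediction differences, keeping the loss differentiable.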
To evaluate the model's generalization robustness under domain shift, we adopt the cross-cutter protocol, designating Tool C4 as an entirely unseen target domain for validation. The predictive performance and goodness of fit of the four models are illustrated in Figure 8. The data presented in Figure 9 and Table 4 corroborate these observations, revealing that the individual CNN and Transformer models have limited adaptability to the target domain, with R² scores of only 0.649 and 0.743, respectively. These results suggest that single-scale or single-modality models have difficulty capturing universal degradation signatures across different cutters. In contrast, the hybrid CNN-Transformer baseline with simple signal concatenation achieved an R² of 0.905 and an RMSE of 11.04, indicating that the integration of local feature extraction and temporal modeling provides a more robust foundation for domain adaptation.
In the generalization tests (Table 4), the proposed method achieved an RMSE of 6.92, an MAE of 6.09, and an R² of 0.961. Compared to the hybrid CNN-Transformer baseline, these results represent a 37.32% reduction in RMSE and a 6.19% improvement in R², enhancing prediction fidelity under domain shift. Figure 8 illustrates the model's high-fidelity tracking during the noise-intensive late wear stage, where the adaptive cross-modal alignment, analyzed further in Section 4.4.3, effectively mitigates stochastic disturbances. This robust generalization enables accurate failure-threshold prediction for unseen cutters, directly providing reliable RUL outputs. To further assess real-time suitability for industrial deployment, the inference time of the model was evaluated: single-sample inference takes approximately 7.64 ms, and under batch inference the average latency per sample drops to 1.13 ms. Processing the unseen C4 target dataset of 315 samples took only 1218.4 ms. In summary, while maintaining high prediction accuracy, the proposed model meets the real-time requirements for tool RUL prediction in smart manufacturing.
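A latency evaluation of this kind can be sketched with wall-clock timing, as below. The dummy inference function is a stand-in for the trained model's forward pass, and the warm-up/averaging scheme is an assumption; GPU measurements would additionally require device synchronization before reading the clock.

```python
import time

def measure_latency(infer, sample, batch, n_warmup=3, n_runs=10):
    """Rough wall-clock latency: average single-sample latency vs.
    per-sample latency under batch inference (both in milliseconds)."""
    for _ in range(n_warmup):          # warm-up to exclude one-off costs
        infer(sample)
    t0 = time.perf_counter()
    for _ in range(n_runs):
        infer(sample)
    single_ms = (time.perf_counter() - t0) / n_runs * 1e3
    t0 = time.perf_counter()
    infer(batch)                       # one batched forward pass
    per_sample_ms = (time.perf_counter() - t0) / len(batch) * 1e3
    return single_ms, per_sample_ms

# Stand-in "model": doubles a scalar, or maps over a batch (a list):
dummy = lambda x: [v * 2 for v in x] if isinstance(x, list) else x * 2
s_ms, b_ms = measure_latency(dummy, 1.0, [1.0] * 315)
```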
4.4.3. Physical Interpretation of Attention Weights
To validate this physics-driven design and demonstrate its industrial relevance, this section extracts and visualizes the dynamic distribution of cross-modal attention weights across the four wear stages using the data from Protocol B (Figure 10).
During the transition from the initial wear stage to the steady wear stage, the tool manifests consistent macroscopic volumetric loss, establishing a dynamic mechanical equilibrium. This process generates relatively stable, periodic vibrations, typically dominated by low-frequency components. As these signals serve as effective macroscopic indicators of the progressive wear volume, the model adaptively assigns higher significance to the vibration features during this stage.
When the tool enters the severe wear stage, the accumulation of localized microscopic defects disrupts the stable mechanical rhythm, triggering a gradual decline in the vibration weight. Upon reaching terminal failure, the acoustic emission (AE) weight surges to 0.69, while the vibration weight drops to 0.31. This phenomenon is ascribed to the high-speed, light-load milling conditions inherent to the dataset. Specifically, the shallow depth of cut inhibits the excitation of significant structural chatter, thereby rendering vibration signals less representative of the actual wear progression during the final degradation stages. In contrast, terminal machining transitions toward a state dominated by coating delamination, micro-crack propagation, and plowing effects stemming from severe edge blunting. These discrete microscopic events liberate high-frequency transient strain energy in the form of elastic stress waves. Captured by the AE sensor, these signals naturally become the dominant indicator for representing the terminal wear stage.
As shown in Figure 8, compared to the CNN-Transformer baseline relying on simple signal concatenation, the proposed model achieves lower error bars in both the steady and failure wear stages. This improved trajectory tracking is primarily attributed to its dynamic weight allocation mechanism.