1. Introduction
Titanium alloys are widely utilized in aerospace, petrochemicals, biomedicine, and national defense due to their exceptional properties, including low density, high specific strength, excellent corrosion resistance, and superior biocompatibility [
1,
2]. Following prolonged periods of technical integration, independent research, and industrial promotion, China’s titanium industry has entered a phase of rapid development, with production volumes steadily increasing to establish the nation as a global leader in the sector [
3]. However, the stringent requirements for raw materials, processing techniques, and equipment—coupled with the inherent difficulty of manufacturing—result in high production costs, posing significant challenges for industrial processing. Consequently, achieving low-cost, high-quality fabrication while ensuring the reliability and safety of titanium alloys in industrial applications remains a critical challenge for the engineering community.
In the context of titanium alloy forming, numerous scholars have sought to improve traditional drawing technologies. In recent years, innovative techniques such as electroplastic drawing and electrochemical drawing have been proposed. Electroplastic drawing utilizes high-energy pulse currents to enhance material plasticity and reduce deformation resistance, thereby decreasing the drawing force during processing. However, this method requires specialized DC pulse power supplies to maintain system stability and safety, leading to substantial production costs. Conversely, electrochemical drawing replaces traditional lubricants with specialized active electrolytes and applies micro-currents to the workpiece. This maintains favorable tribological properties on the metal surface and reduces surface deformation resistance. Simultaneously, this method allows for the optimization and adjustment of product surface quality with low energy consumption and minimal equipment requirements [
4]. Therefore, enhancing the surface tribological performance of titanium alloys during processing has become a pivotal issue. As early as the 1960s, Gutman et al. [
5] from Ben-Gurion University of the Negev proposed that mechanochemical effects, also known as electrochemical plasticization [
5], can induce changes in the mechanical properties and microstructure of material surfaces. By applying electrochemical treatment to the drawing process, they observed a reduction in surface residual stress and hardness, which increased the plastic deformation capacity of high-strength alloys. Over the past half-century, electrochemical surface plasticization has been studied extensively, leading to the development of various theoretical frameworks and experimental methods aimed at improving the surface quality of titanium alloy drawing processes [
6]. While previous theoretical analyses relied heavily on data obtained from extensive simulation experiments [
7], the advancement of data mining and machine learning provides a novel paradigm. These modern tools enable the development of predictive models based on experimental data to identify the primary factors influencing the electrochemical surface coefficient of friction (COF) of titanium alloys.
In recent years, researchers globally have conducted extensive studies on electrochemical plasticization and the interactions between corrosion and deformation in various materials [
8,
9,
10]. An increasing number of studies have focused on accelerating corrosion through applied currents to exploit the synergy between corrosion and wear [
10,
11], thereby enhancing surface plasticity and facilitating sliding. Gutman et al. [
12] applied electrochemical treatment to metal surfaces during drawing processes, attributing the success of this method to chemomechanical effects (CME), which reduce surface residual stress and hardness, consequently improving the plastic deformation capacity of high-strength alloys. Chen et al. [
13,
14] proposed a novel metal plastic forming technique termed electrochemical cold drawing (ECD). Findings indicated that surface hardness decreases with increasing current density across various electrolyte solutions [
15]. Notably, the current density required for this method (10¹–10² mA/cm²) is significantly lower than that of the electroplastic effect (10⁸–10⁹ mA/cm²). This provides distinct advantages over traditional drawing processes, including reduced deformation resistance and enhanced plasticity. As research into the impact of electrochemical dissolution on the mechanical properties of metal surfaces has matured, Gutman et al. [
12] utilized ECD on AM60B magnesium alloy bars. Their systematic investigation into ECD parameters—including electrolyte composition, current density, drawing speed, and die aperture—demonstrated that electrochemical treatment significantly enhances alloy formability. However, most existing studies rely solely on experimental data analysis to evaluate improvements in surface tribological properties. In reality, electrochemical tribological performance is subject to the complex, coupled influences of voltage, solution type, concentration, and sliding speed. These variables must be integrated into a multi-factor predictive framework to accurately assess material performance.
The rise of machine learning (ML) has opened a new trajectory for predicting the electrochemical surface COF of titanium alloys. Current research on the tribological characteristics of titanium alloys under electrochemical corrosion often focuses on single factors or limited operating conditions, leaving a gap in the understanding of their coefficient of friction under complex, synergistic environments. Furthermore, most existing ML applications in this field are directed toward predicting material service behavior [
16,
17,
18,
19]. For instance, Alqurashi [
20] proposed a hybrid intelligent framework combining fuzzy logic with Artificial Neural Networks (ANN) to model the erosion-corrosion behavior of glass fiber reinforced pipes (GRP) under harsh conditions, providing accurate quantitative predictions. Zheng et al. [
21] coupled high-throughput testing with ML to develop a model for predicting the erosion-corrosion rate of 90/10 Cu-Ni alloy; a comparison of six ML models revealed that the Random Forest-based model achieved the highest coefficient of determination alongside the lowest error metrics. Kuang and Long [
22] employed various machine learning algorithms to predict the atmospheric corrosion rates of low-alloy steel (LAS). By utilizing attribute transformation descriptors, they effectively analyzed corrosion behaviors and significantly enhanced the generalization capability of the predictive models. Similarly, Li et al. [
23] established a predictive model for the corrosion fatigue crack growth rate of aluminum alloys by integrating the Bayesian bootstrap method with Gradient Boosting Regression Trees (GBRT). The accuracy of their predictive model was quantitatively evaluated using the Mean Squared Error (MSE) and the Coefficient of Determination (R²) as core performance metrics, providing a robust theoretical reference for the subsequent prediction of corrosion fatigue life in aluminum alloys.
Most prior studies have employed machine learning algorithms primarily to predict the corrosion and corrosion fatigue behavior of metals. However, in the electrochemical dynamic tribocorrosion process, the time-evolving friction coefficient is influenced in a nonlinearly coupled manner by multiple parameters such as solution type, concentration, voltage, and sliding speed. Its prediction is essentially a high-dimensional nonlinear regression problem with strong temporal dependency, and specialized machine learning models for this scenario remain scarce. Furthermore, because ML is a data-driven approach, model accuracy depends heavily on the quality and quantity of the sample data. Based on this analysis, this paper focuses on the tribological characteristics of titanium alloys under varying solution type, concentration, sliding velocity, and voltage to systematically analyze the influence of these factors on surface COF. Specifically, this study proposes an enhanced LightGBM model incorporating lag features and rolling statistics. By collecting friction coefficient data across diverse experimental conditions, we modeled and predicted the COF of titanium alloys, established a predictive framework for electrochemical surface COF, and identified how different operating conditions affect material performance. This algorithm was selected because it matches the requirements of electrochemical COF prediction: gradient-boosted regression trees handle complex nonlinear relationships and high-dimensional features well, and can capture the interactions among solution concentration, velocity, and voltage that determine the friction coefficient. LightGBM's built-in feature importance ranking, applied to the base features together with the lag variables and rolling statistics, was used to eliminate redundant and low-contribution features.
2. Experimental and Methodology
2.1. Materials and Experiments
The titanium alloy material used in this study was sourced from Shenzhen Pacific Steel Co., Ltd. and sectioned into rectangular specimens with dimensions of 30 mm × 10 mm × 5 mm using electrical discharge machining, as illustrated in
Figure 1. The chemical composition analysis of the titanium alloy sample was conducted by XRF-1800 X-ray fluorescence spectrometer, and the main components are shown in
Table 1. The surface and cross-section of the titanium alloy sample were ground successively with 200#, 400#, 600#, 800#, 1000#, and 1200# sandpapers on a grinding and polishing machine, then polished with a 1.0 μm diamond polishing agent to a mirror-like finish. The sample was then rinsed with distilled water, ultrasonically cleaned in ethanol for 10 min, and dried with cold air. The electrochemical corrosion and wear performance of the titanium alloy samples were characterized using an MSR-2T electrochemical reciprocating friction and wear tester (Lanzhou Zhongke Kaihua Technology Co., Ltd., Lanzhou, China). This instrument is specifically designed to investigate the corrosive-wear behavior and friction mechanisms of specimens in electrochemical media. Tribological properties were measured using the reciprocating motion module, while electrochemical corrosion behavior was monitored in real time via a CHI604e electrochemical workstation (Shanghai Chenhua Electrochemical Workstation Co., Ltd., Shanghai, China). The reagents and electrochemical corrosion solutions used in the experiments are listed in
Table 2. Specimens were mounted in a 300 mL electrochemical cell. A load was applied vertically onto the specimen surface through a counter-ball indenter mounted on a sensor-integrated loading rod, as shown in
Figure 2. The applied load was generated by calibrated weights stacked in descending order of mass (i.e., smaller weights placed atop larger ones), with the convex side of the counter ball oriented downward to suppress vibration during reciprocation and thereby minimize systematic measurement errors. The electrochemical cell—and thus the lower specimen—was driven horizontally by a motorized actuator to produce reciprocating sliding contact against the stationary counter ball. This configuration enabled continuous acquisition of the time-dependent coefficient of friction at the specimen–counter-ball interface, as shown in
Figure 3.
First, TC4 titanium alloy samples were ground, polished, and ultrasonically cleaned. Electrochemical corrosion and tribological tests were performed using an MSR-2T electrochemical corrosion–tribology coupled testing machine in conjunction with a CHI604e electrochemical workstation. A tungsten carbide (WC) ball (5 mm diameter, 1460 HV) served as the counterface. A normal load of 5 N and a reciprocating amplitude of 5 mm were applied, and the sampling frequency was set to 1 Hz. Tests were conducted at reciprocating frequencies of 60, 120, 180, and 240 cycles/min. The corrosive media were prepared using aqueous sulfuric acid and hydrochloric acid solutions with different concentrations at room temperature. Potentiostatic polarization was carried out at applied potentials of −0.3, 0.2, 0.5, 1.0 and 1.2 V (vs. SCE). By systematically varying the corrosive media, electrochemical potential, and mechanical parameters, the evolution of surface coefficient of friction was investigated. The specific experimental parameters are summarized in
Table 3.
2.2. Experimental Analysis
Electrochemical tribocorrosion tests reveal that under an applied potential of 0.5 V, the alloy exhibits a lower friction coefficient (approximately 0.45) in hydrochloric acid solution than in sulfuric acid solution. Ultra-depth-of-field microscopic observations (
Figure 4) demonstrate that the wear tracks formed in both solutions are relatively smooth with shallow furrows.
In sulfuric acid solution (
Figure 4a), typical delamination wear characteristics are observed. Continuous sliding friction increases the strain rate at contact areas. The combined effect of electrochemical corrosion and stress-induced micro-plastic deformation initiates microcracks on the alloy surface. Synergistic reciprocating sliding and electrochemical corrosion promote crack propagation, eventually leading to spalling and fatigue wear. Meanwhile, the plasticized layer formed via electrochemical corrosion reduces deformation resistance and alleviates local work hardening, thus lowering the friction coefficient under applied potential relative to the uncharged condition.
In hydrochloric acid solution (
Figure 4b), obvious pitting corrosion can be detected. Chloride ions are highly active under the applied electric field and severely damage the passive film on the alloy surface. Enriched chloride ions accelerate localized anodic dissolution and facilitate the formation of corrosion pits. Additionally, continuous friction disrupts and removes the surface passive film, reducing the shear force between the WC ball and the alloy surface, which further decreases the friction coefficient.
Figure 5 shows the EDS elemental distribution results in different regions of the wear scar surface of TC4 alloy in 0.5 mol/L H₂SO₄ solution. At a potential of −0.3 V, a large area of fresh TC4 matrix remains in the wear scar region, with almost no corrosive wear occurring. As the applied potential increases from 0.2 V to 1.0 V, the elemental contents in the wear scar region change correspondingly. The average oxygen content rises continuously, reaching a maximum of 55.33%. Although the average sulfur content is relatively low, it also shows an increasing trend. This indicates that oxides and a small amount of sulfides generated by corrosive friction form on the fresh matrix surface, and that the corrosion products accumulate and thicken as the applied potential increases.
Figure 6 presents the wear track morphologies of TC4 titanium alloy observed via ultra-depth-of-field microscope under an applied potential of 0.5 V in 0.5 mol/L H₂SO₄ solution at different reciprocating sliding speeds. The wear surface remains relatively smooth at low sliding speeds. As the sliding speed increases, the wear track gradually becomes rougher, accompanied by increased residual wear debris and evident adhesive spalling features.
At a low reciprocating speed of 60 cycles/min, moderate plastic deformation and shallow ploughing grooves are detected on the wear surface. The low sliding velocity provides sufficient time for electrochemical dissolution at the contact interface. Although friction tends to thin and remove the surface plasticized layer, the combined effect of applied potential and tribocorrosion allows adequate time for the regeneration of corrosive oxide films. The surface film therefore maintains a stable thickness and provides favorable friction-reducing and lubricating effects.
In contrast, at a high sliding speed of 240 cycles/min, the surface film is frequently sheared and stripped away during continuous friction. The newly formed corrosion products fail to fully cover the contact zone between the WC ball and alloy substrate. The sliding process mainly acts on the fresh anodically dissolved matrix and local oxides, resulting in more severe plastic deformation, denser ploughing grooves and local fragment spalling. Such damage is primarily attributed to corrosion-induced cracking driven by high cyclic contact stress.
This study experimentally investigates the effects of various corrosion parameters, including corrosive medium type, solution concentration, applied potential and reciprocating sliding speed, on the friction coefficient, wear rate, wear track morphology and surface element distribution of the plasticized layer on titanium alloy surfaces.
Experimental results reveal that under the synergistic effect of applied potential and tribocorrosion, the surface plasticized layer of alloy specimens exhibits excellent friction-reducing and lubricating properties in both hydrochloric acid and sulfuric acid solutions. Different corrosive media and concentrations exert distinct influences on the friction coefficient of corroded surfaces. Nevertheless, the quantitative contribution degree of each parameter to surface COF remains unclear. Accordingly, this work aims to establish a data-driven model based on characteristic factors such as solution type, concentration, applied voltage and sliding speed to realize accurate prediction of friction coefficient.
All data used for dataset construction in this study are entirely derived from controlled electrochemical corrosion and wear experiments. In alignment with standard electrochemical corrosion conditions, the solution type, concentration, velocity and voltage were selected as the primary research parameters to explore their influence on the friction coefficient of the titanium alloy. Based on practical operating conditions, 60 min tests were conducted for each parameter, generating 30 initial sets of experimental conditions. Using cross-factor experimental design, the dataset was further expanded to 250 data points. These data were divided into a training set (90%) and a test set (10%). This methodology enabled a comprehensive investigation into the effects of relevant variables on the tribocorrosion friction coefficient of titanium alloy, as well as a rigorous evaluation of its performance under complex service conditions.
2.3. Machine Learning Methods
In the field of material tribology, it is essential to collect, process, and analyze extensive experimental datasets that encompass friction coefficients and their underlying correlations with external operating conditions. However, due to the high-dimensional nature of such data and the complex non-linear relationships between variables, traditional analytical methods often fail to fully reveal the intrinsic associations within the data. Consequently, the application of sophisticated machine learning algorithms and data modeling strategies can effectively extract latent associative features, significantly enhancing the accuracy and reliability of predictive outcomes. This provides a rigorous scientific foundation for optimizing the processing and expanding the industrial applications of titanium alloys.
Random Forest (RF) constitutes a robust bagging ensemble that exhibits low variance and strong resilience to noise and outliers, rendering it a reliable baseline for time-series forecasting. However, its bootstrap sampling procedure may disrupt temporal dependencies, and its predictive capacity is often outperformed by gradient-boosted trees in modeling complex nonlinear and non-stationary dynamics.
LightGBM, an efficient gradient-boosting decision tree implementation, leverages histogram-based splitting, gradient-based one-side sampling (GOSS), and leaf-wise growth to deliver superior predictive accuracy and computational efficiency on large-scale time-series data. It excels in capturing intricate temporal patterns, feature interactions, and residual structures, making it the preferred choice for high-performance forecasting in contemporary research.
In benchmarks across diverse time-series domains (e.g., finance, energy, traffic), LightGBM has often been reported to achieve lower forecasting errors (RMSE, MAE, MAPE) than Random Forest, particularly for high-dimensional, large-sample, and dynamically complex datasets. RF remains valuable for small, noisy datasets where stability and interpretability are prioritized.
Considering that the electrochemical corrosion wear dataset of titanium alloys presents high dimensionality, complex nonlinear relationships, and obvious time-series-dependent characteristics, the LightGBM algorithm is adopted in this study for model training and predictive analysis.
2.4. LightGBM Modeling
In this study, the LightGBM model was selected for training and prediction. As an efficient gradient boosting decision tree framework, LightGBM demonstrates exceptional training efficiency and predictive precision when handling large-scale, high-dimensional data by employing a leaf-wise growth strategy and histogram-based optimization [
24,
25,
26]. Its core mechanisms involve iterative residual fitting, a tree-based decision framework, and a regularized objective function.
Figure 7 systematically illustrates the core operational mechanism of the LightGBM framework. At the feature engineering level, the model utilizes continuous feature discretization and histogram construction strategies to quantify floating-point features into discrete bins and aggregate gradient information. By leveraging histogram subtraction for the rapid calculation of splitting gains, the model significantly reduces memory overhead while enhancing computational throughput. At the ensemble learning level, the framework adopts a gradient boosting architecture with a leaf-wise tree growth strategy. Using the training set as input, the model iteratively constructs decision trees across successive generations, where each new tree fits the residual error of the preceding models. Ultimately, a global predictive model is established through the linear superposition of all base learners. This architecture achieves a sophisticated equilibrium between computational efficiency and generalization performance, providing critical technical support for high-efficiency learning in complex data scenarios.
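The iterative residual fitting and linear superposition described above can be illustrated with a minimal sketch. It is not LightGBM itself: it assumes squared-error loss, a single input feature, and one-split trees (stumps), and it omits histogram binning and leaf-wise growth. The function names and the learning-rate value are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Stump = Tuple[float, float, float]  # (threshold, left value, right value)


def fit_stump(x: List[float], residuals: List[float]) -> Stump:
    """Fit a one-split regression tree to the residuals (least squares)."""
    mean_r = sum(residuals) / len(residuals)
    best_sse, best = float("inf"), (x[0], mean_r, mean_r)  # fallback: no split
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if sse < best_sse:
            best_sse, best = sse, (thr, lv, rv)
    return best


def predict_stump(stump: Stump, xi: float) -> float:
    thr, lv, rv = stump
    return lv if xi <= thr else rv


def fit_boosting(x: List[float], y: List[float],
                 n_trees: int = 50, learning_rate: float = 0.1) -> Dict:
    """Each new stump fits the residual error left by the previous ensemble."""
    base = sum(y) / len(y)              # initial constant prediction
    pred = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * predict_stump(stump, xi)
                for pi, xi in zip(pred, x)]
    return {"base": base, "lr": learning_rate, "stumps": stumps}


def predict(model: Dict, xi: float) -> float:
    """Final prediction is the weighted sum of all base learners."""
    return model["base"] + model["lr"] * sum(
        predict_stump(s, xi) for s in model["stumps"])
```

For squared-error loss the residual equals the negative gradient of the loss, so this loop is the simplest instance of gradient boosting; production frameworks add regularization, multi-leaf trees, and histogram-based split search on top of the same idea.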
This study employs a systematic workflow for predicting the coefficient of friction (COF) in tribocorrosion processes, as illustrated in
Figure 8. First, tribocorrosion experiments are conducted under varying operating parameters for a duration of 60 min to generate the raw dataset. The raw data are then preprocessed to remove noise, handle missing values, and standardize the signals. Following preprocessing, the dataset is enriched with lagged features and rolling statistical features, which are designed to capture the temporal dependencies and evolving statistical characteristics of the friction coefficient signals. Subsequently, the enriched dataset is used to train a LightGBM model. Finally, the trained model is applied to perform time-series prediction of the COF.
3. Results and Discussion
3.1. Data Description and Preprocessing
The performance ceiling of a machine learning model is fundamentally determined by data quality, while feature engineering serves as the core process for extracting latent patterns and enhancing model generalization. The experimental data concerning the electrochemical corrosive-wear of titanium alloys exhibit prominent time-series characteristics, with evolutionary behavior governed by the non-linear coupled regulation of factors such as solution concentration, voltage, and sliding velocity. Consequently, this study first performed an exhaustive descriptive analysis of the raw experimental data, followed by the design and implementation of a systematic data preprocessing and feature engineering framework to construct a high-quality dataset for model training and optimization.
The friction coefficient data used in this study were obtained from laboratory records of electrochemical corrosive-wear experiments, encompassing measured values of the friction coefficient over time under diverse experimental conditions. The raw dataset consists of six fields, which are categorized into independent variables (input features) and a dependent variable (target variable) based on the experimental mechanism. The specific definitions and descriptions of each field are provided in
Table 4.
A preliminary Exploratory Data Analysis (EDA) was conducted to evaluate the sample distribution across various experimental conditions. The experiments encompassed sulfuric acid and hydrochloric acid solutions at different concentrations, with reciprocating frequencies of 60, 120, 180, and 240 cycles/min. Applied voltage conditions included −0.3 V, 0 V (open circuit), 0.2 V, 0.5 V, 1.0 V, and 1.2 V. The raw experimental data contained a small number of missing values and outliers resulting from sensor fluctuations. To address these data quality issues, a systematic workflow comprising data cleaning, outlier handling, missing value imputation, and data standardization was implemented.
To eliminate the influence of varying measurement scales on model performance, the raw input variables were subjected to Min-Max Normalization according to the following equation:
$$x_i^{*} = \frac{x_i - x_{i,\min}}{x_{i,\max} - x_{i,\min}} \tag{1}$$
In Equation (1), $x_i$ and $x_i^{*}$ represent the raw and normalized values of the ith variable, respectively, while $x_{i,\max}$ and $x_{i,\min}$ denote the maximum and minimum values of that variable. Concurrently, to mitigate the adverse impact of the wide numerical distribution range of the friction coefficient on model fitting, a logarithmic transformation was applied to the friction coefficient $\mu$ for each data set under corrosive-wear conditions. The transformed value, $\log \mu$, served as the model output variable to enhance fitting performance and convergence stability for data with large numerical spans. To quantitatively evaluate the predictive accuracy and generalization capability of the constructed models, the Coefficient of Determination ($R^2$) and Mean Squared Error ($\mathrm{MSE}$) were selected as core evaluation metrics, defined as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{2}$$
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{3}$$
In Equations (2) and (3), $y_i$ is the actual value of the target variable, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the actual values, and n is the total number of samples. Generally, an $R^2$ value closer to 1 and an $\mathrm{MSE}$ closer to 0 indicate smaller overall prediction errors and higher accuracy in model fitting and forecasting.
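The preprocessing and evaluation steps of this subsection can be sketched in a few lines of dependency-free Python. The function names are illustrative, and the use of the natural logarithm is an assumption, since the base of the logarithmic transformation is not stated.

```python
import math
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale one input variable to [0, 1] using its own minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def log_transform(cof: Sequence[float]) -> List[float]:
    """Log-transform the friction coefficient (natural log is an assumption)."""
    return [math.log(c) for c in cof]


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error between actual and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot
```

If the model is trained on the log-transformed target, predictions need to be mapped back with `math.exp` before they are compared with measured friction coefficients on the original scale.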
3.2. Feature Selection
Feature selection is a critical stage in enhancing the accuracy of time-series prediction models. In this study, in addition to retaining experimental conditions as base features, we constructed lag features, rolling statistical features, and derived statistical features based on the inherent characteristics of the time series. This approach was designed to systematically characterize the dynamic evolutionary patterns of the friction coefficient. The base features were extracted directly from the raw experimental data to represent the macroscopic operating environment of the electrochemical corrosive-wear experiments. These primarily include temporal features and condition features. Temporal features encompass absolute time and normalized relative time (the ratio of elapsed time to the total experimental duration). Condition features include solution encoding, concentration, sliding velocity, and operating voltage. The construction of time-series features specifically addresses the significant temporal dependencies and hysteresis effects inherent in friction coefficient variations. To effectively capture these dynamic characteristics, two key categories of features were developed:
(1) Lag Features: Lag features utilize the friction coefficient values from historical time points to facilitate the prediction of the current state, effectively reflecting the intrinsic memory effect of the tribological system. As shown in Equation (4), for the target variable (friction coefficient $\mu_t$ at time step t), this study constructed lag features with orders $k$ = 1, 2, …, 5:
$$\mathrm{lag}_k(t) = \mu_{t-k} \tag{4}$$
By incorporating these lag features, the model can effectively learn the short-term inertial changes and fluctuation trends of the friction coefficient.
(2) Rolling Statistical Features: Rolling statistical features employ a sliding window mechanism to smooth the data, enabling the extraction of statistical patterns within local time scales. This approach is instrumental in suppressing experimental noise. In this study, sliding windows with sizes of 3, 5, and 10 were utilized to calculate the following features:
Rolling Mean: Characterizes the local average level of the friction coefficient.
Rolling Standard Deviation: Reflects the local fluctuation intensity of the friction coefficient.
Rolling Extremes (Minimum/Maximum): Captures the boundary values of the friction coefficient within the window.
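The calculation for the rolling mean is defined as follows:
$$\bar{\mu}_t^{(w)} = \frac{1}{w} \sum_{i=0}^{w-1} \mu_{t-i} \tag{5}$$
In Equation (5), w represents the size of the sliding window and $\mu_t$ is the friction coefficient at time step t.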
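The two feature families described above can be sketched as follows for a single experiment's friction coefficient series. This is a dependency-free illustration, not the study's code: the column names are hypothetical, and the sample standard deviation and a trailing window that ends at the current sample are assumptions. The text does not say whether the window includes the current sample; if the current COF is the prediction target, the window would need to be shifted back by one step to avoid using the target as an input.

```python
import statistics
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[float]]


def build_time_features(cof: List[float], max_lag: int = 5,
                        windows: Sequence[int] = (3, 5, 10)) -> List[Row]:
    """Build lag and rolling features for one experiment's COF series."""
    rows: List[Row] = []
    for t, value in enumerate(cof):
        row: Row = {"cof": value}
        # Lag features: the COF observed k steps earlier (None if unavailable).
        for k in range(1, max_lag + 1):
            row[f"lag_{k}"] = cof[t - k] if t >= k else None
        # Rolling statistics over the last w samples, ending at step t.
        for w in windows:
            window = cof[t - w + 1:t + 1] if t + 1 >= w else None
            row[f"roll_mean_{w}"] = sum(window) / w if window else None
            row[f"roll_std_{w}"] = statistics.stdev(window) if window else None
            row[f"roll_min_{w}"] = min(window) if window else None
            row[f"roll_max_{w}"] = max(window) if window else None
        rows.append(row)
    return rows
```

Calling the function once per experiment keeps lags and windows from reaching across experiment boundaries. Rows near the start of a series contain `None` entries and would have to be dropped or left as missing values before training.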
In time-series prediction tasks, data samples possess inherent temporal correlations. Adopting traditional random partitioning methods can easily introduce data leakage (the inclusion of “future” information), leading to over-optimistic and unreliable evaluation results. Consequently, this study adopts a rigorous chronological splitting strategy. The dataset was partitioned proportionally using individual experiments as the basic unit. First, the raw data were grouped according to their unique Experiment IDs. Subsequently, within each experimental group, the first 90% of samples were assigned chronologically to the training set for model parameter learning, while the remaining 10% served as the test set to validate generalization capability. Finally, the training and test sets from all experiments were merged to form the global training and test sets. This partitioning method ensures that all test samples occur strictly later in time than the training samples, effectively simulating a real-world prediction scenario. This approach eliminates information leakage and ensures that the model evaluation results are both objective and credible.
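A minimal sketch of this per-experiment chronological split is shown below. The record layout (dictionaries with `experiment_id` and `time_s` keys) is a hypothetical stand-in for the actual data format.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Dict[str, object]


def chronological_split(rows: List[Record],
                        train_frac: float = 0.9) -> Tuple[List[Record], List[Record]]:
    """Split each experiment by time: earliest samples train, latest samples test."""
    groups: Dict[object, List[Record]] = defaultdict(list)
    for row in rows:
        groups[row["experiment_id"]].append(row)

    train: List[Record] = []
    test: List[Record] = []
    for exp_rows in groups.values():
        exp_rows.sort(key=lambda r: r["time_s"])   # enforce time order, no shuffling
        cut = int(len(exp_rows) * train_frac)
        train.extend(exp_rows[:cut])               # first 90% of the experiment
        test.extend(exp_rows[cut:])                # last 10% of the experiment
    return train, test
```

With this scheme the ordering guarantee holds within each experiment: every test sample of an experiment is later than all of that experiment's training samples.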
Owing to the high-dimensional and complex non-linear characteristics of the titanium alloy electrochemical corrosive-wear dataset, directly applying a standard LightGBM gradient boosting decision tree model for fitting would not only increase the computational burden but also potentially degrade predictive accuracy through the introduction of irrelevant features. To address these challenges, an enhanced LightGBM approach was developed, as illustrated in the flowchart in
Figure 9. This method constructs a predictive model for the electrochemical surface COF prediction of titanium alloys by incorporating lag features and rolling statistical features. Furthermore, input feature optimization is achieved through a hybrid approach combining Pearson correlation coefficients with LightGBM’s built-in feature importance ranking. As a boosting-based ensemble learning method, GBDT iteratively optimizes and approximates residuals by serially combining multiple decision trees. It utilizes additive models and a forward stagewise algorithm for optimization, employing the negative gradient of the loss function to achieve steepest descent approximation, thereby efficiently addressing high-dimensional non-linear modeling problems. Compared to approaches using only raw instantaneous features, traditional feature selection, or conventional ML models, the proposed method offers two distinct advantages. Temporal Depth: By integrating lag and rolling features, the model fully extracts historical dependencies and local dynamic patterns within the time-series data, significantly enhancing its capability to represent complex, time-varying processes. Efficiency and Robustness: The integration of Pearson correlation filtering effectively reduces feature dimensionality and noise interference. This ensures high predictive precision while simultaneously improving training efficiency, generalization capability, and model interpretability.
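The text does not detail how the Pearson filter and the importance ranking are combined. One plausible reading, sketched here purely as an assumption, is to visit features in order of decreasing importance and keep a feature only if it is not strongly correlated with one already kept. The threshold of 0.95 and the importance scores in the usage note are hypothetical; in practice the scores would come from a trained LightGBM model.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:         # constant column: correlation undefined
        return 0.0
    return cov / (sx * sy)


def select_features(features: Dict[str, List[float]],
                    importance: Dict[str, float],
                    threshold: float = 0.95) -> List[str]:
    """Keep important features, dropping those redundant with a kept feature."""
    kept: List[str] = []
    ranked = sorted(features, key=lambda name: importance[name], reverse=True)
    for name in ranked:
        redundant = any(
            abs(pearson(features[name], features[other])) > threshold
            for other in kept)
        if not redundant:
            kept.append(name)
    return kept
```

For example, if `lag_1` and `roll_mean_3` are almost perfectly correlated, only the one with the higher importance score survives, which reduces dimensionality without discarding the information both carry.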
3.3. Data Description and Hyperparameter Configuration
This study systematically analyzes and quantitatively evaluates the electrochemical corrosion and wear dataset in terms of global statistical properties and data quality. The characteristics examined include total sample size, feature configuration, missing-data ratio, statistical distribution, outlier distribution, and cross-feature correlation. This inspection improves the reliability of model training and supports the reproducibility of subsequent predictive experiments.
The dataset contains 32,470 valid time-series sampling points and five input variables, which are divided into categorical and numerical features according to physical experimental attributes. The electrolyte solution type is treated as a categorical feature, while the remaining four environmental and operational parameters are defined as continuous numerical variables. The friction coefficient is regarded as the core regression target for corrosion wear prediction. All experimental measurements were collected under independent and standardized electrochemical operating conditions. The sufficient sample size satisfies the statistical requirements for reliable machine learning training and performance verification, which further supports the robust evaluation of model generalization capability in complex wear prediction tasks.
To simulate real industrial monitoring scenarios where future tribological parameters are predicted based on historical sequential observations, a strict time-series sequential partitioning strategy is employed in this study. All samples collected from each 60 min independent experiment are divided chronologically without random shuffling. Specifically, the first 90% of temporal data are used for model training and parameter optimization, and the last 10% are reserved as the unseen test set to evaluate temporal extrapolation performance and long-term wear forecasting accuracy. This time-aware partitioning strictly maintains temporal causality, completely avoids potential time-series data leakage, and guarantees the practical applicability and scientific validity of the prediction evaluation results.
The time variable is synchronously recorded at fixed intervals throughout the entire experimental process, forming a complete, evenly distributed, and missing-free time-series index. As a fundamental sequential label, the time variable ensures uniform sampling frequency without requiring complex distribution fitting. Other key control factors, including solution concentration, sliding speed, applied voltage, and ambient temperature, are precisely configured before each test with balanced sample distribution across different gradient levels, satisfying the standard requirements of controlled-variable electrochemical experiments. Statistical results show that the friction coefficient ranges from 0.079 to 0.743, with a mean value of 0.506 and a standard deviation of 0.087. The overall data distribution is approximately symmetric, with no severe skewness or extreme outliers. Such a stable and complete data distribution reflects the full evolutionary process of the titanium alloy friction coefficient, covering both the initial low-friction running-in stage and the subsequent stable high-friction wear stage, thereby providing high-quality sequential samples for accurate surface COF prediction using the LightGBM framework in this study. The specific hyperparameters are shown in
Table 5 and
Table 6.
Based on the aforementioned dataset statistical characteristics and optimal hyperparameter configuration, exhaustive predictive experiments are conducted to further evaluate the model performance and analyze the corresponding results.
3.4. Data-Driven Prediction of Electrochemical Surface Coefficient of Friction
Figure 10 presents the Pearson correlation coefficient heatmap, which quantitatively reveals the linear association intensity between each feature of the electrochemical corrosive-wear experiment and the target variable (friction coefficient). The diagonal elements represent the unit values of variable self-correlation. A strong positive correlation was observed between solution encoding and concentration (r = 0.63), suggesting potential multicollinearity. The target variable, friction coefficient, exhibits moderate to strong positive correlations with time (r = 0.64) and velocity (r = 0.50), indicating that these are the primary linear factors driving the dynamic evolution of the friction coefficient. In contrast, the linear correlations between the friction coefficient and solution encoding, concentration, and voltage are relatively weak (|r| < 0.20). This suggests that their influences may manifest in non-linear forms, providing a critical rationale for the subsequent feature selection and modeling strategies.
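The coefficient underlying such a heatmap can be sketched in a few lines. The example below uses only the standard library and made-up toy vectors (not the experimental data); it also illustrates the point made above that a weak linear correlation does not rule out a strong non-linear dependence.

```python
import math


def pearson_r(x, y):
    """Pearson linear correlation coefficient between two sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return s_xy / math.sqrt(s_xx * s_yy)


# Perfectly linear relationship: r is 1 up to rounding.
r_linear = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])

# Purely quadratic relationship (y = x^2 on a symmetric range): r is 0,
# although y is fully determined by x.
r_quadratic = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
```

This is why features with small |r| are not discarded on the Pearson criterion alone: a tree-based model may still exploit them.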
As illustrated in
Figure 11 and
Figure 12, the conventional LightGBM model, relying solely on the five base features, exhibits poor fitting performance and weak predictive capability. The coefficient of determination (R²) for the training and test sets reached only 0.771 and 0.487, respectively. The predicted data points are highly dispersed, showing significant deviations across the low, medium, and high friction coefficient regions, which indicates a deficient learning capacity regarding the underlying features of the training data. Furthermore, the prediction errors are substantial: the MSE, RMSE, and MAE for the training set are 0.00169, 0.04101, and 0.03091, while those for the test set are 0.00368, 0.06062, and 0.05034, respectively. The root cause of this poor performance lies in the temporal dependencies and hysteresis effects inherent in the friction coefficient (COF) time series. Unlike traditional static tabular data, time-series data are characterized by autocorrelation: the COF at the current moment is strongly correlated with historical observations. This dynamic evolutionary pattern cannot be fully characterized by static experimental condition features alone.
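For readers who wish to reproduce the four reported metrics, a minimal standard-library sketch is given below. The function name and the toy values are illustrative assumptions; they are not the experimental data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for paired observations."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": 1.0 - ss_res / ss_tot,
    }


# Toy COF values (made up for illustration only).
metrics = regression_metrics(
    y_true=[0.40, 0.50, 0.60, 0.50],
    y_pred=[0.45, 0.50, 0.55, 0.50],
)
```

Because RMSE is the square root of MSE, the reported figures can be cross-checked: for the test set, the square root of 0.00368 is approximately 0.0607, in line with the reported RMSE of 0.06062 given rounding of the MSE.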
As a tree-based model founded on the assumption of sample independence, LightGBM, if not explicitly provided with temporal information, can only learn the average trends from external operating conditions. It struggles to capture the inertia, abrupt changes, and periodic fluctuations of the COF itself. Consequently, a model that neglects temporal lag can only output an “average friction coefficient” based on current conditions, failing to predict accelerations or sudden shifts in the COF sequence. This results in a fitting curve that is overly smooth and centered around the mean, leading to significant deviations from the actual fluctuating COF values and explaining the fundamental failure of the base-feature model.
As demonstrated in
Figure 12, the fitting and generalization results of the conventional LightGBM model are suboptimal. This does not stem from an inability to model non-linearity, since tree ensembles are inherently non-linear, but from the absence of temporal information in the five static base features, which cannot represent the history-dependent behavior of electrochemical corrosive-wear data. In contrast, the fitting results incorporating both lag and rolling features are highly satisfactory. A comparison of generalization errors reveals that the enhanced model significantly outperforms the conventional version. Specifically, as shown in
Figure 13, the coefficient of determination (R²) improved to 0.979 for the training set and 0.951 for the test set. Correspondingly, the training-set error metrics (MSE, RMSE, and MAE) decreased to 0.00015, 0.012285, and 0.00946, respectively. For the test set, these values dropped to 0.000348, 0.018653, and 0.013139. The predicted values align closely with the experimental data, exhibiting minimal deviations and remaining stable even in the high-friction regions. These results validate the efficacy of the proposed feature construction and selection strategy, which improves predictive performance and generalization accuracy by adding temporal information and eliminating redundant features.
Given that the friction coefficient is a representative time-series variable with significant autocorrelation, current observations are heavily dependent on historical states. Relying solely on basic experimental features is insufficient to fully characterize its dynamic evolutionary patterns. Consequently, this study developed lag and rolling statistical features, as illustrated in
Figure 13, to explicitly introduce temporal dependency information. This allows the model to capture the short-term inertia, local fluctuations, and temporal patterns of the COF, thereby substantially elevating both fitting and predictive performance.
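The construction of lag and rolling statistical features can be sketched as follows. The lag orders and window length used in the study are not stated in this section, so `lags=(1, 2)` and `window=3` are purely illustrative assumptions, as are the feature names other than `COF_lag1`. The sketch uses only the standard library and should be applied to each experiment separately so that features never span two runs.

```python
import statistics


def add_lag_and_rolling(cof, lags=(1, 2), window=3):
    """Build lag and rolling features for one experiment's COF series.

    Every feature at index t is computed from values at indices < t only,
    so the current target never leaks into its own inputs. Rows without
    enough history are skipped.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to compute a std")
    start = max(max(lags), window)
    rows = []
    for t in range(start, len(cof)):
        past = cof[t - window:t]  # the `window` values strictly before t
        row = {f"COF_lag{k}": cof[t - k] for k in lags}
        row["COF_roll_mean"] = statistics.mean(past)
        row["COF_roll_std"] = statistics.stdev(past)  # sample std
        row["target"] = cof[t]
        rows.append(row)
    return rows


# Toy series (made-up values) from a single hypothetical experiment.
rows = add_lag_and_rolling([0.10, 0.20, 0.30, 0.40, 0.50])
```

Features of this kind presuppose that the preceding COF measurements are available at prediction time, which corresponds to a one-step-ahead monitoring setting.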
Figure 14 illustrates the feature importance ranking after incorporating lag and rolling statistical features. Notable discrepancies exist between these results and those of the conventional LightGBM model shown in
Figure 11, reflecting differences in how each model interprets the data characteristics. According to the importance ranking in
Figure 14, the first-order lag feature of the friction coefficient (COF_lag1) holds a dominant position, with its importance significantly exceeding that of all other features. This finding offers direct evidence of the strong autocorrelation of friction coefficient sequences, indicating that the friction state at the previous moment dominates the real-time variation in friction. It reflects the inherent memory effect and short-term inertia of tribological systems.
Rolling statistical features (such as rolling mean and rolling standard deviation) also demonstrate high importance. This suggests that the friction coefficient exhibits stable trends and fluctuation patterns within local time windows. These features effectively capture the dynamic evolutionary modes of the sequence, providing essential information for the model to identify local extrema and trend transitions.
Temporal features (absolute and relative time) possess moderate importance, indicating that the cumulative effect of the experimental process is a significant factor influencing the long-term variation in the friction coefficient. This aligns with the findings from the Pearson correlation analysis, which showed a strong positive correlation between time and the friction coefficient.
In contrast, the solution encoding feature exhibits a relatively low level of importance. This implies that within the context of this specific experimental dataset, the linear and non-linear impacts of the solution type on the friction coefficient are limited. Its influence is likely partially subsumed by temporal and other operating condition features. This observation provides a logical basis for future feature pruning and model lightweighting efforts.
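One possible way to turn such a ranking into a pruning rule is sketched below. The section does not specify how Pearson coefficients and importance scores are combined, so the rule shown (drop features with a small share of total importance, then drop the weaker member of any highly correlated feature pair), the thresholds `min_share` and `r_max`, and all numbers are hypothetical assumptions for illustration, not the authors' procedure or results.

```python
def select_features(importance, pairwise_r, r_max=0.9, min_share=0.01):
    """Hypothetical hybrid filter.

    importance : dict mapping feature name -> importance score
    pairwise_r : dict mapping (feature_a, feature_b) -> Pearson r
    """
    total = sum(importance.values())
    # Step 1: keep features contributing at least `min_share` of the total.
    kept = {f for f, v in importance.items() if v / total >= min_share}
    # Step 2: for strongly correlated pairs, drop the less important one.
    for (a, b), r in sorted(pairwise_r.items()):
        if a in kept and b in kept and abs(r) > r_max:
            weaker = a if importance[a] < importance[b] else b
            kept.discard(weaker)
    return sorted(kept, key=lambda f: importance[f], reverse=True)


# Made-up scores and correlations, NOT values from the study.
importance = {
    "COF_lag1": 600,
    "COF_roll_mean": 200,
    "COF_lag2": 100,
    "time": 95,
    "solution_code": 5,
}
pairwise_r = {
    ("COF_lag1", "COF_lag2"): 0.97,
    ("COF_lag1", "time"): 0.60,
}
selected = select_features(importance, pairwise_r)
```

With these toy inputs, `solution_code` is removed by the importance threshold and `COF_lag2` by the redundancy check; any real threshold would need to be validated against held-out error.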
Quantitative analysis was performed on each group of results in
Figure 14. Mean squared error (MSE) and the coefficient of determination (R²) were adopted to evaluate the fitting performance and generalization error of the two algorithms, and the comparison results are presented in
Table 7.
It can be observed from
Table 7 that the enhanced LightGBM model achieves favorable fitting capability, with a training R² of 0.979 compared with 0.771 for the conventional model. On the test set, the enhanced model also generalizes distinctly better, with R² rising from 0.487 to 0.951 and MSE falling from 0.00368 to 0.000348.
4. Conclusions
In this study, a machine learning framework was developed to predict the electrochemical surface COF of titanium alloys and to identify the factors influencing it. The research systematically explored material behavior and model predictive capabilities under diverse operating conditions. An enhanced LightGBM model, incorporating lag and rolling statistical features, was constructed and validated using experimental datasets. To evaluate predictive precision, RMSE, MAE, MSE, and R² were employed as core metrics, alongside an analysis of feature importance for variables such as solution concentration, experimental duration, and sliding velocity. The results demonstrate that the modified LightGBM algorithm effectively addresses challenges in modeling high-dimensional, non-linear complex data, exhibiting superior generalization ability and predictive accuracy. The established model accurately fits and predicts the electrochemical surface COF of titanium alloys. Moreover, through feature contribution analysis, the model offers robust interpretability, providing critical data support and a theoretical reference for the optimization of electrochemical machining processes for titanium alloys. The main contributions of this work are threefold:
(1) A systematic dataset of titanium alloy COF under multi-factor electrochemical conditions was constructed, and an enhanced LightGBM model incorporating lag and rolling-window features was proposed to capture the temporal dynamics of the friction coefficient. The enhanced LightGBM model, integrating lag and rolling features, demonstrated the highest overall performance. Feature analysis identified solution concentration as the most critical of the five investigated factors (solution concentration, solution type, voltage, sliding speed, and experimental duration), followed by experimental duration, while sliding velocity exhibited a relatively minor impact.
(2) In the prediction of friction coefficients, the enhanced LightGBM model outperformed conventional models across both training and testing datasets, achieving the lowest prediction errors and superior fitting, particularly in high-friction coefficient regimes. Conversely, the conventional LightGBM model exhibited the poorest performance, characterized by significant prediction errors and an increased number of outliers in the medium-to-high friction ranges, indicating an insufficient capacity to fit complex, non-linear data.
(3) Identifying and ranking the key influencing factors (e.g., solution concentration, test duration) through feature importance analysis, providing actionable insights for process optimization. Feature importance analysis confirmed that solution concentration is the primary driver of friction coefficient variations, yielding the highest contribution scores across all models. Experimental duration followed in importance, though weight evaluations varied by model. While the conventional LightGBM model overemphasized the role of voltage, the enhanced model assigned it a lower weight. Velocity had the least impact overall, despite holding a slightly higher relative importance in the conventional model than solution type. These discrepancies highlight how different model architectures interpret data characteristics, providing a crucial reference for future model optimization and feature selection strategies.
The enhanced LightGBM model we developed—incorporating lag and rolling statistical features tailored to titanium alloy electrochemical corrosive-wear data—effectively performs variable selection and model construction in high-dimensional spaces. This approach is feasible and practical for predicting and analyzing the factors that govern the COF of titanium alloy surfaces in complex electrochemical corrosive environments. While this study concentrates on machine learning-based COF prediction under standardized experimental settings, thorough evaluation of statistical repeatability will be addressed in follow-up research.