3.1. Data Source and Normalization Processing
Data were sourced from high-temperature, high-pressure weight-loss experiments on tubing corrosion. The dataset comprises 8 feature variables and 1 target variable, totaling 220 data points. The experimental data are divided into influencing factors and outcomes. The selected influencing factors are: temperature (T), CO2 partial pressure (PCO2), H2S partial pressure (PH2S), N2 partial pressure (PN2), total pressure (PT), flow velocity (V), corrosion time (Time), and pH value (pH). The outcome is the corrosion rate. The corrosion subjects are three commonly used industrial metals: 2205DSS, N80, and CT80.
Table 1 shows partial examples from the raw corrosion rate dataset. The input features in the current study are limited to the factors currently available and do not yet cover other critical factors that may influence the internal corrosion rate of oil pipelines. The relevant influence mechanisms and multi-factor coupling effects require further research.
Following the data processing methods outlined in Section 2.1, the raw data (which, coming from our weight-loss corrosion experiments, ideally have no missing or outlier values) underwent cleaning operations including missing value imputation and outlier removal. Subsequently, the Min-Max normalization method was applied to map all feature variable values to the [0, 1] range.
Table 2 presents partial examples of the normalized dataset.
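The Min-Max mapping described above can be sketched as follows. This is a minimal illustration, not the processing code used in this study: the function name `min_max_normalize`, the handling of constant columns, and the three example rows are all assumptions made for demonstration.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column of X to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against division by zero for features that never vary (assumed policy).
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (X - col_min) / safe_range

# Hypothetical rows: temperature, CO2 partial pressure, pH (values are made up).
raw = np.array([
    [60.0, 0.5, 4.0],
    [90.0, 1.0, 5.0],
    [120.0, 2.0, 6.0],
])
scaled = min_max_normalize(raw)
print(scaled)
```

Each column is scaled independently, so features with very different physical units end up on the same [0, 1] scale.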
The data distribution histograms are shown in Figure 4 to illustrate the coverage range and distribution of the features more intuitively.
The normalized data were divided using five-fold cross-validation while being stratified by material type. By strictly maintaining the proportion of samples from different materials in each fold identical to the overall dataset, the material distribution characteristics of each fold were ensured to closely match the original data distribution. This approach avoided excessive concentration or absence of any single material type in a particular fold, thereby safeguarding the objectivity and reliability of the model’s prediction results.
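A stratified five-fold assignment of this kind can be sketched with NumPy alone. The helper `stratified_folds`, the random seed, and the per-material sample counts below are illustrative assumptions (only the total of 220 samples comes from the text); scikit-learn's `StratifiedKFold` offers equivalent functionality.

```python
import numpy as np

def stratified_folds(labels, n_splits=5, seed=0):
    """Assign each sample a fold index so every fold keeps the label proportions."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for material in np.unique(labels):
        idx = np.flatnonzero(labels == material)
        rng.shuffle(idx)
        # Deal the shuffled samples of this material round-robin across the folds.
        fold_of[idx] = np.arange(len(idx)) % n_splits
    return fold_of

# Hypothetical material labels; the split 80/70/70 is assumed for illustration.
materials = np.array(["2205DSS"] * 80 + ["N80"] * 70 + ["CT80"] * 70)
fold_of = stratified_folds(materials)

for k in range(5):
    held_out = materials[fold_of == k]
    counts = {m: int(np.sum(held_out == m)) for m in np.unique(materials)}
    print(k, counts)
```

In each round, the samples with `fold_of == k` form the validation fold and the rest form the training set, so every material appears in every fold in the same proportion as in the full dataset.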
3.2. Global Analysis of Features and Coupled Reconstruction
This study focuses on constructing a new dataset through feature coupling and weight reassignment to train and validate models, thereby revealing underlying feature interaction mechanisms. This section builds coupled features by integrating mathematical operations with corrosion domain knowledge, quantifying their synergistic influence beyond independent feature analysis. First, SHAP analysis is applied to the dataset, followed by normalization to obtain the average SHAP values for each independent feature. This enables analysis of their fundamental influence on corrosion rates and preliminary exploration of underlying physical mechanisms. A negative value for a feature indicates an inhibitory effect in the current sample, while a positive value signifies that the presence or intensity of that condition promotes the predicted outcome.
The horizontal bar chart in Figure 5 illustrates the contribution levels of each feature. Results indicate that temperature is the dominant factor controlling internal pipeline corrosion behavior; CO2 partial pressure serves as the key driving force, ranking second; and pH and H2S partial pressure are the third and fourth most significant factors. Other factors, such as flow velocity, total pressure, and time, exert relatively limited influence. This ranking highlights three core features (temperature, CO2 partial pressure, and pH) whose physical mechanisms are relatively well established: temperature primarily governs reaction kinetics and protective film evolution; CO2 partial pressure dominates cathodic depolarization reactions; and pH directly defines the corrosive intensity of the medium [37,38]. Notably, nitrogen, as an inert gas, exhibits corrosion-inhibiting effects by diluting the corrosive gases, thereby reducing their effective concentration at the metal surface [39].
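The aggregation of per-sample SHAP values into a normalized feature ranking can be sketched as follows. This is only an illustration under stated assumptions: the SHAP matrix is made up (it does not reproduce the values behind Figure 5), the helper `summarize_shap` is hypothetical, and normalizing the mean absolute SHAP value so that the shares sum to 1 is one common convention that may differ from the normalization used in this study.

```python
import numpy as np

def summarize_shap(shap_values, feature_names):
    """Return (name, normalized mean |SHAP|, mean signed SHAP), most important first."""
    shap_values = np.asarray(shap_values, dtype=float)
    mean_abs = np.abs(shap_values).mean(axis=0)
    share = mean_abs / mean_abs.sum()        # importance shares that sum to 1
    mean_signed = shap_values.mean(axis=0)   # sign: promoting (+) or inhibiting (-)
    order = np.argsort(share)[::-1]
    return [(feature_names[i], float(share[i]), float(mean_signed[i])) for i in order]

# Hypothetical SHAP matrix (4 samples x 3 features); not the paper's values.
names = ["T", "PCO2", "PN2"]
shap_matrix = np.array([
    [0.6, 0.3, -0.1],
    [0.4, 0.2, -0.1],
    [-0.2, 0.3, -0.1],
    [0.4, 0.0, -0.1],
])
ranking = summarize_shap(shap_matrix, names)
for name, share, signed in ranking:
    print(f"{name}: share={share:.3f}, mean signed SHAP={signed:+.3f}")
```

The share captures how strongly a feature moves the prediction regardless of direction, while the signed mean indicates whether the feature tends to raise or lower the predicted corrosion rate.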
Correlation analysis was performed on all features to identify highly linearly correlated feature pairs, providing a basis for subsequent feature coupling. The Pearson correlation matrix in Figure 6 displays correlations among features, where numerical values represent correlation coefficients, positive/negative signs indicate directionality, and absolute values reflect linear relationship strength. The results show that total pressure and CO2 partial pressure exhibit a strong negative correlation with a coefficient of −0.72, further validating the mechanism of corrosion inhibition through nitrogen injection to increase system total pressure and dilute corrosive gases. Concurrently, CO2 partial pressure and H2S partial pressure, two critical corrosion parameters, show a moderate positive correlation, indicating that they do not act independently in the corrosive environment but exhibit a degree of interdependence.
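Screening for strongly correlated feature pairs can be sketched with `numpy.corrcoef`. The data below are synthetic and deliberately constructed so that one pair is negatively correlated; the helper `strong_pairs` and the 0.7 threshold are illustrative assumptions rather than settings taken from this study.

```python
import numpy as np

def strong_pairs(X, names, threshold=0.7):
    """Return the Pearson matrix and the feature pairs with |r| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are treated as variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return corr, pairs

# Synthetic data (not the experimental dataset).
rng = np.random.default_rng(0)
pco2 = rng.uniform(0.1, 2.0, size=200)
pt = 10.0 - 3.0 * pco2 + rng.normal(0.0, 1.0, size=200)  # built to correlate negatively
v = rng.uniform(0.0, 3.0, size=200)                       # independent of the others
X = np.column_stack([pco2, pt, v])

corr, pairs = strong_pairs(X, ["PCO2", "PT", "V"])
print(np.round(corr, 2))
print(pairs)
```

Pearson's coefficient only measures linear association, so pairs flagged this way are candidates for coupling rather than proof of a physical interaction.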
Building upon the correlations identified in the above analysis, this study next employs the SHAP-Sobol feature weighting method. By calculating the weights of individual features and their coupling terms, we quantify the relative contributions of independent effects and interaction effects in corrosion prediction. For the identified key feature pairs, this study selects the total pressure-temperature and CO2 partial pressure-H2S partial pressure feature sets for coupling and weight quantification analysis. This addresses two typical interaction scenarios: the environmental coupling of total pressure and temperature in actual pipeline service conditions, and the medium synergy between CO2 partial pressure and H2S partial pressure. The goal is to clarify the specific influence intensity of these core interaction terms on corrosion rates.
Table 3 presents the feature prediction weights before and after coupling, while Figure 7 illustrates the comparison between independent and coupled feature prediction weights. Group 1 contains only the original independent features (no coupling terms); Group 2 adds the temperature-total pressure (T-PT) interaction term to the original features; Group 3 adds the CO2-H2S partial pressure (PCO2-PH2S) interaction term; and Group 4 includes both interaction terms.
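The construction of the four feature groups can be sketched as follows. The mathematical form of the coupling terms is not specified in this section, so the sketch assumes, purely for illustration, that each interaction term is the product of the two normalized features; the helper `build_group`, the column naming, and the random inputs are likewise hypothetical.

```python
import numpy as np

FEATURES = ["T", "PCO2", "PH2S", "PN2", "PT", "V", "Time", "pH"]

# Coupling pairs per group, following the group definitions in the text.
GROUPS = {
    1: [],
    2: [("T", "PT")],
    3: [("PCO2", "PH2S")],
    4: [("T", "PT"), ("PCO2", "PH2S")],
}

def build_group(X, names, couplings):
    """Append one product column per (a, b) pair to a copy of X (assumed coupling form)."""
    X = np.asarray(X, dtype=float)
    columns = [X]
    out_names = list(names)
    for a, b in couplings:
        product = X[:, names.index(a)] * X[:, names.index(b)]
        columns.append(product[:, None])
        out_names.append(f"{a}x{b}")
    return np.hstack(columns), out_names

# Hypothetical normalized inputs (5 samples x 8 features).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(5, len(FEATURES)))
datasets = {g: build_group(X, FEATURES, pairs) for g, pairs in GROUPS.items()}

for g, (Xg, cols) in datasets.items():
    print(g, Xg.shape, cols[len(FEATURES):])
```

Keeping the original eight columns unchanged in every group means that any difference in prediction performance between groups can be traced to the added interaction columns.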
A comparative analysis reveals distinct feature prediction weights across each group. The first group exhibits a total feature prediction weight of 1, as it incorporates only independent features without accounting for coupling effects between them. In contrast, the subsequent three groups all yield total weights exceeding 1. This occurs because within the SHAP-Sobol analytical framework, the method independently quantifies two distinct contributions: First, the marginal importance of each feature. Second, the synergistic contribution from feature coupling effects. These contributions neither cancel each other out nor overlap; instead, they manifest as a complex, multi-factor logical superposition characteristic of complex systems. When coupling effects generate additional explanatory power, the total weight may slightly exceed 1. It should be noted that weight normalization is not performed here. The purpose is to preserve the original magnitude of the coupled contribution, facilitating an intuitive comparison of the weight jump before and after coupling. Forced normalization would mask the true degree of enhancement of the coupled term, thereby reducing interpretability. Subsequently, these prediction weights will be embedded into the established model architecture. By comparing prediction results before and after feature coupling, we will intuitively reveal the patterns of how coupling effects influence corrosion prediction.
3.3. Model Hyperparameter Optimization
Before embedding feature prediction weights into the model, it is essential to systematically identify the optimal hyperparameter combination. This ensures that subsequent predictions use a consistent model architecture, so that all variations in results can be attributed solely to the feature weight quantification strategy rather than to hyperparameter tuning. Using Bayesian optimization over multiple iterative rounds, we identified eight high-performance model configurations (Figure 8). The distribution of models within the parameter space exhibits a distinct performance gradient, where color intensity correlates positively with prediction accuracy: darker regions indicate superior model performance under the corresponding parameter settings. This optimization outcome provides reliable candidates for the final choice of hyperparameters. Building on this, we evaluate each model’s generalization capability and stability through the convergence curves of training and validation losses, establishing a more complete basis for finalizing the hyperparameter configuration. During the model testing phase, each configuration was trained and validated on identical datasets with the same training and validation partitions, ensuring comparability among models.
The loss function employed is the Mean Squared Error (MSE). The resulting loss curves are shown in Figure 9. The testing process is strictly confined to 200 boosting rounds, with complete records of training and validation loss variations. Dense scatter plots visually display the loss curve trajectories of each model throughout the training cycle, clearly illustrating the convergence characteristics and overfitting tendencies under different hyperparameter configurations.
Based on the loss curve analysis, all eight models demonstrated excellent convergence characteristics. The training and validation losses of all models decreased rapidly and stably within 200 boosting rounds, with no noticeable signs of overfitting. This synchronized behavior indicates that the models generalize satisfactorily within the tested dataset. Notably, Model 3 achieved the best performance on the validation set, with a smooth and continuously decreasing validation loss curve, demonstrating excellent prediction accuracy and stability. Although Model 8 performed slightly less well, its rapid early-stage convergence remains valuable for scenarios requiring swift model validation. Minor fluctuations observed in some models during late training stages may indicate optimization instability. Although overall model performance is satisfactory, robustness under more extreme operating conditions requires further validation. After a comprehensive evaluation, this study adopts the hyperparameter configuration of Model 3 for subsequent prediction tasks. The remaining two models underwent the same process to select their best configurations, and all subsequent comparative analyses uniformly use these optimally configured models. The specific configurations are shown in Table 4.
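The comparison of loss curves across configurations can be sketched as follows. The two curves are synthetic and the configuration names are hypothetical; the summary statistics (final validation loss, rebound of the validation loss after its minimum, and train-validation gap) are one plausible way to screen for overfitting and are an assumption, not the exact selection criteria used in this study.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between measured and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def summarize_curve(train_loss, val_loss):
    """Condense a pair of loss curves into a few diagnostic numbers."""
    train_loss = np.asarray(train_loss, dtype=float)
    val_loss = np.asarray(val_loss, dtype=float)
    best_round = int(np.argmin(val_loss))
    return {
        "final_val": float(val_loss[-1]),
        "best_round": best_round,
        # A validation loss that rises after its minimum suggests overfitting.
        "rebound": float(val_loss[-1] - val_loss[best_round]),
        "gap": float(val_loss[-1] - train_loss[-1]),
    }

rounds = np.arange(200)  # 200 boosting rounds, as in the text
# Synthetic (train, validation) curves for two hypothetical configurations.
curves = {
    "config_a": (0.9 * np.exp(-rounds / 30) + 0.02,
                 0.9 * np.exp(-rounds / 30) + 0.05),
    "config_b": (0.9 * np.exp(-rounds / 15) + 0.01,
                 0.9 * np.exp(-rounds / 15) + 0.04 + 0.0005 * rounds),
}
summary = {name: summarize_curve(tr, va) for name, (tr, va) in curves.items()}
best = min(summary, key=lambda name: summary[name]["final_val"])
print(best, summary[best])
```

Here `config_b` converges faster but its validation loss drifts upward in later rounds, so the configuration with the lower and still-decreasing final validation loss is preferred.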
3.4. Analysis of Prediction Results
The feature prediction weights from each group were embedded into the selected model framework to predict and analyze the corrosion rates of three tubular steel grades. Results shown in Figure 10, Figure 11 and Figure 12 demonstrate the performance of models incorporating different feature combinations in predicting corrosion rates for the three tubing materials. The error distribution of prediction results reveals significant differences in sensitivity to feature groups across materials due to variations in material properties and corrosion mechanisms. For the 2205DSS duplex stainless steel, the prediction accuracy improved after introducing the gas interaction term, with a noticeable increase in the proportion of data points falling within the 15% error band. This result suggests that even for this corrosion-resistant duplex stainless steel, the synergistic effect of CO2 and H2S still affects the stability of its passive film, further indicating that this coupled feature is key to controlling its corrosion rate. According to the SHAP analysis, the CO2-H2S interaction term has a contribution weight of 0.5171. The model infers that when the partial pressure ratio of the two gases increases, the passive film stability decreases, leading to a higher predicted corrosion rate; conversely, a lower rate is predicted. This behavior is consistent with changes in the film breakdown potential observed in electrochemical tests.
CT80 carbon steel exhibits a stronger dependence on the T-PT coupling feature. After introducing this feature combination, the model prediction error decreases significantly and the high-accuracy prediction points become more concentrated, reflecting the important role of temperature and pressure in the corrosion behavior of this material. The model assigns the highest weight (0.5452) to the T-PT coupling feature, and the predictions show that the corrosion rate increases sharply under high-temperature and high-pressure conditions. The interpretability analysis indicates that temperature accelerates the reaction kinetics while pressure increases gas solubility, jointly promoting corrosion; therefore, temperature control should be prioritized in the field. It is worth noting that although features such as time and flow velocity appear in all combinations, the introduction of the T-PT coupling term also enhances the synergistic representation capability of these fundamental parameters.
The predicted corrosion rate results for N80 steel show a distribution distinct from those of CT80 and 2205DSS, with the prediction accuracy strongly depending on the interaction of corrosive gases. Specifically, in the low-to-medium corrosion rate region (<4.0 mm/a), the model relies mainly on the CO2-H2S interaction term for prediction, and the predicted points lie densely near the ideal line, agreeing well with the measured values. However, when the actual corrosion rate exceeds 4.0 mm/a, the predicted points begin to deviate significantly from the ideal line, forming a distinct divergence zone. This phenomenon suggests that as the corrosion intensity increases, the corrosion mechanism of N80 steel may gradually shift from a CO2-H2S interaction-dominated mode to a composite mechanism governed by more localized factors such as pitting and flow field disturbance. The interpretability analysis indicates that the divergence at high corrosion rates stems from the absence of features describing corrosion product film rupture in the current model; therefore, relevant factors should be introduced in the future to improve prediction reliability in the high-corrosion-rate range.
Figure 13 displays the comparison profiles between predicted and actual values for the three materials. The predicted values uniformly adopt the model prediction results embedded with the fourth set of feature prediction weights. Overall, the model demonstrates good predictive accuracy on the test set in predicting corrosion rates for all three materials. The predicted curves for N80 and CT80 exhibit slight fluctuations relative to the actual curves, while the predicted curve for 2205DSS shows a smoother, more stable trend. This variation correlates positively with the inherent complexity of corrosion behavior and data variability for each material. Collectively, these findings demonstrate the model’s ability to effectively capture the distinct corrosion characteristics of different materials, with prediction results possessing potential engineering reference value.
Table 5 compares the predictive performance metrics of models with different feature configurations. The coefficient of determination R2 reached a maximum of 0.98, indicating that the model possesses strong predictive capability for oil pipe corrosion rates without overfitting. Regarding error analysis, both MAE and RMSE remain within ideal ranges. The prediction error for 2205DSS steel is the lowest, while those for CT80 and N80 steels are slightly higher but still within acceptable limits. Overall, the results demonstrate that the constructed model possesses excellent generalization capability and engineering practicality, providing reliable data support and a decision-making basis for tubing material selection and corrosion protection.
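The evaluation metrics discussed here (R2, MAE, RMSE) and the share of points inside the 15% error band can be computed as in the following sketch. The measured and predicted values are made up for illustration, the helper `regression_metrics` is hypothetical, and the band is assumed to be defined relative to the measured value.

```python
import numpy as np

def regression_metrics(y_true, y_pred, band=0.15):
    """Compute R2, MAE, RMSE and the fraction of points within a relative error band."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "R2": float(1.0 - ss_res / ss_tot),
        "MAE": float(np.mean(np.abs(residual))),
        "RMSE": float(np.sqrt(np.mean(residual ** 2))),
        # Relative band around the measured value (assumed definition).
        "within_band": float(np.mean(np.abs(residual) <= band * np.abs(y_true))),
    }

# Hypothetical measured and predicted corrosion rates in mm/a.
measured = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.1, 1.9, 3.0, 5.0])
metrics = regression_metrics(measured, predicted)
print(metrics)
```

RMSE penalizes the single large miss more heavily than MAE does, which is why reporting both gives a fuller picture of the error distribution.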
For the remaining two prediction models, this study compares their predictive performance on the overall dataset with that of the core XGBoost model. Figure 14 visualizes the prediction results of the three models on the same test set as scatter plots of predicted versus true values.
As can be observed from the figure, the prediction points of XGBoost are the most concentrated, basically distributed along the diagonal line within the corrosion rate range of 0–5 mm/a, indicating that its prediction error is small and uniformly distributed. The prediction points of Random Forest show some dispersion in the low-value range (0–2 mm/a), but still maintain a good tracking trend in the high-value range. The prediction points of Gaussian Process Regression (GPR) are relatively more dispersed, with noticeable fluctuations especially in the intermediate value region, yet it still effectively captures the overall trend of corrosion rate variation. When further optimizing the prediction model, one may consider introducing more discriminative feature combinations tailored for high corrosion rate conditions to enhance the model’s adaptability across the full rate range and its engineering guidance value.
It is worth noting that the confidence interval output capability of GPR gives it unique value in corrosion risk early warning. As shown in Figure 15, GPR not only provides point estimates on the test set but also supplies prediction confidence intervals that vary with features. In regions where data are sparse or operating conditions are abnormal (e.g., the right end of the figure), the confidence interval widens significantly, indicating reduced reliability of the model prediction, which should be carefully interpreted in conjunction with field experience. In contrast, XGBoost and Random Forest only output single predicted values and cannot quantify such uncertainty, making GPR an important tool for corrosion risk assessment.
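Why the GPR interval widens where data are sparse can be illustrated with a small from-scratch sketch. It assumes a zero-mean Gaussian process with a squared-exponential (RBF) kernel and fixed, hand-picked hyperparameters on a one-dimensional toy problem; the kernel choice, the noise level, and the data are illustrative assumptions and do not reproduce the GPR model configured in this study.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel with unit variance between rows of A and B."""
    sq_dist = (np.sum(A ** 2, axis=1)[:, None]
               + np.sum(B ** 2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale ** 2)

def gpr_predict(X_train, y_train, X_test, noise=1e-2, length_scale=1.0):
    """Posterior mean and standard deviation of a zero-mean GP (latent function)."""
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test, length_scale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = rbf_kernel(X_test, X_test, length_scale).diagonal() - np.sum(v ** 2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Toy 1-D data: training points cover only the range [0, 0.6].
X_train = np.linspace(0.0, 0.6, 13)[:, None]
y_train = np.sin(4.0 * X_train[:, 0])
X_test = np.array([[0.3], [1.5]])  # inside vs. far outside the training range

mean, std = gpr_predict(X_train, y_train, X_test, length_scale=0.3)
lower, upper = mean - 1.96 * std, mean + 1.96 * std  # approximate 95% interval
for x, lo, hi in zip(X_test[:, 0], lower, upper):
    print(f"x={x:.1f}: interval width={hi - lo:.3f}")
```

The interval at the point far from the training data is much wider than at the interior point, which mirrors the behavior described for the sparse region of the figure.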
Table 6 presents the comparison results of various evaluation indicators for the prediction performance of the three machine learning models.
Table 7 shows the time spent on training and validation for the three models.
Overall, the XGBoost model demonstrates the best accuracy in corrosion rate prediction, with its coefficient of determination (R2), root mean square error (RMSE), and mean absolute error (MAE) all significantly outperforming those of the RF and GPR models. Although the training time of the XGBoost model is slightly longer than that of the RF and GPR models, its prediction speed is extremely fast, enabling rapid and accurate identification of abnormal fluctuations in corrosion rate based on real-time operating parameters, making it suitable for deployment in online monitoring systems. The RF model has the fastest training speed but slightly lower prediction accuracy, making it suitable for rapid modeling or preliminary exploration scenarios. Although the GPR model lacks sufficient accuracy, it provides uncertainty intervals for predictions and can serve as a supplementary tool for risk assessment and decision support. In practical applications, the model can be flexibly selected according to requirements for accuracy, modeling speed, and uncertainty evaluation.
3.5. Discussion on the Applicability and Limitations of This Study
The model developed in this study applies to three pipe materials (2205DSS, CT80, N80) and provides corrosion rate predictions that combine accuracy and interpretability under the tested complex operating conditions. For other types of corrosion behavior, the established comprehensive prediction process may remain applicable, provided that relevant feature data reflecting their mechanisms can be obtained. This offers preliminary evidence to help identify key corrosion factors and formulate control strategies, suggesting potential for extension.
However, this study has several limitations. First, model training relies on existing feature variables and does not yet encompass other potential influencing factors; therefore, its generalization capability under novel or extreme conditions requires further validation, preferably with independent datasets. Second, the model’s inclusion of an attention mechanism and interpretability analysis modules results in higher computational complexity compared to traditional prediction methods. Finally, differences in feature responses across materials indicate that constructing a fully universal prediction model remains challenging, and adaptation to specific material systems is necessary.
Within the current dataset, the proposed interpretable corrosion prediction model supports the formulation of pipeline integrity management strategies, potentially enabling a shift from reactive maintenance to proactive early warning. This approach may enhance system safety and environmental risk prevention capabilities while reducing operational costs, although these benefits have not yet been demonstrated in live field applications. Future work will continue validating and optimizing this method across broader material systems and corrosion scenarios, further integrating mechanistic models with field data to build a more robust and efficient corrosion prediction platform, with the ultimate goal of achieving wider practical applicability.