1. Introduction
During long-term operation, pipelines are subjected to corrosion, wear, and aging, leading to a gradual reduction in structural integrity. This deterioration compromises operational safety and ultimately results in the scrapping and decommissioning of a large number of oil and gas pipelines. The residual oil within these decommissioned pipelines contains various toxic substances, such as sulfides and heavy metals, which, if leaked, can cause severe air and soil pollution. In addition, residual oil is highly flammable, and prolonged accumulation may pose a significant fire hazard [1]. Cleaned pipelines also possess significant recycling value. Once thoroughly cleaned and inspected, end-of-life pipelines can be recycled as scrap metal, generating economic returns. Moreover, fully cleaned pipelines can be repurposed for transporting gases or liquids such as hydrogen, carbon dioxide, or industrial water, enabling efficient reuse. With the growing demand for oil and gas transportation, the total length of pipelines worldwide has reached approximately 2.5 million kilometers. As a large number of pipelines continue to be decommissioned, the scale and urgency of cleaning operations present a serious challenge that must be effectively addressed.
In the cleaning process of waste oil and gas pipelines, physical methods are typically employed first, such as high-pressure water jetting and mechanical cleaning. High-pressure water jetting removes most loose dirt and solid impurities by impacting the pipeline’s inner wall with high-speed water flow. Mechanical cleaning uses specialized tools to brush and scrape sludge and deposits within the pipeline. Although these physical methods are effective at eliminating loose contaminants, they often fail to thoroughly clean stubborn oil, sludge, and chemically corrosive substances inside the pipeline [2].
To ensure thorough cleaning, physical and chemical methods are often combined in practical pipeline cleaning processes. Chemical cleaning agents react with contaminants to dissolve or break down grease and deposits, facilitating their removal. By using targeted chemical agents, chemical cleaning not only offers a more comprehensive solution for various pipeline contaminants but also enables more efficient cleaning in complex geometries or confined spaces [3].
The selection of chemical cleaning agents is critical during chemical cleaning. The characteristics of oil and gas resources and the types of pipeline contaminants vary significantly across regions, requiring tailored cleaning agents based on specific conditions. For instance, in high-viscosity crude oil areas like Venezuela and the Canadian oil sands, crude oil contains substantial amounts of asphalt, wax, and heavy hydrocarbons; in such cases, strong organic solvent-based cleaning agents are necessary. In regions such as the Middle East and North Africa, where sulfur content is higher and oil contamination is more corrosive, specialized cleaning agents with enhanced corrosion resistance are required. Choosing the appropriate cleaning agent not only improves cleaning effectiveness but also ensures pipeline safety and integrity throughout the cleaning process.
From the above analysis, it is clear that chemical cleaning is essential for achieving thorough cleaning of waste oil and gas pipelines. During chemical cleaning, the selection of cleaning agents and optimization of their components are key factors for ensuring efficient and complete cleaning. The success of chemical cleaning depends heavily on precisely formulating cleaning agents tailored to the specific contamination characteristics of the oil and gas pipeline. However, screening and optimizing cleaning agent components is challenging due to the vast selection space created by multiple variables, complex nonlinear interactions among factors, varying cleaning requirements, and external conditions. Scholars have conducted extensive research on cleaning agent component screening as outlined below.
In 2008, Li Y et al. used orthogonal and empirical experimental methods to screen and formulate a compound sludge cleaning agent [4]. The experiments took more than 20 h to determine the ratio of the sludge cleaning agent, and the accuracy rate after first-level cleaning reached 96.2% [4].
In 2012, Jiaqiang Jing et al. found that the large amounts of oil and natural gas remaining in abandoned pipelines posed a potential safety hazard, and carried out cleaning experiments with single and compounded chemicals using the empirical experimental method; indoor experiments showed that the single chemical had the optimal cleaning effect [5].
In 2020, Yang, Fajie et al. found that cleaning agents on the market were not suited to the high wax content of Chinese pipelines and developed a hydrophilic cleaning agent through comprehensive and empirical experimental methods; indoor experiments and field applications showed that this cleaning agent had strong cleaning ability [6].
The complexity of the cleaning process variables means that the full-scale (comprehensive) experimental method cannot be carried out in most cases; mostly, orthogonal and empirical experimental methods are used. The accuracy of the cleaning agents screened in the literature is high, although the reported accuracies are hardly comparable because the research problems and accuracy evaluation criteria differ. The advantages and disadvantages of the comprehensive, orthogonal, and empirical experimental methods, analyzed from the standpoint of experimental design principles, are shown in Table 1.
As can be seen in Table 1, the main drawback of the previous methods is that accuracy depends on the investment of experimental time and on the experimental data. This does not mean, however, that investing more time is sufficient to obtain correspondingly higher accuracy. In fact, when the relationships between variables are more complex, it is difficult to capture the interactions between variables and their subtle effects on the results through scattered experimental points alone.
Using traditional methods, researchers are often able to formulate effective cleaning agents through extensive experimentation. In other words, such methods are feasible when time cost is not a constraint. However, a more critical limitation lies in the fact that traditional experimental approaches typically allow only qualitative analysis of variables, lacking the ability for quantitative evaluation. As a result, it is difficult to uncover underlying mechanisms, and the applicability of traditional methods to other contexts becomes limited. This highlights the urgent need to introduce mathematical models or domain-specific knowledge with clear mechanisms and explicit formulations. In recent years, researchers have improved traditional approaches by incorporating single-variable analysis and expert-based methods.
In 2007, Rocha et al. developed a thermochemical method using their expertise to remove wax sediments and residual crude oil from oil pipelines [7]. The results were good in indoor experiments, and compared with traditional methods, this method is efficient, safe, and low-cost [7].
In 2024, Liu et al. addressed the secondary pollution caused during pipeline cleaning [8]. They used professional knowledge and single-factor analysis to screen the cleaning agent, combined it with thermochemical cleaning to test the cleaning rate, and ultimately replaced the surfactant with a new emulsion [8].
To make the investigation more comprehensive, screening methods for petroleum engineering working fluids were also surveyed.
In 2021, Kang conducted single-factor analysis to address the tendency of conventional well repair fluids to degrade in Shunbei high-temperature oil and gas wells; using clean water as the base liquid, 1.5% laboratory-synthesized resin particles as the temporary plugging agent, and polyazine heterocyclic compounds as stabilizers, a flexible polymer gel well repair fluid was developed [9]. Indoor experiments showed that the dynamic viscosity of the well repair fluid can be maintained at 20–140 mPa·s at 180 °C, with a good sand-carrying effect and significant shear-thinning behavior [9].
In 2023, in order to ensure the development effect of chemical flooding, Guan Wenting et al. used indoor physical simulation experiments and numerical simulation technology, combined with the professional knowledge method, to optimize the system concentration; indoor oil displacement experiments showed a recovery rate 1.9% higher than that of the original weak-alkali composite system [10].
The advantages and disadvantages of the professional knowledge method and the single-factor experimental method are shown in Table 2.
As can be seen from Table 2, the two experimental methods currently in common use are still essentially traditional experimental methods, each with a certain scope of application and limitations. The expertise method is applicable to problems with clearer theories but may not be effective when dealing with novel or not yet fully understood systems, while the single-factor experimental method makes it difficult to analyze the interactions between multiple factors when many variables are involved.
The above analysis shows that expert knowledge methods and single-factor experimental methods can, to some extent, quantify the influence of variables. Under appropriate conditions, expert-based approaches are capable of constructing multivariable mathematical models, while single-factor methods can assess the impact of individual variables. However, as application scenarios evolve and new problems arise, expert-based methods often reveal significant limitations, even though single-factor methods remain relatively unaffected. As a result, the combination of these two methods fails to achieve a comprehensive quantitative analysis of the effect of multiple variables on the target function. Nevertheless, the introduction of mathematical methods still offers valuable insights.
Traditional methods, expert knowledge, and single-factor analysis—whether used individually or in combination—fail to overcome the technical barrier of quantitative analysis for multiple variables. Although mathematical modeling offers a feasible approach for quantitative analysis, in practical scenarios with numerous variables and complex relationships, simple mathematical methods still have limited applicability in model construction and factor selection.
Therefore, returning to the core issue, the essence of evaluating cleaning effectiveness for abandoned pipeline cleaning agents lies in clarifying the interactions among multiple variables, identifying key influencing factors, and selecting dominant ones. This enables uncovering the intrinsic relationships between cleaning agent performance and various influencing factors from a higher-dimensional perspective. Big data analysis methods have emerged in this context, leveraging their strengths in data mining, feature extraction, and pattern recognition to provide novel approaches for identifying and modeling dominant factors in complex systems, thus laying a solid foundation for the optimization and mechanistic understanding of abandoned pipeline cleaning agents.
However, investigations reveal that the application of big data methods in the field of abandoned pipeline cleaning remains limited, especially in process optimization and cleaning agent selection, with few related studies available for direct reference. In contrast, big data techniques have been widely applied and proven effective in other areas of petroleum engineering. Based on this, a systematic review of relevant case studies has been conducted to abstract the methodological logic and technical approaches, thereby demonstrating the feasibility and suitability of applying big data methods to cleaning agent selection for abandoned pipelines.
In 2005, Zheng Lihui et al. found that meeting the performance requirements of oil and gas well working fluids required larger treatment agent dosages, which made cost reduction less apparent [11]. They measured experimental data on working fluid performance with two kinds of treating agents. Based on the quadratic distribution characteristics of the data, they established mathematical relationship equations for the experimental data and an optimization model by applying the nonlinear multivariate least squares method and optimization theory. After many field applications, this new cost-control method has shown remarkable economic benefits [11].
In 2016, Jinfeng et al., to optimize oil and gas working fluids, used the Multiple Regression Experimental Design (MRED) method to establish relationships between test indicators and multiple factors, eliminating human interference [12]. Applied to fuzzy ball drilling and completion fluids with varying densities and viscosities, the method produced optimal formulations that were successfully implemented in the field, demonstrating its effectiveness in multifactor optimization [12].
In 2019, Zhijuan et al. carried out sensitivity experiments to identify the cause of the sudden drop in fluid production in SHB-X wells; the maximum damage rate of 38.4%, however, did not match field observations and was not the main cause, and the drilling fluids were well matched, so the damage was insufficient to stop production [13]. Analyzing 100 days of data from seven wells, they found that the 7880 m well depth and high oil recovery rate triggered wax formation and asphaltene plugging, which reduced the fluid production index by more than 60% and was the main controlling factor. Controlling daily fluid production is therefore the key to stabilizing production in deep carbonates [13].
In 2021, Xiangchun et al., in order to quantitatively characterize the relationship between drilling fluid performance and reservoir damage and to guide the optimization of drilling fluids, used big data methods such as multivariate regression and stripping algorithms. They obtained a mathematical model in which apparent viscosity, density, dynamic-to-plastic ratio, and pH were the main influences; the model accurately reflected the relationship between drilling fluid performance and the degree of damage and successfully guided optimization in the field, improving daily gas production by about 800 m³, a remarkable effect [14].
In 2022, Yang prepared four oil-based drilling fluids of the same composition and different densities in the laboratory based on the field formula, explored the influence of temperature and pressure on the density of oil-based drilling fluids, and established a temperature–pressure binary regression mathematical model of oil-based drilling fluid density [15]. The accuracy of the model was verified using oil-based drilling fluids of different densities in the field. The results show high consistency between predicted and measured values, with an average prediction accuracy of 97.93%, which meets the needs of field use [15].
In 2023, in order to accurately guide the adjustment of drilling fluid density under underbalanced drilling conditions, Zhang et al. adopted a method of adjusting fluid density while drilling based on logging parameters by analyzing the response characteristics of gas well logging in buried hill formations [16]. At the same time, from a statistical point of view, the correlation between bottomhole pressure difference and gas measurement parameters was proposed, and the Caofeidian M structure drilling fluid density adjustment model was established, which realized the timely and accurate adjustment of drilling fluid density during drilling [16]. In the same year, Zhang et al. addressed the challenge of evaluating unsteady inflow dynamics without bottomhole pressure or explicit multifactor equations [17]. Through theoretical derivation and multifactor fitting, they developed a calculation model relating bottomhole pressure to time, stress sensitivity coefficient, skin factor, cumulative production, and threshold pressure gradient. The results show that the model achieved an accuracy of 82.3–94.76% from the initial production stage to the pressure stabilization stage [17].
Based on the above survey of big data application cases, it can be found that the selection of cleaning agents for abandoned pipelines is essentially consistent with the problems in other fields, with the following specific similarities.
- (1)
They all involve the influence of multiple variables—for example, cleaning agent performance is jointly affected by factors such as temperature, concentration, and residue composition—making the problem isomorphic to multiparameter optimization issues like drilling fluid and completion fluid formulation.
- (2)
The objective function requires a mathematical mapping between variables and target performance indicators to achieve quantitative correlation and prediction among multiple parameters, enabling performance prediction and parameter optimization.
- (3)
Traditional methods have clear limitations in terms of experimental cost, efficiency, and accuracy, while big data methods can improve screening efficiency and reduce trial-and-error through historical data analysis and mathematical modeling.
Therefore, although the application of big data methods in the selection of cleaning agents for abandoned pipelines currently lacks sufficient literature support and direct case studies, the above analysis shows that this type of problem shares a high degree of consistency in structural characteristics and modeling logic with other engineering problems where big data has been successfully applied. Thus, there is both a theoretical basis and practical feasibility for introducing big data methods into cleaning agent selection. This approach is expected to overcome the limitations of traditional methods in terms of efficiency and accuracy and significantly enhance the precision and speed of the screening process.
2. Laboratory Research
2.1. Variable Identification and Data Acquisition
Before commencing the work, a theoretical analysis is conducted to classify the variables involved in the research problem. These variables are categorized into controllable, uncontrollable, and semi-controllable based on their degree of controllability, and the specific variables relevant to the problem are clearly identified.
2.2. Standardization or Normalization of Underlying Data
Due to the presence of physical dimensions, inconsistent units can significantly affect the modeling process. Therefore, standardization is applied to eliminate the impact of dimensional differences on subsequent calculations.
The base data matrix is A (with elements $a_{ij}$, defined as the variables), and the standardized matrix is B (with elements $b_{ij}$). The standard deviation normalization formula is shown in Equation (1).

$$b_{ij} = \frac{a_{ij} - \mu_j}{\sigma_j} \quad (1)$$

In Equation (1), $\mu_j$ is the mean of the $j$-th variable and $\sigma_j$ is its standard deviation (the square root of the variance).
After data standardization, the influence of dimensional units on the data scale is eliminated. However, if the standardized data still shows large deviations or poor concentration, model instability may occur during fitting. Therefore, normalization is required for data that is too small, overly concentrated, or dispersed. The min-max normalization formula is shown in Equation (2).
$$b'_{ij} = \frac{b_{ij} - b_{j,\min}}{b_{j,\max} - b_{j,\min}} \quad (2)$$

In Equation (2), $b_{j,\max}$ is the maximum value of the data and $b_{j,\min}$ is the minimum value of the data.
After normalization is complete, the dataset is converted to new data with size distribution of [0, 1] and dimensionless, suitable for comprehensive modeling analysis.
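For illustration, the preprocessing of this step can be sketched in Python as follows; the matrix layout (rows are experimental cases, columns are variables) and the sample values are assumptions made only for this example.

```python
import numpy as np

def standardize(A: np.ndarray) -> np.ndarray:
    """Z-score standardization, Equation (1): b_ij = (a_ij - mean_j) / std_j."""
    return (A - A.mean(axis=0)) / A.std(axis=0)

def min_max_normalize(B: np.ndarray) -> np.ndarray:
    """Min-max normalization, Equation (2): maps each column onto [0, 1]."""
    b_min, b_max = B.min(axis=0), B.max(axis=0)
    return (B - b_min) / (b_max - b_min)

# Hypothetical raw data: columns could be stirring speed, contact time (min),
# temperature (deg C), and citric acid dosage (g).
A = np.array([[300.0, 30.0, 25.0, 5.0],
              [600.0, 60.0, 45.0, 10.0],
              [900.0, 90.0, 65.0, 15.0]])
X = min_max_normalize(standardize(A))
print(X)  # dimensionless data distributed in [0, 1]
```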
2.3. Objectives and Single-Factor Analysis
Due to the limited number of experimental cases obtained from laboratory-based pipeline cleaning tests and the presence of numerous influencing factors, directly incorporating all variables into a regression model often leads to inaccurate results. Therefore, it is necessary to first apply single-factor linear regression to establish regression equations between the dependent variable and each individual independent variable. This approach helps identify the relationship between each influencing factor and the dependent variable, enabling effective variable selection. The general form of the linear equation obtained is shown in Equation (3).
$$Y = a + bX_i \quad (3)$$

In Equation (3), $Y$ is the dependent variable, $a$ is the intercept, $b$ is the regression (slope) coefficient, and $X_i$ is the independent variable.
Alternatively, Pearson correlation analysis can be conducted to assess the positive or negative correlation between the target variable and individual factors [18]. This allows for a more targeted approach in performing multiple regression analysis by selecting only the most relevant factors. The Pearson correlation formula is shown in Equation (4).

$$r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} \quad (4)$$

In Equation (4), $r_{xy}$ is the correlation coefficient; $\mathrm{cov}(x, y)$ is the covariance of $x$ and $y$; and $\sigma_x$, $\sigma_y$ are the standard deviations.
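As an illustration of this screening step, the sketch below fits the single-factor regression of Equation (3) and reports the Pearson coefficient and p-value of Equation (4) for each candidate factor; the factor names and data are hypothetical.

```python
import numpy as np
from scipy import stats

def single_factor_screen(X, y, names):
    """For each factor, fit Y = a + b*X_i (Eq. 3) and report r and p (Eq. 4)."""
    for j, name in enumerate(names):
        b, a, r, p, _ = stats.linregress(X[:, j], y)
        print(f"{name}: Y = {a:.3f} + {b:.3f}X, r = {r:.3f}, p = {p:.3f}")

# Hypothetical normalized data: 20 cases, 3 candidate factors.
rng = np.random.default_rng(0)
X = rng.random((20, 3))
y = 0.6 * X[:, 1] + 0.3 * X[:, 2] + 0.1 * rng.random(20)
single_factor_screen(X, y, ["stirring speed", "contact time", "temperature"])
```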
2.4. Establishment of an Optimal Mathematical Model
Linear regression analysis is a method used to study the quantitative relationships among objective phenomena. Since these relationships are often complex and interdependent, the variation of a particular factor is typically influenced by two or more other variables. Based on newly acquired data, a multiple regression model can be developed to establish a mathematical relationship between the target variable and all relevant factors, enabling in-depth and systematic analysis [19].
The general form of the multiple regression model is shown in Equation (5).

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k \quad (5)$$

In Equation (5), $Y$ is the dependent variable; $\beta_0$ is the intercept; $\beta_1, \beta_2, \ldots, \beta_k$ are the parameters of the regression equation; and $X_1, X_2, \ldots, X_k$ are the $k$ independent variables.
To identify the most suitable fitting model, the regression methods employed in this study typically include linear regression, ridge regression, lasso regression, and stepwise regression [20].
There are many indicators reflecting the accuracy and error of a regression model, such as MSE, RMSE, MAE, and R² [21]; the weighted assessment index is chosen for this method. The general form of the weighted assessment index is shown in Equation (6).

$$I_w = w_1 \cdot \mathrm{MSE} + w_2 \cdot \mathrm{RMSE} + w_3 \cdot \mathrm{MAE} + w_4 \cdot (1 - R^2) \quad (6)$$

In Equation (6), the weighted evaluation index combines multiple error metrics, namely Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and the complement of the coefficient of determination (1 − R²), with weights $w_1$ through $w_4$, to provide a more comprehensive assessment of model performance than any single error metric. A higher weighted evaluation index indicates larger overall prediction errors and a poorer model fit, whereas a lower index indicates smaller prediction errors and a better fit. Accordingly, the weighted evaluation index can be used to select the regression model with superior goodness of fit and minimal error.
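The model comparison step can be sketched as follows; because the exact weights of Equation (6) are not reproduced here, equal weights are assumed purely for illustration, and the training data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def weighted_index(y_true, y_pred, w=(0.25, 0.25, 0.25, 0.25)):
    """Weighted assessment index in the spirit of Equation (6); lower is better.
    The equal weights are an illustrative assumption."""
    mse = mean_squared_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return w[0] * mse + w[1] * np.sqrt(mse) + w[2] * mae + w[3] * (1.0 - r2)

# Synthetic stand-in for the normalized experimental dataset.
rng = np.random.default_rng(1)
X = rng.random((30, 6))
y = X @ np.array([0.10, 0.20, 0.15, 0.25, 0.20, 0.30]) + 0.05 * rng.random(30)

models = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0),
          "lasso": Lasso(alpha=0.01)}
for name, model in models.items():
    model.fit(X, y)
    print(name, round(weighted_index(y, model.predict(X)), 4))
```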
2.5. Identifying Main Controlling Factors and Modeling
Common methods for identifying the main controlling factors include the elimination method, the contribution rate elimination method, and the expert knowledge method.
Elimination methods are stepwise procedures used to remove irrelevant or redundant variables, commonly applied in regression analysis. By excluding variables that contribute minimally to the dependent variable or exhibit high multicollinearity with other predictors, these methods simplify the model and enhance its interpretability and predictive performance. The process involves sequentially evaluating the importance of each variable, often using information criteria such as the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). Typically, the regression begins with a full model including all variables, then progressively eliminates nonsignificant or highly collinear variables until only the most explanatory ones remain. Elimination can be implemented via techniques such as stepwise regression, ridge regression, and lasso regression, where variable removal is guided by statistical significance levels or coefficient penalization terms [22].
The contribution rate method is used to identify the primary controlling factors by quantifying each factor’s contribution to the overall outcome. Commonly applied in statistical analyses such as analysis of variance (ANOVA) [23] and principal component analysis (PCA), it measures the influence magnitude of different factors on the dependent variable. The basic procedure involves calculating contribution rates based on the optimal regression model and ranking variables according to their contributions. Subsequently, variables are set to zero one by one, and the model is refitted with the remaining variables to recalculate and reorder their contribution rates. If the relative ranking of a variable’s contribution remains unchanged, it indicates that the factor has minimal effect on the outcome or its influence has been substituted by other variables, allowing it to be classified as a nonprimary factor and removed. This iterative zeroing and refitting process continues until the primary factors with significant influence on the result are selected. Through this stepwise elimination, a simplified model is ultimately obtained that contains only the key controlling factors with substantial impact [24]. The contribution rate can be calculated according to Equation (7).
$$C_i = \frac{|\beta_i|}{\sum_{j=1}^{k} |\beta_j|} \quad (7)$$

In Equation (7), the numerator is the absolute value of the regression coefficient of the $i$-th variable, and the denominator is the sum of the absolute values of all the regression coefficients.
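A minimal sketch of Equation (7), with hypothetical regression coefficients, is given below; in the full procedure these rates would be recomputed after each zeroing-and-refitting round.

```python
import numpy as np

def contribution_rates(coefs):
    """Equation (7): each coefficient's share of the total absolute coefficient mass."""
    abs_coefs = np.abs(np.asarray(coefs))
    return abs_coefs / abs_coefs.sum()

# Hypothetical coefficients from the optimal regression model.
beta = [0.42, -0.05, 0.31, 0.18, -0.02, 0.27]
for i, c in enumerate(contribution_rates(beta), start=1):
    print(f"x{i}: contribution = {c:.1%}")
# Variables whose contributions stay low and rank-stable across refits are
# classified as nonprimary factors and removed.
```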
The expert knowledge method involves selecting key variables and constructing models based on the experience and expertise of domain specialists. This approach relies heavily on a deep understanding of the industry, technology, or discipline, and depends on experts’ judgments to determine which factors are important and which can be disregarded. Domain experts identify critical factors by leveraging practical experience, theoretical knowledge, historical data, industry standards, and existing research findings to assess which variables may significantly influence the target variable. Unlike statistical methods, the expert knowledge method emphasizes practical application and industry practices, integrating experiential insights and theoretical guidance in variable selection and analysis.
2.6. Back-Calculate to Obtain the Optimal Solution
The process of reverse-calculating the optimal formulation involves selecting appropriate mathematical methods to derive formulations that meet the target criteria. The specific approach depends on factors such as problem complexity, the number of variables, and the characteristics of the objective function. For continuous and differentiable objective functions, several commonly used optimization methods are available, each suitable for different scenarios.
The first category is linear programming (LP), which is an optimization technique for linear objective functions. Its goal is to find the optimal solution—typically the maximum or minimum value of the objective function—while satisfying a set of linear constraints [25].
The objective function of the linear programming problem can be expressed as Equation (8).

$$\max \ (\text{or} \ \min) \quad z = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n \quad (8)$$

In practical applications, usually $x_i > 0$. The other constraints are shown in Equation (9).

$$\sum_{j=1}^{n} a_{ij} x_j \leq b_i, \quad i = 1, 2, \ldots, m \quad (9)$$
Linear programming methods are applicable only when both the objective function and the constraints are linear. The problem can be solved by finding the extremum of a multivariate function. For problems involving a large number of variables or constraints, computational tools such as MATLAB R2024a (24.1.0.2537033) are often employed to obtain the solution.
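A minimal Python sketch of such a linear program, using SciPy's linprog in place of MATLAB, is shown below; all cost and constraint coefficients are hypothetical illustration values.

```python
from scipy.optimize import linprog

# Minimize cost c^T x subject to A_ub @ x <= b_ub and bounds on x,
# the standard form of Equations (8) and (9).
c = [2.0, 3.5, 1.2]            # hypothetical unit costs of three components
A_ub = [[-1.0, -2.0, -1.5]]    # encodes 1.0*x1 + 2.0*x2 + 1.5*x3 >= 10
b_ub = [-10.0]                 # (a ">=" row is negated into "<=" form)
bounds = [(0, 15)] * 3         # nonnegative dosages with an upper limit

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(res.x, res.fun)          # optimal dosages and the minimum cost
```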
The second category is nonlinear programming (NLP), where at least one of the objective function or constraints is nonlinear. Unlike linear programming, NLP allows for complex nonlinear relationships in the objective function and constraints, making it suitable for addressing a wider range of real-world problems such as production processes, chemical reactions, and physical phenomena [26].
There are several common nonlinear programming solution methods, such as gradient descent and the method of Lagrange multipliers.
- (1)
Gradient Descent
Gradient descent is an iterative optimization algorithm used to find the minimum of a function. Widely applied in machine learning, deep learning, and mathematical optimization, gradient descent adjusts model parameters to minimize the loss function. Simply put, the core idea of gradient descent is to iteratively move in the direction of the steepest descent of the function to locate its minimum. This direction is determined by computing the gradient of the function at the current point [27].
The gradient is a vector that indicates the direction of the greatest rate of change in a function’s magnitude. For a differentiable univariate function, the gradient corresponds to its derivative. For example, when seeking the maximum value, if the derivative at a point is positive, it indicates that the maximum lies to the right of that point. For continuously differentiable multivariate functions, the gradient is a vector composed of partial derivatives. The positive direction of the gradient points toward the direction of the steepest ascent of the function, while the opposite direction points toward the steepest descent.
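The following short sketch implements plain gradient descent on a two-variable function whose gradient is known in closed form; the test function, learning rate, and tolerance are illustrative choices.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=10_000):
    """Repeatedly step opposite the gradient until the update is negligible."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - lr * grad(x)
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

# Example: f(x, y) = (x - 1)^2 + 2*(y + 3)^2 has its minimum at (1, -3);
# its gradient is (2*(x - 1), 4*(y + 3)).
grad_f = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 3)])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))  # approx. [1.0, -3.0]
```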
- (2)
Lagrange Multipliers
The method of Lagrange multipliers is a classical approach designed to handle problems with equality constraints. Its core idea is to introduce Lagrange multipliers to transform a constrained optimization problem into an unconstrained problem involving the Lagrangian function’s stationary points. If the original objective function has an extremum under the constraints, then the constructed Lagrangian function must also have an extremum (with corresponding variable values equal). Consequently, the extremum points of the objective function under constraints lie among the stationary points of the Lagrangian function [28].
After constructing the Lagrangian function, the stationary points of this new function are solved to find the solution. Since the method of Lagrange multipliers is specifically designed for optimization problems with equality constraints, it may not be suitable for certain practical engineering problems. Therefore, improvements and extensions based on the classical Lagrange method are often necessary.
The limitation of the Lagrange multiplier method arises because its stationary points occur only on the constraint surface, where equality constraints are satisfied. When the constraints involve inequalities, the Lagrange method cannot explore the values of the objective function within the feasible region, potentially causing the extremum to lie inside the constraint boundaries. To address such problems, one can first find the unconstrained extrema of the objective function, then check whether these extrema lie within the feasible region. Extremal points outside the constraint boundaries are discarded, and only those inside are considered. Finally, the values of these interior extrema are compared with the objective function values at the stationary points of the Lagrangian to determine the overall optimum.
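For the equality-constrained case described above, the stationary points of the Lagrangian can be found symbolically; the sketch below uses SymPy on a simple illustrative problem (minimizing x² + y² on the line x + y = 1).

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
f = x**2 + y**2          # objective function
g = x + y - 1            # equality constraint g(x, y) = 0

# Build the Lagrangian and solve grad(L) = 0 for the stationary points.
L = f - lam * g
stationary = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(stationary)               # [{x: 1/2, y: 1/2, lam: 1}]
print(f.subs(stationary[0]))    # constrained minimum value: 1/2
```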
Depending on the specific practical problem, an appropriate method from those described above is selected. By incorporating variable ranges or constraint conditions, the optimal value of the objective function can be determined, which in turn defines the values of the variables. This process facilitates the optimization of formulations, cleaning parameters, and other controllable factors [29].
In summary, the specific steps and theoretical basis of the big data screening method have been described in detail. To visually present the technical route of this study, enhance the understanding of its logical flow, and improve the operability and applicability of the big data approach, a flowchart has been developed based on the context of abandoned pipeline cleaning agent selection, as shown in Figure 1.
3. Methodological Applications
Samples were collected from six regions—Dalian, Zhuozhou, Langfang, Jinzhou, Qinhuangdao, and Shenyang. Using citric acid as the cleaning agent, the dosage and related cleaning parameters were varied. The relationship between the cleaning efficiency of the cleaning agent and geological factors, as well as cleaning parameters, was investigated. Based on the aforementioned methods, the optimal cleaning agent dosage and cleaning parameters were determined.
3.1. Variable Determination and Experimental Data Acquisition
The contradiction between the adhesion of sediments and the cleaning ability of the cleaning agent is a major concern during pipeline cleaning. To resolve this contradiction effectively, key variables such as clay mineral content, whole-rock mineral content, cleaning agent composition, contact time, flow rate, and temperature are given central consideration, as these factors directly determine the effectiveness of cleaning.
Clay mineral content and total rock mineral content are the key variables determining sediment adhesion. Clay minerals possess strong adsorptive properties, making sediments difficult to remove, while the hardness and density of total rock minerals further increase removal difficulty. The type and structure of sediments directly influence the interaction between the cleaning agent and sediments. Therefore, understanding sediment composition is crucial for selecting suitable cleaning agents and determining cleaning methods. This forms the foundation for optimizing cleaning efficiency.
The dosage of the cleaning agent is crucial for the success of chemical cleaning. Different types of deposits—such as oil residues, mineral scales, or water scale—require specific amounts of cleaning agent for effective removal. The cleaning agent reacts with the deposits to weaken their adhesion to the pipeline wall, facilitating the removal of deposits from the inner surface. Contact time, flow rate, and temperature are key operational variables affecting the cleaning efficiency. A longer contact time ensures sufficient penetration and reaction between the cleaning agent and the deposits; a higher flow rate enhances the mechanical scouring effect and helps remove dissolved contaminants; and a suitable temperature improves the solubility and reaction rate of the cleaning agent, which is especially effective for high-viscosity oil and waxy substances. Therefore, adjusting these factors can significantly enhance cleaning performance.
In the pipeline cleaning process, numerous other variables exist, such as the pipe material, inner diameter, wall thickness, and geometric configuration, all of which can influence the cleaning procedure to some extent. However, their effects are relatively minor, especially regarding the interaction between the cleaning agent and the deposits, where they do not directly determine the cleaning agent’s performance or sediment removal efficiency. For example, pipe material may affect cleaning outcomes, but in chemical cleaning, the composition of the deposits and their reaction with the cleaning agent are more critical. While pipe diameter and wall thickness influence the selection and arrangement of cleaning equipment, they do not directly determine sediment removal efficiency. Additionally, pipe curvature and geometry may impact the choice of cleaning methods but do not significantly alter the effectiveness of the cleaning agent, particularly in the context of fluid flow and chemical reactions. Therefore, the influence of these other variables is more indirect and can often be simplified in practical applications, allowing focus on the most critical variables.
In practical applications, focusing on the most critical variables enhances the efficiency and precision of cleaning operations, while less significant variables can be reasonably neglected under specific conditions. This simplification of the cleaning strategy contributes to improved overall operational efficiency.
In summary, the relevant variables can be categorized as follows:
- (1)
Uncontrollable residue components: hornblende, clay minerals, hard gypsum, barite, hematite, zeolite, magnesite, pyrite, dolomite, plagioclase, calcite, potassium feldspar, quartz, n-octane, n-nonane, n-decane, nC11–nC35;
- (2)
Semi-controllable variables: stirring speed, contact time, test environment temperature, citric acid dosage.
In this study, steel sheets of similar size and mass (thin steel sheets) were used to simulate pipeline walls for residue cleaning evaluation tests. To mimic the formation environment, the steel sheets were first heated at 200 °C for 30 min in an electric blast drying oven and then removed. Residues of similar mass were sequentially adhered to the steel sheets, which were weighed after cooling to room temperature; the mass before cleaning was recorded as $W_1$. Subsequently, the steel sheets were immersed in the test cleaning agent, with variations in temperature, stirring speed, and immersion time. The mass after cleaning was recorded as $W_2$, which was used to calculate the cleaning efficiency. The cleaning rate is calculated as shown in Equation (10).

$$\eta = \frac{W_1 - W_2}{W_1} \times 100\% \quad (10)$$

In Equation (10), $\eta$ is the cleaning rate, %; $W_1$ is the mass of the residue-coated steel sheet before cleaning, g; and $W_2$ is the mass of the residue-coated steel sheet after cleaning, g.
3.1.1. Cleaning Rate of the Citric Acid Cleaning Agent on Inorganic Residue Components from Different Regions
In the laboratory, a muffle furnace was used to remove volatile components from the residues at high temperatures. After this high-temperature volatilization treatment, the residue changed from a clay-like state to a hard, blocky form. The hardened residue was then ground using a grinder and subsequently sieved through standard sieves to obtain particles sized between 200 and 400 mesh, which were used as test samples. Finally, the sieved residues were analyzed for whole-rock composition using X-ray diffraction (XRD), determining the contents of minerals including amphibole, clay minerals, gypsum, barite, hematite, zeolite, magnesite, pyrite, dolomite, plagioclase, calcite, potassium feldspar, and quartz from six different regions. The cleaning efficiency of the cleaning agents was tested separately on residues from different regions, and the experimental data were recorded, as shown in Table 3.
As can be seen from Table 3, among the samples from different regions, the cleaning rates of Qinhuangdao and Zhuozhou are relatively high, 13.02% and 12.17%, respectively, corresponding to higher hornblende contents, which tentatively suggests that hornblende may help to enhance the cleaning effect. In contrast, the cleaning rate of the Shenyang sample is negative (−2.12%), which is not physically meaningful in practice and may stem from experimental error, data recording bias, or problems with the benchmark setting of the cleaning rate; the reliability of this data point is therefore questionable, and it should be treated with caution or excluded from the analysis. Nonetheless, the Shenyang sample, with a high zeolite content (65.2%), still had a low cleaning rate, a trend similar to that of the other samples, suggesting that zeolite may not facilitate the cleaning process and may even have an inhibitory effect under certain conditions. In addition, clay minerals only appeared in samples with low cleaning rates, such as Dalian (0.42% cleaning rate) and Shenyang, further suggesting that they may have a detrimental effect on cleaning. Taken together, hornblende shows some positive effect, while clay minerals and zeolite may negatively affect the cleaning rate, and the remaining minerals have relatively insignificant effects.
3.1.2. Cleaning Rate of the Citric Acid Cleaning Agent on Organic Residue Components from Different Regions
The organic components of the residues in each region were analyzed by gas chromatography, from which the contents of the organic components in the residues were derived, as shown in Table 4.
As shown in Table 4, there are significant differences in the distribution of n-alkanes among samples from different regions, particularly in the high carbon number range (e.g., above nC30). The Langfang samples exhibit notably higher contents in the high carbon number interval (nC31–nC35), with nC35 reaching 15.68%, indicating heavier components and longer chain lengths. In contrast, the Shenyang samples show elevated levels in the medium to low carbon range (e.g., nC10–nC15), especially nC13 and nC15, which reach 8.49% and 8.71%, respectively, suggesting a predominance of medium-chain alkanes. Meanwhile, the distributions in samples from Qinhuangdao, Zhuozhou, Jinzhou, and Dalian are more balanced, featuring a more uniform carbon number distribution between C10 and C25, and relatively fewer heavy components. Notably, Dalian samples have high contents of low-carbon components (e.g., n-nonane, n-decane) but relatively low levels of components above C30, with nC35 at only 7.63%. Overall, Langfang samples trend heavier, Shenyang samples are enriched in medium-chain alkanes, and other regions exhibit a more balanced distribution. These differences may reflect distinct hydrocarbon generation mechanisms influenced by varying geological origins or physicochemical conditions.
3.1.3. Cleaning Efficiency of Citric Acid Cleaning Agent Under Different Formulation Parameters
A temperature-controlled magnetic stirrer was used to evaluate the cleaning efficiency of the detergent on residues from different regions under varying stirring speeds, contact times, and temperatures. The experimental data were recorded and are presented in Table 5.
3.2. Standardization and Normalization of Experimental Data
Most experimental data have dimensions, and under different dimensions, the influence of variables on the results varies. To eliminate the impact of dimensions on the fitting process, data were standardized and normalized using Equations (1) and (2).
3.3. Pairwise Variable Analysis Between Citric Acid Cleaning Efficiency and Each Single Variable
3.3.1. Relationship Between the Cleaning Efficiency of Citric Acid Cleaning Agent and the Inorganic Components of the Residues
Pearson correlation analysis and t-tests were used to examine the relationship between the inorganic components of the residues and the cleaning efficiency. The Pearson correlation coefficients and the significance levels (p-values) for the total inorganic content of the residues are shown in Figure 2.
As shown in Figure 2, the p-values for the significance of the relationships between the cleaning rate and the individual inorganic components are all greater than 0.05, indicating that these components do not have a statistically significant direct impact on the cleaning rate when analyzed individually. Therefore, a comprehensive analysis is needed, and in pipeline cleaning modeling these inorganic components can be treated collectively for analytical purposes. In terms of correlation, the cleaning rate showed a positive correlation with the individual contents of hornblende, galena, and calcite, with correlation coefficients of 0.6533, 0.2796, and 0.5762, respectively. This may be attributed to their solubility or potential synergistic reactions with the cleaning agent. The weak positive correlation with galena suggests that it might play a limited auxiliary role in cleaning through ion exchange. In contrast, the cleaning rate exhibited negative correlations with the contents of clay minerals, hard gypsum, barite, hematite, magnesite, pyrite, dolomite, plagioclase feldspar, potassium feldspar, and quartz, with correlation coefficients of −0.5471, −0.6832, −0.3559, −0.5157, −0.3872, −0.3878, −0.3609, −0.3872, and −0.3609, respectively. Among them, hard gypsum and clay minerals showed the strongest inhibitory effects, which may be due to their reactions with water that produce precipitates or their ability to adsorb active ingredients in the cleaning agent, thereby reducing cleaning efficiency.
3.3.2. Relationship Between the Cleaning Rate of Citric Acid Detergents and the Organic Components of the Residue
Similarly, the relationship between the organic components of the residue and the cleaning rate was analyzed using Pearson correlation analysis and t-tests, and the Pearson correlation coefficients of the total organic components of the residue and the p-values of the significance levels were obtained, as shown in Figure 3.
As shown in Figure 3, the p-values for the significance of the relationships between the organic components of the residues and the cleaning rate are all greater than 0.05, indicating that, from a single-factor analysis perspective, these organic components do not have a statistically significant direct impact on the cleaning rate. Therefore, a comprehensive analysis is required, and in pipeline cleaning modeling these organic components can be treated as a whole. In terms of correlation, n-nonane (0.03), nC18 (0.024), and nC19 (0.037) exhibited weak positive correlations with the cleaning rate, which may be negligible or related to a stabilizing effect from dissolution. Compounds such as n-octane (0.114), n-decane (0.07), nC11 (0.148), nC12 (0.184), nC13 (0.219), nC14 (0.182), nC15 (0.231), nC16 (0.149), and nC17 (0.119) showed stronger positive correlations. This suggests that the cleaning agent may have a stronger solubilizing effect on short- to medium-chain alkanes, thereby improving cleaning efficiency.
3.3.3. Relationship Between the Cleaning Efficiency of Citric Acid Cleaning Agent and Stirring Speed, Contact Time, Temperature, and Cleaning Agent Dosage
Using Pearson correlation analysis, it was found that agitation speed, contact time, temperature, detergent dosage, and cleaning rate all showed statistically significant relationships at the 0.001 level. The correlation coefficients for agitation speed (0.482), contact time (0.623), ambient test temperature (0.59), and detergent dosage (0.626) all indicated positive correlations with the cleaning rate. This suggests that these variables can be included as independent variables in the modeling equation.
However, since the cleaning rate is influenced by multiple factors, and the correlation coefficients between each individual variable and the cleaning rate are not particularly high, single-factor fitting yields limited predictive power. Therefore, greater emphasis should be placed on the interaction effects among these variables when modeling the cleaning rate.
3.4. Establishment of the Optimal Multivariate Mathematical Model for the Cleaning Efficiency of Citric Acid Cleaning Agent
Based on the above analysis, the inorganic components of residues from the six regions were collectively treated as one variable (x1), while the organic components were treated as another variable (x2). Stirring speed (x3), cleaning time (x4), cleaning temperature (x5), and the dosage of citric acid cleaning agent (x6) were taken as four additional variables. Using SPSS 29.0.2.0(20), all experimental data from the six regions were modeled with cleaning efficiency as the dependent variable, and the organic and inorganic components, experimental operational parameters, and cleaning agent dosage as independent variables. Four regression methods were applied: linear regression (least squares), ridge regression, Lasso regression, and stepwise regression.
When modeling with linear regression (least squares), the F-test results show a significance p-value of 0.160, indicating no statistical significance at the conventional level. Therefore, the null hypothesis that the regression coefficients are zero cannot be rejected, and the model is considered invalid.
Regarding multicollinearity among variables, the Variance Inflation Factor (VIF) values for cleaning time and cleaning temperature exceed 10, indicating the presence of multicollinearity. Therefore, ridge regression, stepwise regression, and Lasso regression were used for modeling. The evaluation parameters of these models are presented in Table 6.
According to the data in Table 6, using Equation (6) to select the optimal model, the model evaluation indices for ridge regression, stepwise regression, and Lasso regression were found to be 2.36, 2.19, and 1.68, respectively. This indicates that the Lasso regression model has better goodness of fit and smaller error than the other two models. Therefore, the Lasso regression equation was chosen as the optimal model for the citric acid cleaning agent, as shown in Equation (11).

In Equation (11), X1 is the inorganic content, %; X2 is the organic content, %; X3 is the agitation speed, m³/h; X4 is the contact time, min; X5 is the cleaning temperature, °C; and X6 is the amount of citric acid cleaning agent added, g.
3.5. Elimination-Based Identification of the Main Controlling Factors Affecting the Cleaning Efficiency of Citric Acid Cleaning Agent
Since Lasso regression introduces an L1 regularization term to penalize the coefficients of variables, it tends to produce a sparse model in which only a small number of features have nonzero coefficients. This characteristic aligns well with the goal of variable selection. Therefore, when the dataset contains many features but only a portion is truly relevant, Lasso can effectively identify the most important variables. As a result, the Lasso model inherently possesses variable elimination capability, making it unnecessary to perform additional variable elimination to identify the main controlling factors.
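A brief illustration of this built-in selection behavior, on synthetic data in which only two of six factors truly matter, is sketched below.

```python
import numpy as np
from sklearn.linear_model import Lasso

# The L1 penalty shrinks the coefficients of irrelevant factors exactly to
# zero, so the fitted model performs variable elimination on its own.
rng = np.random.default_rng(7)
X = rng.random((40, 6))
y = 0.8 * X[:, 1] + 1.2 * X[:, 5] + 0.05 * rng.random(40)

model = Lasso(alpha=0.05).fit(X, y)
print(np.round(model.coef_, 3))  # near-zero entries mark eliminated variables
```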
3.6. Inverse Calculation of the Optimal Citric Acid Dosage for Maximum Cleaning Efficiency Under Cost Constraints
The equations obtained from Lasso regression are linear equations with constant coefficients. When a specific target cleaning rate is set, if the rank of the augmented matrix equals the rank of the coefficient matrix but is less than the number of variables, the system has infinitely many solutions, so a unique formulation cannot be determined by solving the linear system alone. However, if cost constraints are introduced—usually in the form of linear equations—each feasible solution corresponds to a specific cost. By optimizing the cost, the formulation with the lowest cost can be determined.
To achieve the lowest cost formulation of retired oil and gas pipeline cleaning agent components while meeting a specific cleaning rate target (Y), this study introduces particle swarm optimization (PSO) to search for the optimal solution based on the predictive model constructed from Lasso regression. The PSO algorithm offers advantages such as fast convergence, ease of implementation, and suitability for continuous parameter space optimization, making it well suited for solving the multivariable nonlinear optimization problem in this study.
- (1)
Design of the Optimization Objective Function
In the Lasso regression model, a linear expression has been determined, as shown in Equation (12).

$$\hat{Y} = \beta_0 + \sum_{i=1}^{6} \beta_i x_i \quad (12)$$

In Equation (12), $x_i$ denotes the dosage of the $i$-th component or operating parameter, $\beta_i$ is the corresponding regression coefficient, and $\beta_0$ is the intercept term. In this study, the objective function is to minimize the total cost of the cleaning agent, where the unit price corresponding to each variable is $c_i$. The cost function can be expressed as Equation (13).

$$\mathrm{Cost}(x) = \sum_{i=1}^{6} c_i x_i \quad (13)$$

To ensure that the optimization results are acceptable in engineering applications, a further constraint is introduced requiring the cleaning rate predicted by the regression model to satisfy the target value $Y_{\mathrm{target}}$ set by the user, as expressed in Equation (14).

$$\hat{Y}(x) \geq Y_{\mathrm{target}} \quad (14)$$
If a group of configurations x fails to satisfy the above condition, a penalty cost is assigned to it such that cost is equal to positive infinity, so that combinations that do not satisfy the cleaning rate requirement are automatically excluded from the optimization process.
- (2)
PSO solving process setting
In particle swarm optimization, each particle represents a parameter configuration x = (x1, ..., x6) to be evaluated, which is continuously updated in the parameter space according to its velocity and position to converge to the optimal solution. The flow of the algorithm is as follows:
Initialization phase: Set the number of particles (population size) to 100 and the maximum number of iterations to 1000; each particle is randomly initialized in the six-dimensional parameter space, with the search range determined based on the physical or engineering boundaries of each variable.
Fitness evaluation: The configuration corresponding to each particle is evaluated for cost using the objective function. If the predicted cleaning rate constraint is satisfied, the calculated cost is returned; otherwise, the value is set to positive infinity.
Iterative search: During the iteration process, particles continuously update their positions and velocities, guided by their individual historical best positions (pBest) and the global best position (gBest), gradually converging toward the optimal parameter combination.
Multiple runs: Considering the randomness of the PSO algorithm, to improve the stability and diversity of the solutions, this study repeats the optimization process 10 times and selects the top 10 optimal configurations with the lowest costs as the recommended results.
- (3)
Boundary Settings and Parameter Definitions
During the optimization process, the upper and lower bounds of each parameter variable are set based on laboratory data and engineering experience. For example, the detergent dosage (X6) is set within the range of [0, 15]%, and the cleaning temperature (X5) is set within [0, 70] °C. The process is implemented through programming in Python 3.13.5, allowing users to dynamically adjust the boundary values of organic/inorganic content according to the residue properties via a graphical interface, thereby achieving optimal cleaning formulation selection.
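A condensed Python sketch of the PSO procedure described above is given below; the regression coefficients, unit prices, variable bounds, and target cleaning rate are placeholder values, since the fitted coefficients of Equation (12) and the field prices are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder model (Eq. 12), unit prices (Eq. 13), and bounds; substitute
# the fitted Lasso coefficients and real prices in practice.
beta0, beta = 0.5, np.array([0.02, 0.03, 0.01, 0.04, 0.05, 0.60])
prices = np.array([0.0, 0.0, 0.10, 0.05, 0.20, 1.50])
lo = np.zeros(6)
hi = np.array([100.0, 100.0, 10.0, 120.0, 70.0, 15.0])
Y_target = 12.0  # required cleaning rate, % (Eq. 14)

def cost(x):
    """Total cost, with an infinite penalty when the target is missed."""
    return prices @ x if beta0 + beta @ x >= Y_target else np.inf

def pso(n_particles=100, n_iter=1000, w=0.7, c1=1.5, c2=1.5):
    X = rng.uniform(lo, hi, size=(n_particles, 6))   # particle positions
    V = np.zeros_like(X)                             # particle velocities
    p_best, p_cost = X.copy(), np.array([cost(x) for x in X])
    g_best = p_best[np.argmin(p_cost)]
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, 6))
        V = w * V + c1 * r1 * (p_best - X) + c2 * r2 * (g_best - X)
        X = np.clip(X + V, lo, hi)                   # respect variable bounds
        costs = np.array([cost(x) for x in X])
        better = costs < p_cost
        p_best[better], p_cost[better] = X[better], costs[better]
        g_best = p_best[np.argmin(p_cost)]           # global best so far
    return g_best, p_cost.min()

best_x, best_cost = pso()
print("recommended configuration:", np.round(best_x, 2), "cost:", round(best_cost, 2))
```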