Article

Screening Decommissioned Oil and Gas Pipeline Cleaners Using Big Data Analytics Methods

1 North Pipeline Company of the National Pipeline Network Group, Langfang 065099, China
2 College of Petroleum Engineering, China University of Petroleum (Beijing), Beijing 102249, China
* Author to whom correspondence should be addressed.
Energies 2025, 18(13), 3496; https://doi.org/10.3390/en18133496
Submission received: 26 May 2025 / Revised: 25 June 2025 / Accepted: 30 June 2025 / Published: 2 July 2025
(This article belongs to the Special Issue Enhanced Oil Recovery: Numerical Simulation and Deep Machine Learning)

Abstract

Traditional methods, such as full-factorial, orthogonal, and empirical experiments, show limited accuracy and efficiency in selecting cleaning agents for decommissioned oil and gas pipelines. They also lack the ability to quantitatively analyze the impact of multiple variables. This study proposes a data-driven optimization approach to address these limitations. Residue samples from six regions, including Dalian and Shenyang, were analyzed for inorganic components using XRD and for organic components using GC. Citric acid was used as a model cleaning agent, and cleaning efficiency was tested under varying temperature, agitation, and contact time. Key variables showed significant correlations with cleaning performance. To further quantify the combined effects of multiple factors, multivariate regression methods such as multiple linear regression and ridge regression were employed to establish predictive models. A weighted evaluation approach was used to identify the optimal model, and a method for inverse prediction was proposed. This study shows that, compared with traditional methods, the data-driven approach improves accuracy by 3.67% and efficiency by 82.5%. By efficiently integrating and analyzing multidimensional data, this method not only enables rapid identification of optimal formulations but also uncovers the underlying relationships and combined effects among variables. It offers a novel strategy for the efficient selection and optimization of cleaning agents for decommissioned oil and gas pipelines, as well as broader chemical systems.

1. Introduction

During long-term operation, pipelines are subjected to corrosion, wear, and aging, leading to a gradual reduction in structural integrity. This deterioration compromises operational safety and ultimately results in the scrapping and decommissioning of a large number of oil and gas pipelines. The residual oil within these decommissioned pipelines contains various toxic substances, such as sulfides and heavy metals, which, if leaked, can cause severe air and soil pollution. In addition, residual oil is highly flammable, and prolonged accumulation may pose a significant fire hazard [1]. Cleaned pipelines also possess significant recycling value. Once thoroughly cleaned and inspected, end-of-life pipelines can be recycled as scrap metal, generating economic returns. Moreover, fully cleaned pipelines can be repurposed for transporting gases or liquids such as hydrogen, carbon dioxide, or industrial water, enabling efficient reuse. With the growing demand for oil and gas transportation, the total length of pipelines worldwide has reached approximately 2.5 million kilometers. As a large number of pipelines continue to be decommissioned, the scale and urgency of cleaning operations present a serious challenge that must be effectively addressed.
In the cleaning process of waste oil and gas pipelines, physical methods are typically employed first, such as high-pressure water jetting and mechanical cleaning. High-pressure water jetting removes most loose dirt and solid impurities by impacting the pipeline’s inner wall with high-speed water flow. Mechanical cleaning uses specialized tools to brush and scrape sludge and deposits within the pipeline. Although these physical methods are effective at eliminating loose contaminants, they often fail to thoroughly clean stubborn oil, sludge, and chemically corrosive substances inside the pipeline [2].
To ensure thorough cleaning, physical and chemical methods are often combined in practical pipeline cleaning processes. Chemical cleaning agents react with contaminants to dissolve or break down grease and deposits, facilitating their removal. By using targeted chemical agents, chemical cleaning not only offers a more comprehensive solution for various pipeline contaminants but also enables more efficient cleaning in complex geometries or confined spaces [3].
The selection of chemical cleaning agents is critical during chemical cleaning. The characteristics of oil and gas resources and the types of pipeline contaminants vary significantly across regions, requiring tailored cleaning agents based on specific conditions. For instance, in high-viscosity crude oil areas like Venezuela and the Canadian oil sands, crude oil contains substantial amounts of asphalt, wax, and heavy hydrocarbons; in such cases, strong organic solvent-based cleaning agents are necessary. In regions such as the Middle East and North Africa, where sulfur content is higher and oil contamination is more corrosive, specialized cleaning agents with enhanced corrosion resistance are required. Choosing the appropriate cleaning agent not only improves cleaning effectiveness but also ensures pipeline safety and integrity throughout the cleaning process.
From the above analysis, it is clear that chemical cleaning is essential for achieving thorough cleaning of waste oil and gas pipelines. During chemical cleaning, the selection of cleaning agents and optimization of their components are key factors for ensuring efficient and complete cleaning. The success of chemical cleaning depends heavily on precisely formulating cleaning agents tailored to the specific contamination characteristics of the oil and gas pipeline. However, screening and optimizing cleaning agent components is challenging due to the vast selection space created by multiple variables, complex nonlinear interactions among factors, varying cleaning requirements, and external conditions. Scholars have conducted extensive research on cleaning agent component screening as outlined below.
In 2008, Li Y et al. used orthogonal and empirical experimental methods to screen and compound sludge cleaning agents [4]. The experiments took more than 20 h to determine the formulation ratio, and the accuracy after first-stage cleaning reached 96.2% [4].
In 2012, Jiaqiang Jing et al. found that large amounts of oil and natural gas remained in abandoned pipelines, posing a potential safety hazard. Using the empirical experimental method, they carried out cleaning experiments with single and compounded chemicals, and laboratory tests showed that the single chemical had the best cleaning effect [5].
In 2020, Yang Fajie et al. found that cleaning agents on the market were unsuited to the high wax content of Chinese pipelines. They developed a hydrophilic cleaning agent using the comprehensive and empirical experimental methods, and laboratory experiments and field applications showed that this cleaning agent had strong cleaning ability [6].
The complexity of the cleaning process variables means that the full-factorial experimental method is infeasible in most cases, so orthogonal and empirical experimental methods are mostly used. The cleaning agents screened in the literature achieve high accuracy, although the reported accuracies are hardly comparable because the research problems and the accuracy evaluation criteria differ. Nevertheless, the advantages and disadvantages of the comprehensive (full-factorial), orthogonal, and empirical experimental methods can be analyzed from the standpoint of experimental design principles, as shown in Table 1.
As can be seen from Table 1, the main drawback of these methods is that their accuracy depends on the time invested and the amount of experimental data. This does not mean that investing enough time guarantees a corresponding accuracy: when the relationships between variables are complex, scattered experimental points can hardly capture the interactions among variables or their subtle effects on the results.
Using traditional methods, researchers are often able to formulate effective cleaning agents through extensive experimentation. In other words, such methods are feasible when time cost is not a constraint. However, a more critical limitation lies in the fact that traditional experimental approaches typically allow only qualitative analysis of variables, lacking the ability for quantitative evaluation. As a result, it is difficult to uncover underlying mechanisms, and the applicability of traditional methods to other contexts becomes limited. This highlights the urgent need to introduce mathematical models or domain-specific knowledge with clear mechanisms and explicit formulations. In recent years, researchers have improved traditional approaches by incorporating single-variable analysis and expert-based methods.
In 2007, Rocha et al. developed a thermochemical method, drawing on their expertise, to remove wax deposits and residual crude oil from oil pipelines [7]. The method performed well in laboratory experiments and, compared with traditional methods, is efficient, safe, and low-cost [7].
In 2024, Liu et al. addressed the secondary pollution caused during pipeline cleaning [8]. They used professional knowledge and single-factor analysis to screen the cleaning agent, tested its cleaning rate in combination with thermochemical cleaning, and ultimately replaced the surfactant with a new emulsion [8].
To make the investigation more comprehensive, screening methods for petroleum engineering working fluids were also surveyed.
In 2021, Kang conducted single-factor analysis to address the easy degradation of conventional workover fluids in the high-temperature oil and gas wells of Shunbei. Using clean water as the base liquid, 1.5% laboratory-synthesized resin particles as the temporary plugging agent, and polyazine heterocyclic compounds as the stabilizer, a flexible polymer gel workover fluid was developed [9]. Laboratory experiments showed that the dynamic viscosity of this workover fluid can be maintained at 20–140 mPa·s at 180 °C, with a good sand-carrying capacity and significant shear-thinning behavior [9].
In 2023, to ensure the development performance of chemical flooding, Guan Wenting et al. combined laboratory physical simulation experiments and numerical simulation with a professional knowledge method to optimize the system concentration; laboratory flooding experiments showed a recovery increase 1.9% higher than that of the original weak-alkali composite system [10].
The professional knowledge method and the advantages and disadvantages of the single-factor experimental method are shown in Table 2.
As can be seen from Table 2, the two experimental methods currently in common use are still essentially traditional experimental methods with limited scopes of application. The expertise method suits problems with a clearer theoretical basis but may be ineffective for novel or not yet fully understood systems, while the single-factor experimental method struggles to analyze interactions among factors when many variables are present.
The above analysis shows that expert knowledge methods and single-factor experimental methods can, to some extent, quantify the influence of variables. Under appropriate conditions, expert-based approaches can construct multivariable mathematical models, while single-factor methods can assess the impact of individual variables. However, as application scenarios evolve and new problems arise, expert-based methods often reveal significant limitations, even though single-factor methods remain relatively unaffected. As a result, even the combination of these two methods fails to achieve a comprehensive quantitative analysis of the effect of multiple variables on the target function. Nevertheless, the introduction of mathematical methods still offers valuable insights.
Traditional methods, expert knowledge, and single-factor analysis—whether used individually or in combination—fail to overcome the technical barrier of quantitative analysis for multiple variables. Although mathematical modeling offers a feasible approach for quantitative analysis, in practical scenarios with numerous variables and complex relationships, simple mathematical methods still have limited applicability in model construction and factor selection.
Therefore, returning to the core issue, the essence of evaluating cleaning effectiveness for abandoned pipeline cleaning agents lies in clarifying the interactions among multiple variables, identifying key influencing factors, and selecting dominant ones. This enables uncovering the intrinsic relationships between cleaning agent performance and various influencing factors from a higher-dimensional perspective. Big data analysis methods have emerged in this context, leveraging their strengths in data mining, feature extraction, and pattern recognition to provide novel approaches for identifying and modeling dominant factors in complex systems, thus laying a solid foundation for the optimization and mechanistic understanding of abandoned pipeline cleaning agents.
However, investigations reveal that the application of big data methods in the field of abandoned pipeline cleaning remains limited, especially in process optimization and cleaning agent selection, with few related studies available for direct reference. In contrast, big data techniques have been widely applied and proven effective in other areas of petroleum engineering. Based on this, a systematic review of relevant case studies has been conducted to abstract the methodological logic and technical approaches, thereby demonstrating the feasibility and suitability of applying big data methods to cleaning agent selection for abandoned pipelines.
In 2005, Zheng Lihui et al. noted that meeting the performance requirements of working fluids often demanded large treatment agent dosages, which made cost reduction difficult [11]. They measured experimental data on the working fluid performance of oil and gas wells with two kinds of treating agents. Based on the quadratic distribution of the data, they established mathematical relationship equations and an optimization model using the nonlinear multivariate least squares method and optimization theory. Numerous field applications showed that this new cost-control method delivers remarkable economic benefits [11].
In 2016, Jinfeng et al., to optimize oil and gas working fluids, used the Multiple Regression Experimental Design (MRED) method to establish relationships between test indicators and multiple factors, eliminating human interference [12]. Applied to fuzzy ball drilling and completion fluids with varying densities and viscosities, the method produced optimal formulations that were successfully implemented in the field, demonstrating its effectiveness in multifactor optimization [12].
In 2019, Zhijuan et al. carried out sensitivity experiments to identify the cause of a sudden drop in fluid production in SHB-X wells. The maximum formation damage rate of 38.4% did not match field observations and was not the main cause, and the drilling fluids were well matched, so the damage was insufficient to stop production [13]. Analysis of 100 days of data from seven wells showed that the 7880 m well depth and the high oil recovery rate triggered wax formation and asphaltene plugging, reducing the fluid production index by more than 60%; this was the main controlling factor. Controlling daily fluid production is therefore the key to stabilizing production in deep carbonates [13].
In 2021, Xiangchun et al., aiming to quantitatively characterize the relationship between drilling fluid performance and reservoir damage and to guide drilling fluid optimization, applied big data methods such as multivariate regression and stripping algorithms. They obtained a mathematical model in which apparent viscosity, density, dynamic-to-plastic ratio, and pH were the main influences. The model accurately reflected the relationship between drilling fluid performance and the degree of damage and successfully guided field optimization, improving daily gas production by about 800 m³ [14].
In 2022, Yang prepared four oil-based drilling fluids of identical composition but different densities in the laboratory based on the field formula, explored the influence of temperature and pressure on the density of oil-based drilling fluids, and established a temperature–pressure binary regression model of oil-based drilling fluid density [15]. The accuracy of the model was verified using field oil-based drilling fluids of different densities. The results show high consistency between predicted and measured values, with an average prediction accuracy of 97.93%, which meets the needs of field use [15].
In 2023, to accurately guide drilling fluid density adjustment under underbalanced drilling conditions, Zhang et al. adopted a method of adjusting fluid density while drilling based on logging parameters, analyzing the gas logging response characteristics of buried-hill formations [16]. From a statistical standpoint, they proposed a correlation between the bottomhole pressure difference and gas measurement parameters and established a drilling fluid density adjustment model for the Caofeidian M structure, enabling timely and accurate density adjustment while drilling [16]. In the same year, Zhang et al. addressed the challenge of evaluating unsteady inflow dynamics without bottomhole pressure data or explicit multifactor equations [17]. Through theoretical derivation and multifactor fitting, they developed a calculation model relating bottomhole pressure to time, stress sensitivity coefficient, skin factor, cumulative production, and threshold pressure gradient. The results show that the model achieved an accuracy of 82.3–94.76% from the initial production stage to the pressure stabilization stage [17].
Based on the above survey of big data application cases, it can be found that the selection of cleaning agents for abandoned pipelines is essentially consistent with the problems in other fields, with the following specific similarities.
(1)
They all involve the influence of multiple variables—for example, cleaning agent performance is jointly affected by factors such as temperature, concentration, and residue composition—making the problem isomorphic to multiparameter optimization issues like drilling fluid and completion fluid formulation.
(2)
The objective function requires a mathematical mapping between variables and target performance indicators to achieve quantitative correlation and prediction among multiple parameters, enabling performance prediction and parameter optimization.
(3)
Traditional methods have clear limitations in terms of experimental cost, efficiency, and accuracy, while big data methods can improve screening efficiency and reduce trial-and-error through historical data analysis and mathematical modeling.
Therefore, although the application of big data methods in the selection of cleaning agents for abandoned pipelines currently lacks sufficient literature support and direct case studies, the above analysis shows that this type of problem shares a high degree of consistency in structural characteristics and modeling logic with other engineering problems where big data has been successfully applied. Thus, there is both a theoretical basis and practical feasibility for introducing big data methods into cleaning agent selection. This approach is expected to overcome the limitations of traditional methods in terms of efficiency and accuracy and significantly enhance the precision and speed of the screening process.

2. Laboratory Research

2.1. Variable Identification and Data Acquisition

Before commencing the work, a theoretical analysis is conducted to classify the variables involved in the research problem. These variables are categorized into controllable, uncontrollable, and semi-controllable based on their degree of controllability, and the specific variables relevant to the problem are clearly identified.

2.2. Standardization or Normalization of Underlying Data

Due to the presence of physical dimensions, inconsistent units can significantly affect the modeling process. Therefore, standardization is applied to eliminate the impact of dimensional differences on subsequent calculations.
Let the base data matrix be A, with elements a_ij, and the standardized matrix be B, with elements b_ij. The standard deviation (z-score) standardization formula is shown in Equation (1).

b_{ij} = \frac{a_{ij} - \mu_{ij}}{\sigma_{ij}}    (1)

In Equation (1), \mu_{ij} is the mean and \sigma_{ij} is the standard deviation.
After data standardization, the influence of dimensional units on the data scale is eliminated. However, if the standardized data still shows large deviations or poor concentration, model instability may occur during fitting. Therefore, normalization is required for data that is too small, overly concentrated, or dispersed. The min-max normalization formula is shown in Equation (2).
b_{ij} = \frac{a_{ij} - m(a_{ij})}{M(a_{ij}) - m(a_{ij})}    (2)

In Equation (2), M(a_{ij}) is the maximum value of the data and m(a_{ij}) is the minimum value of the data.
After normalization, the dataset is converted into dimensionless values distributed in [0, 1], suitable for comprehensive modeling analysis.
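To make these preprocessing steps concrete, the following is a minimal Python sketch of Equations (1) and (2) using NumPy; the data matrix and its values are hypothetical placeholders, not measurements from this study.

```python
import numpy as np

# Hypothetical base data matrix A: rows are samples, columns are variables
A = np.array([[12.0, 300.0, 0.5],
              [15.0, 450.0, 0.8],
              [ 9.0, 150.0, 0.2]])

# Equation (1): z-score standardization, computed column-wise
B_std = (A - A.mean(axis=0)) / A.std(axis=0)

# Equation (2): min-max normalization onto [0, 1], computed column-wise
B_norm = (A - A.min(axis=0)) / (A.max(axis=0) - A.min(axis=0))

print(B_std)
print(B_norm)
```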

2.3. Objectives and Single-Factor Analysis

Due to the limited number of experimental cases obtained from laboratory-based pipeline cleaning tests and the presence of numerous influencing factors, directly incorporating all variables into a regression model often leads to inaccurate results. Therefore, it is necessary to first apply single-factor linear regression to establish regression equations between the dependent variable and each individual independent variable. This approach helps identify the relationship between each influencing factor and the dependent variable, enabling effective variable selection. The general form of the linear equation obtained is shown in Equation (3).
\hat{Y} = a + bX_i    (3)

In Equation (3), \hat{Y} is the dependent variable, a is the intercept, b is the regression coefficient, and X_i is the independent variable.
Alternatively, Pearson correlation analysis can be conducted to assess the positive or negative correlation between the target variable and individual factors [18]. This allows for a more targeted approach in performing multiple regression analysis by selecting only the most relevant factors. The Pearson correlation analysis formula is shown in (4).
r_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y}    (4)

In Equation (4), r_{xy} is the correlation coefficient; \mathrm{Cov}(x, y) is the covariance of x and y; and \sigma_x, \sigma_y are the standard deviations.
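As an illustration, the single-factor screening of Equations (3) and (4) can be reproduced with scipy.stats.pearsonr, which returns both the correlation coefficient and the significance p-value used later in Section 3.3; the sample values below are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical single-factor data: one influencing factor vs. cleaning rate
factor_content = np.array([20.1, 18.5, 1.2, 8.0, 10.3, 0.5])  # e.g., mineral content, %
cleaning_rate  = np.array([13.0, 12.2, 0.4, 5.6, 7.8, 1.1])   # cleaning rate, %

r, p = pearsonr(factor_content, cleaning_rate)
print(f"r_xy = {r:.4f}, p = {p:.4f}")  # retain the factor when p is small or |r| is large
```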

2.4. Establishment of an Optimal Mathematical Model

Linear regression analysis is a method used to study the quantitative relationships among objective phenomena. Since these relationships are often complex and interdependent, the variation of a particular factor is typically influenced by two or more other variables. Based on newly acquired data, a multiple regression model can be developed to establish a mathematical relationship between the target variable and all relevant factors, enabling in-depth and systematic analysis [19].
The general form of the multiple regression model is shown in Equation (5).

\hat{Y} = a + \sum_{i=1}^{k} b_i X_i    (5)

In Equation (5), \hat{Y} is the dependent variable; a is the intercept; b_1, b_2, …, b_k are the regression coefficients; and X_1, X_2, …, X_k are the k independent variables.
To identify the most suitable fitting model, the regression methods employed in this study typically include linear regression, ridge regression, lasso regression, and stepwise regression [20].
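As a sketch, the candidate model families can be fitted with scikit-learn as shown below (stepwise regression, not shown, could be approximated with sequential feature selection); the data are synthetic placeholders generated only to make the example runnable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic standardized data: 30 samples, 6 influencing factors
rng = np.random.default_rng(0)
X = rng.random((30, 6))
y = X @ np.array([5.9, -0.7, 0.3, 0.6, 4.0, 5.4]) + rng.normal(0, 0.5, 30)

models = {
    "linear": LinearRegression(),   # ordinary least squares
    "ridge":  Ridge(alpha=1.0),     # L2 penalty, mitigates multicollinearity
    "lasso":  Lasso(alpha=0.1),     # L1 penalty, drives weak coefficients to zero
}
for name, model in models.items():
    model.fit(X, y)
    print(name, model.coef_.round(3))
```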
Many indicators reflect the accuracy and error of a regression model, such as MSE, RMSE, MAE, and R² [21]; this method adopts a weighted assessment index. The formula for the weighted assessment index is shown in Equation (6).
\text{Weighted Assessment Index} = 0.5 \times \text{MSE} + 0.2 \times \text{RMSE} + 0.2 \times \text{MAE} + 0.1 \times (1 - R^2)    (6)
In Equation (6), the weighted evaluation index combines multiple error metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and the coefficient of determination (R2), to provide a comprehensive assessment of model performance compared with any single error metric. A higher weighted evaluation index indicates larger overall prediction errors and lower model fit, whereas a lower index suggests smaller prediction errors and better model fit. Accordingly, the weighted evaluation index can be used to select regression models with superior goodness-of-fit and minimal error.
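A minimal sketch of Equation (6), assuming scikit-learn's standard error metrics:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def weighted_assessment_index(y_true, y_pred):
    """Equation (6): lower values indicate smaller error and better fit."""
    mse  = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae  = mean_absolute_error(y_true, y_pred)
    r2   = r2_score(y_true, y_pred)
    return 0.5 * mse + 0.2 * rmse + 0.2 * mae + 0.1 * (1 - r2)

# Compare candidate regression models and keep the one with the lowest index.
```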

2.5. Searching for Master Controllers and Modeling

Common methods for identifying the main controlling factors include the elimination method, the contribution rate elimination method, and the expert knowledge method.
Elimination methods are stepwise procedures used to remove irrelevant or redundant variables, commonly applied in regression analysis. By excluding variables that contribute minimally to the dependent variable or exhibit high multicollinearity with other predictors, these methods simplify the model and enhance its interpretability and predictive performance. The process involves sequentially evaluating the importance of each variable, often using information criteria such as Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). Typically, the regression begins with a full model including all variables, then progressively eliminates nonsignificant or highly collinear variables until only the most explanatory ones remain. Elimination can be implemented via techniques such as stepwise regression, ridge regression, and lasso regression, where variable removal is guided by statistical significance levels or coefficient penalization terms [22].
The contribution rate method is used to identify the primary controlling factors by quantifying each factor’s contribution to the overall outcome. Commonly applied in statistical analyses such as analysis of variance (ANOVA [23]) and principal component analysis (PCA), it measures the influence magnitude of different factors on the dependent variable. The basic procedure involves calculating contribution rates based on the optimal regression model and ranking variables according to their contributions. Subsequently, variables are set to zero one by one, and the model is refitted with the remaining variables to recalculate and reorder their contribution rates. If the relative ranking of a variable’s contribution remains unchanged, it indicates that the factor has minimal effect on the outcome or its influence has been substituted by other variables, allowing it to be classified as a nonprimary factor and removed. This iterative zeroing and refitting process continues until the primary factors with significant influence on the result are selected. Through this stepwise elimination, a simplified model is ultimately obtained that contains only the key controlling factors with substantial impact [24]. The contribution rate can be calculated according to Equation (7).
\text{Contribution Rate} = \frac{|b_i|}{\sum_{i=1}^{n} |b_i|}    (7)

In Equation (7), the numerator is the absolute value of the regression coefficient of the variable, and the denominator is the sum of the absolute values of all regression coefficients.
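For illustration, the contribution rates of Equation (7) can be computed directly from fitted coefficients; the values below reuse the coefficients of Equation (11) purely as an example.

```python
import numpy as np

coefs = np.array([5.935, -0.703, 0.265, 0.618, 4.024, 5.411])  # regression coefficients b_i

# Equation (7): share of each |b_i| in the total absolute coefficient mass
contribution = np.abs(coefs) / np.abs(coefs).sum()
for i, c in enumerate(contribution, start=1):
    print(f"X{i}: {c:.1%}")
```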
The expert knowledge method involves selecting key variables and constructing models based on the experience and expertise of domain specialists. This approach relies heavily on a deep understanding of the industry, technology, or discipline, and depends on experts’ judgments to determine which factors are important and which can be disregarded. Domain experts identify critical factors by leveraging practical experience, theoretical knowledge, historical data, industry standards, and existing research findings to assess which variables may significantly influence the target variable. Unlike statistical methods, the expert knowledge method emphasizes practical application and industry practices, integrating experiential insights and theoretical guidance in variable selection and analysis.

2.6. Back-Calculate to Obtain the Optimal Solution

The process of reverse-calculating the optimal formulation involves selecting appropriate mathematical methods to derive formulations that meet the target criteria. The specific approach depends on factors such as problem complexity, the number of variables, and the characteristics of the objective function. For continuous and differentiable objective functions, several commonly used optimization methods are available, each suitable for different scenarios.
The first category is linear programming (LP), which is an optimization technique for linear objective functions. Its goal is to find the optimal solution—typically the maximum or minimum value of the objective function—while satisfying a set of linear constraints [25].
The objective function of the linear programming problem can be expressed as Equation (8).

z = c_1 x_1 + c_2 x_2 + c_3 x_3 + \cdots + c_n x_n    (8)

In practical problems, usually x_i \ge 0. The other constraints are shown in Equation (9).

\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \le \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}    (9)
Linear programming methods are applicable only when both the objective function and the constraints are linear. The problem can be solved by finding the extremum of a multivariate function. For problems involving a large number of variables or constraints, computational tools such as MATLAB R2024a (24.1.0.2537033) are often employed to obtain the solution.
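As a sketch, a small LP in the form of Equations (8) and (9) can be solved with scipy.optimize.linprog, which minimizes c·x subject to A_ub x ≤ b_ub (to maximize z, the objective is negated); all numbers here are illustrative.

```python
from scipy.optimize import linprog

# Maximize z = 2*x1 + 3*x2 subject to x1 + x2 <= 10, 2*x1 + x2 <= 15, x_i >= 0.
# linprog minimizes, so the objective coefficients are negated.
c    = [-2.0, -3.0]
A_ub = [[1.0, 1.0],
        [2.0, 1.0]]
b_ub = [10.0, 15.0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2)
print(res.x, -res.fun)   # optimal x = [0, 10], z = 30
```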
The second category is nonlinear programming (NLP), where at least one of the objective function or constraints is nonlinear. Unlike linear programming, NLP allows for complex nonlinear relationships in the objective function and constraints, making it suitable for addressing a wider range of real-world problems such as production processes, chemical reactions, and physical phenomena [26].
There are several common nonlinear programming solution methods, such as gradient descent, Lagrange Multipliers, and so on.
(1)
Gradient Descent
Gradient descent is an iterative optimization algorithm used to find the minimum of a function. Widely applied in machine learning, deep learning, and mathematical optimization, gradient descent adjusts model parameters to minimize the loss function. Simply put, the core idea of gradient descent is to iteratively move in the direction of the steepest descent of the function to locate its minimum. This direction is determined by computing the gradient of the function at the current point [27].
The gradient is a vector that points in the direction of the greatest rate of increase of a function. For a differentiable univariate function, the gradient corresponds to its derivative. For example, when seeking a maximum, a positive derivative at a point indicates that the maximum lies to the right of that point. For continuously differentiable multivariate functions, the gradient is the vector of partial derivatives: it points toward the steepest ascent of the function, while the opposite direction points toward the steepest descent. A minimal numerical sketch follows.
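The sketch below minimizes the one-dimensional function f(x) = (x − 3)², whose derivative is 2(x − 3); the learning rate and iteration count are illustrative choices.

```python
def gradient_descent(x0, lr=0.1, steps=100):
    """Iteratively step against the gradient to approach the minimizer."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * (x - 3.0)   # f'(x) for f(x) = (x - 3)**2
        x -= lr * grad           # move in the steepest-descent direction
    return x

print(gradient_descent(0.0))     # converges toward x = 3, the minimum of f
```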
(2)
Lagrange Multipliers
The method of Lagrange multipliers is a classical approach designed to handle problems with equality constraints. Its core idea is to introduce Lagrange multipliers to transform a constrained optimization problem into an unconstrained problem involving the Lagrangian function’s stationary points. If the original objective function has an extremum under the constraints, then the constructed Lagrangian function must also have an extremum (with corresponding variable values equal). Consequently, the extremum points of the objective function under constraints lie among the stationary points of the Lagrangian function [28].
After constructing the Lagrangian function, the stationary points of this new function are solved to find the solution. Since the method of Lagrange multipliers is specifically designed for optimization problems with equality constraints, it may not be suitable for certain practical engineering problems. Therefore, improvements and extensions based on the classical Lagrange method are often necessary.
The limitation of the Lagrange multiplier method arises because its stationary points occur only on the constraint surface, where equality constraints are satisfied. When the constraints involve inequalities, the Lagrange method cannot explore the values of the objective function within the feasible region, potentially causing the extremum to lie inside the constraint boundaries. To address such problems, one can first find the unconstrained extrema of the objective function, then check whether these extrema lie within the feasible region. Extremal points outside the constraint boundaries are discarded, and only those inside are considered. Finally, the values of these interior extrema are compared with the objective function values at the stationary points of the Lagrangian to determine the overall optimum.
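As a worked sketch, the stationary points of a Lagrangian can be found symbolically with SymPy; the illustrative problem here, minimizing f = x² + y² subject to x + y = 1, is not drawn from the study.

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam")
f = x**2 + y**2                  # objective function
g = x + y - 1                    # equality constraint g = 0
L = f - lam * g                  # Lagrangian function

# Stationary points: all first-order partial derivatives of L vanish
sols = sp.solve([sp.diff(L, v) for v in (x, y, lam)], (x, y, lam), dict=True)
print(sols)                      # [{x: 1/2, y: 1/2, lam: 1}]
```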
Depending on the specific practical problem, an appropriate method from those described above is selected. By incorporating variable ranges or constraint conditions, the optimal value of the objective function can be determined, which in turn defines the values of the variables. This process facilitates the optimization of formulations, cleaning parameters, and other controllable factors [29].
In summary, the specific steps and theoretical basis of the big data screening method have been described in detail. To visually present the technical route of this study, enhance the understanding of its logical flow, and improve the operability and applicability of the big data approach, a flowchart has been developed based on the context of abandoned pipeline cleaning agent selection, as shown in Figure 1.

3. Methodological Applications

Samples were collected from six regions—Dalian, Zhuozhou, Langfang, Jinzhou, Qinhuangdao, and Shenyang. Using citric acid as the cleaning agent, the dosage and related cleaning parameters were varied. The relationship between the cleaning efficiency of the cleaning agent and geological factors, as well as cleaning parameters, was investigated. Based on the aforementioned methods, the optimal cleaning agent dosage and cleaning parameters were determined.

3.1. Variable Determination and Experimental Data Acquisition

The contradiction between the adhesion of sediments and the cleaning ability of the cleaning agent is a major concern during pipeline cleaning. To resolve this contradiction effectively, key variables such as clay mineral content, total rock mineral content, detergent composition, contact time, flow rate, and temperature are considered centrally, as these factors directly determine the effectiveness of cleaning.
Clay mineral content and total rock mineral content are the key variables determining sediment adhesion. Clay minerals possess strong adsorptive properties, making sediments difficult to remove, while the hardness and density of total rock minerals further increase removal difficulty. The type and structure of sediments directly influence the interaction between the cleaning agent and sediments. Therefore, understanding sediment composition is crucial for selecting suitable cleaning agents and determining cleaning methods. This forms the foundation for optimizing cleaning efficiency.
The dosage of the cleaning agent is crucial for the success of chemical cleaning. Different types of deposits—such as oil residues, mineral scales, or water scale—require specific amounts of cleaning agent for effective removal. The cleaning agent reacts with the deposits to weaken their adhesion to the pipeline wall, facilitating the removal of deposits from the inner surface. Contact time, flow rate, and temperature are key operational variables affecting the cleaning efficiency. A longer contact time ensures sufficient penetration and reaction between the cleaning agent and the deposits; a higher flow rate enhances the mechanical scouring effect and helps remove dissolved contaminants; and a suitable temperature improves the solubility and reaction rate of the cleaning agent, which is especially effective for high-viscosity oil and waxy substances. Therefore, adjusting these factors can significantly enhance cleaning performance.
In the pipeline cleaning process, numerous other variables exist, such as the pipe material, inner diameter, wall thickness, and geometric configuration, all of which can influence the cleaning procedure to some extent. However, their effects are relatively minor, especially regarding the interaction between the cleaning agent and the deposits, where they do not directly determine the cleaning agent’s performance or sediment removal efficiency. For example, pipe material may affect cleaning outcomes, but in chemical cleaning, the composition of the deposits and their reaction with the cleaning agent are more critical. While pipe diameter and wall thickness influence the selection and arrangement of cleaning equipment, they do not directly determine sediment removal efficiency. Additionally, pipe curvature and geometry may impact the choice of cleaning methods but do not significantly alter the effectiveness of the cleaning agent, particularly in the context of fluid flow and chemical reactions. Therefore, the influence of these other variables is more indirect and can often be simplified in practical applications, allowing focus on the most critical variables.
In practical applications, focusing on the most critical variables enhances the efficiency and precision of cleaning operations, while less significant variables can be reasonably neglected under specific conditions. This simplification of the cleaning strategy contributes to improved overall operational efficiency.
In summary, the relevant variables can be categorized as follows:
(1)
Uncontrollable residue components: hornblende, clay minerals, hard gypsum, barite, hematite, zeolite, magnesite, pyrite, dolomite, plagioclase, calcite, potassium feldspar, quartz, n-octane, n-nonane, n-decane, nC11–nC35;
(2)
Semi-controllable variables: stirring speed, contact time, test environment temperature, citric acid dosage.
In this study, steel sheets of similar size and mass (thin steel sheets) were used to simulate pipeline walls for residue cleaning evaluation tests. To mimic the formation environment, the steel sheets were first heated at 200 °C for 30 min in an electric blast drying oven, then removed. Residues of similar mass were sequentially adhered to the steel sheets, which were weighed after cooling to room temperature, and the mass before cleaning was recorded as W1. Subsequently, the steel sheets were immersed in the test cleaning agent, with variations in temperature, stirring speed, and immersion time. The mass after cleaning was recorded as W2, which was used to calculate the cleaning efficiency. The cleaning rate is calculated as shown in Equation (10).
\eta = \frac{W_1 - W_2}{W_1} \times 100\%    (10)

In Equation (10), \eta is the cleaning rate, %; W_1 is the mass of the residue-coated steel sheet before cleaning, g; and W_2 is the mass of the residue-coated steel sheet after cleaning, g.

3.1.1. Cleaning Rate of the Citric Acid Detergent on Inorganic Component Residues from Different Regions

In the laboratory, a muffle furnace was used to remove volatile components from the residues at high temperatures. After this high-temperature volatilization treatment, the residue changed from a clay-like state to a hard, blocky form. The hardened residue was then ground using a grinder and subsequently sieved through standard sieves to obtain particles sized between 200 and 400 mesh, which were used as test samples. Finally, the sieved residues were analyzed for whole-rock composition using X-ray diffraction (XRD), determining the contents of minerals including amphibole, clay minerals, gypsum, barite, hematite, zeolite, magnesite, pyrite, dolomite, plagioclase, calcite, potassium feldspar, and quartz from six different regions. The cleaning efficiency of the cleaning agents was tested separately on residues from different regions, and the experimental data were recorded, as shown in Table 3.
As can be seen from Table 3, among the samples from different regions, the cleaning rates for Qinhuangdao and Zhuozhou are relatively high, at 13.02% and 12.17%, respectively, corresponding to higher hornblende contents, which tentatively suggests that hornblende may enhance the cleaning effect. In contrast, the cleaning rate of the Shenyang sample is negative (−2.12%), which is not physically meaningful and may stem from experimental error, data recording bias, or problems with the cleaning rate baseline; the reliability of this data point is therefore questionable, and it should be treated with caution or excluded from the analysis. Nonetheless, the Shenyang sample, despite its high zeolite content (65.2%), still had a low cleaning rate, a trend similar to that of the other samples, suggesting that zeolite may not facilitate the cleaning process and may even be inhibitory under certain conditions. In addition, clay minerals appeared only in samples with low cleaning rates, such as Dalian (0.42% cleaning rate) and Shenyang, further suggesting a detrimental effect on cleaning. Taken together, hornblende shows some positive effect, clay minerals and zeolite may negatively affect the cleaning rate, and the remaining minerals have relatively insignificant effects.

3.1.2. Cleaning Rate of the Citric Acid Detergent on Organic Component Residues from Different Regions

The organic components of the residues in each region were analyzed by gas chromatography, and the contents of the organic components in the residues can be derived, as shown in Table 4.
As shown in Table 4, there are significant differences in the distribution of n-alkanes among samples from different regions, particularly in the high carbon number range (e.g., above nC30). The Langfang samples exhibit notably higher contents in the high carbon number interval (nC31–nC35), with nC35 reaching 15.68%, indicating heavier components and longer chain lengths. In contrast, the Shenyang samples show elevated levels in the medium to low carbon range (e.g., nC10–nC15), especially nC13 and nC15, which reach 8.49% and 8.71%, respectively, suggesting a predominance of medium-chain alkanes. Meanwhile, the distributions in samples from Qinhuangdao, Zhuozhou, Jinzhou, and Dalian are more balanced, featuring a more uniform carbon number distribution between C10 and C25, and relatively fewer heavy components. Notably, Dalian samples have high contents of low-carbon components (e.g., n-nonane, n-decane) but relatively low levels of components above C30, with nC35 at only 7.63%. Overall, Langfang samples trend heavier, Shenyang samples are enriched in medium-chain alkanes, and other regions exhibit a more balanced distribution. These differences may reflect distinct hydrocarbon generation mechanisms influenced by varying geological origins or physicochemical conditions.

3.1.3. Cleaning Efficiency of Citric Acid Cleaning Agent Under Different Formulation Parameters

A temperature-controlled magnetic stirrer was used to evaluate the cleaning efficiency of the detergent on residues from different regions under varying stirring speeds, contact times, and temperatures. The experimental data were recorded and are presented in Table 5.

3.2. Standardization and Normalization of Experimental Data

Most experimental data have dimensions, and under different dimensions, the influence of variables on the results varies. To eliminate the impact of dimensions on the fitting process, data were standardized and normalized using Equations (1) and (2).

3.3. Pairwise Variable Analysis Between Citric Acid Cleaning Efficiency and Each Single Variable

3.3.1. Relationship Between the Cleaning Efficiency of Citric Acid Cleaning Agent and the Inorganic Components of the Residues

Pearson correlation analysis and t-tests were used to examine the relationship between the inorganic components of the residues and the cleaning efficiency. The Pearson correlation coefficient and the significance level (p-value) for the total inorganic content of the residues are shown in Figure 2.
As shown in Figure 2, the p-values for the significance of the relationships between the cleaning rate and the individual inorganic components are all greater than 0.05, indicating that these components do not have a statistically significant direct impact on the cleaning rate when analyzed individually. Therefore, a comprehensive analysis is needed. In duct cleaning modeling, these inorganic components can be treated collectively for analytical purposes. In terms of correlation, the cleaning rate showed a positive correlation with the individual contents of hornblende, galena, and calcite, with correlation coefficients of 0.6533, 0.2796, and 0.5762, respectively. This may be attributed to their solubility or potential synergistic reactions with the cleaning agent. The weak positive correlation with galena suggests that it might play a limited auxiliary role in cleaning through ion exchange. In contrast, the cleaning rate exhibited negative correlations with the contents of clay minerals, hard gypsum, barite, hematite, magnesite, pyrite, dolomite, plagioclase feldspar, potassium feldspar, and quartz. The correlation coefficients were −0.5471, −0.6832, −0.3559, −0.5157, −0.3872, −0.3878, −0.3609, −0.3872, and −0.3609, respectively. Among them, hard gypsum and clay minerals showed the strongest inhibitory effects, which may be due to their reactions with water that produce precipitates or their ability to adsorb active ingredients in the cleaning agent, thereby reducing cleaning efficiency.

3.3.2. Relationship Between the Cleaning Rate of Citric Acid Detergents and the Organic Components of the Residue

Similarly, the relationship between the organic components of the residue and the cleaning rate was analyzed using Pearson correlation analysis and t-tests, and the Pearson correlation coefficients of the organic components of the residue and the p-values of the significance levels were obtained, as shown in Figure 3.
As shown in Figure 3, the p-values for the significance of the relationships between the organic components of the residues and the cleaning rate are all greater than 0.05, indicating that, from a single-factor analysis perspective, these organic components do not have a statistically significant direct impact on the cleaning rate. Therefore, a comprehensive analysis is required. In pipeline cleaning modeling, these organic components can be treated as a whole for modeling purposes. In terms of correlation, n-nonane (0.03), nC18 (0.024), and nC19 (0.037) exhibited weak positive correlations with the cleaning rate, which may be negligible or related to a stabilizing effect from dissolution. Compounds such as n-octane (0.114), n-decane (0.07), nC11 (0.148), nC12 (0.184), nC13 (0.219), nC14 (0.182), nC15 (0.231), nC16 (0.149), and nC17 (0.119) showed stronger positive correlations. This suggests that the cleaning agent may have a stronger solubilizing effect on short- to medium-chain alkanes, thereby improving cleaning efficiency.

3.3.3. Relationship Between the Cleaning Efficiency of Citric Acid Cleaning Agent and Stirring Speed, Contact Time, Temperature, and Cleaning Agent Dosage

Using Pearson correlation analysis, it was found that agitation speed, contact time, temperature, detergent dosage, and cleaning rate all showed statistically significant relationships at the 0.001 level. The correlation coefficients for agitation speed (0.482), contact time (0.623), ambient test temperature (0.59), and detergent dosage (0.626) all indicated positive correlations with the cleaning rate. This suggests that these variables can be included as independent variables in the modeling equation.
However, since the cleaning rate is influenced by multiple factors, and the correlation coefficients between each individual variable and the cleaning rate are not particularly high, single-factor fitting yields limited predictive power. Therefore, greater emphasis should be placed on the interaction effects among these variables when modeling the cleaning rate.

3.4. Establishment of the Optimal Multivariate Mathematical Model for the Cleaning Efficiency of Citric Acid Cleaning Agent

Based on the above analysis, the inorganic components of residues from the six regions were collectively treated as one variable (x1), while the organic components were treated as another variable (x2). Stirring speed (x3), cleaning time (x4), cleaning temperature (x5), and the dosage of citric acid cleaning agent (x6) were taken as four additional variables. Using SPSS 29.0.2.0(20), all experimental data from the six regions were modeled with cleaning efficiency as the dependent variable, and the organic and inorganic components, experimental operational parameters, and cleaning agent dosage as independent variables. Four regression methods were applied: linear regression (least squares), ridge regression, Lasso regression, and stepwise regression.
When modeling with linear regression (least squares), the F-test results show a significance p-value of 0.160, indicating no statistical significance at the conventional level. Therefore, the null hypothesis that the regression coefficients are zero cannot be rejected, and the model is considered invalid.
Regarding multicollinearity among variables, the Variance Inflation Factor (VIF) values for cleaning time and cleaning temperature exceed 10, indicating the presence of multicollinearity. Therefore, ridge regression, stepwise regression, and Lasso regression were used for modeling. The evaluation parameters of these models are presented in Table 6.
According to the data in Table 6, using Equation (6) to select the optimal model, the model evaluation indices for ridge regression, stepwise regression, and Lasso regression were found to be 2.36, 2.19, and 1.68, respectively. This indicates that the Lasso regression model has better goodness of fit and smaller error compared with the other two models. Therefore, the Lasso regression equation was chosen as the optimal model for the citric acid cleaning agent, as shown in Equation (11).
Y = -187.986 + 5.935X_1 - 0.703X_2 + 0.265X_3 + 0.618X_4 + 4.024X_5 + 5.411X_6    (11)

In Equation (11), X_1 is the inorganic content, %; X_2 is the organic content, %; X_3 is the agitation speed, m³/h; X_4 is the contact time, min; X_5 is the cleaning temperature, °C; and X_6 is the amount of citric acid cleaning agent added, g.
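For illustration, Equation (11) can be evaluated directly to predict the cleaning rate for a candidate condition; the input values below are hypothetical and must use the same units as the regression data.

```python
import numpy as np

intercept = -187.986
coefs = np.array([5.935, -0.703, 0.265, 0.618, 4.024, 5.411])  # Equation (11)

# Hypothetical condition: X1 inorganic %, X2 organic %, X3 agitation speed,
# X4 contact time (min), X5 temperature (deg C), X6 citric acid dosage (g)
x = np.array([10.0, 40.0, 30.0, 20.0, 30.0, 5.0])

y_pred = intercept + coefs @ x
print(f"predicted cleaning rate: {y_pred:.2f} %")  # about 11.33 for these inputs
```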

3.5. Elimination-Based Identification of the Main Controlling Factors Affecting the Cleaning Efficiency of Citric Acid Cleaning Agent

Since Lasso regression introduces an L1 regularization term to penalize the coefficients of variables, it tends to produce a sparse model in which only a small number of features have nonzero coefficients. This characteristic aligns well with the goal of variable selection. Therefore, when the dataset contains many features but only a portion is truly relevant, Lasso can effectively identify the most important variables. As a result, the Lasso model inherently possesses variable elimination capability, making it unnecessary to perform additional variable elimination to identify the main controlling factors.

3.6. Inverse Calculation of the Optimal Citric Acid Dosage for Maximum Cleaning Efficiency Under Cost Constraints

The equations obtained from Lasso regression are linear equations with constant coefficients. When a specific target cleaning rate is set, if the rank of the augmented matrix equals the rank of the coefficient matrix but is less than the number of variables, the system has infinitely many solutions, making linear optimization infeasible. However, if cost constraints are introduced—usually in the form of linear equations—each feasible solution corresponds to a specific cost. By optimizing the cost, the formulation with the lowest cost can be determined.
To achieve the lowest cost formulation of retired oil and gas pipeline cleaning agent components while meeting a specific cleaning rate target (Y), this study introduces particle swarm optimization (PSO) to search for the optimal solution based on the predictive model constructed from Lasso regression. The PSO algorithm offers advantages such as fast convergence, ease of implementation, and suitability for continuous parameter space optimization, making it well suited for solving the multivariable nonlinear optimization problem in this study.
(1)
Design of the Optimization Objective Function
In the Lasso regression model, a linear expression has been determined, shown in Equation (12).
\hat{Y} = \beta_0 + \sum_{i=1}^{6} \beta_i x_i    (12)
In Equation (12), xi denotes the dosage of the ith component or operating parameter, β i is the corresponding regression coefficient, and β 0 is the intercept term. In this study, the objective function is to minimize the total cost of the cleaner, where the unit price corresponding to each variable is ci. The cost function can be expressed as Equation (13).
\mathrm{Cost}(x) = \sum_{i=1}^{6} c_i x_i    (13)
To ensure that the optimization results are acceptable in engineering applications, further constraints are introduced to require the cleaning rate predicted by the regression model to satisfy the target value Ytarget set by the user, with the expression shown in Equation (14).
\left| Y_{\text{target}} - \hat{Y} \right| \le 1    (14)
If a group of configurations x fails to satisfy the above condition, a penalty cost is assigned to it such that cost is equal to positive infinity, so that combinations that do not satisfy the cleaning rate requirement are automatically excluded from the optimization process.
(2)
PSO solving process setting
In particle swarm optimization, each particle represents a parameter configuration x = (x1, ..., x6) to be evaluated, which is continuously updated in the parameter space according to its velocity and position to converge to the optimal solution. The flow of the algorithm is as follows:
Initialization phase: Set the number of particles (population size) to 100 and the maximum number of iterations to 1000; each particle is randomly initialized in the six-dimensional parameter space, with the search range determined based on the physical or engineering boundaries of each variable.
Fitness evaluation: The configuration corresponding to each particle is evaluated for cost using the objective function. If the predicted cleaning rate constraint is satisfied, the calculated cost is returned; otherwise, the value is set to positive infinity.
Iterative search: During the iteration process, particles continuously update their positions and velocities, guided by their individual historical best positions (pBest) and the global best position (gBest), gradually converging toward the optimal parameter combination.
Multiple runs: Considering the randomness of the PSO algorithm, to improve the stability and diversity of the solutions, this study repeats the optimization process 10 times and selects the top 10 optimal configurations with the lowest costs as the recommended results.
(3)
Boundary Settings and Parameter Definitions
During the optimization process, the upper and lower bounds of each parameter variable are set based on laboratory data and engineering experience. For example, the detergent dosage (X6) is set within [0, 15]% and the cleaning temperature (X5) within [0, 70] °C. The process is implemented in Python 3.13.5, and users can dynamically adjust the boundary values of organic/inorganic content according to the residue properties via a graphical interface, thereby achieving optimal cleaning formulation selection.

4. Discussion of Results and Observations

4.1. Comparison of the Efficiency in Obtaining Guided Formulations Among Full Factorial, Orthogonal, Empirical Experimental Methods, and Big Data Approaches

4.1.1. Comparison of the Speed in Obtaining Formulations Using Four Different Methods

Traditional experimental methods include full-factorial, orthogonal, and empirical experiments. The speed of obtaining guided formulations can be measured by the time required to design and complete the experiments for each method: the less time consumed, the higher the efficiency.
Due to the large number of variables involved, the full-factorial, orthogonal, and empirical experiments cannot feasibly consider every variable. To remain tractable, these three methods consider only the controllable variables, namely detergent dosage, contact time, cleaning temperature, and stirring speed, each at three levels. On this basis, the big data method requires 6 experimental groups, the full-factorial experiment 81 groups (3^4 combinations of four factors at three levels), the orthogonal experiment 9 groups, and the empirical experiment 8 groups. The time consumed by each method was recorded, and the guided detergent formulations were obtained, as shown in Figure 4.
As can be seen from Figure 4, the big data method requires the least time, only 120 min. The full-factorial, orthogonal, and empirical methods took 1620 min, 280 min, and 160 min, respectively, with an average of 686.7 min. Compared with this average, the big data method reduces the time required by 82.5%, improving efficiency by the same margin and clearly demonstrating its advantage in speed.
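The 82.5% figure follows directly from the recorded times:

\frac{686.7 - 120}{686.7} \times 100\% \approx 82.5\%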

4.1.2. Analysis of the Reasons for the Superior Speed of Formulation Acquisition Using the Big Data Method

The big data method demonstrates a significant advantage in rapidity for formulation optimization compared with comprehensive, orthogonal, and empirical experimental methods. This advantage mainly stems from the big data method’s ability to efficiently process and analyze large volumes of data, automatically identifying optimal solutions. In contrast, traditional experimental methods rely on manual experiment design and iterative adjustments, which involve time delays and operational complexity. The rapidity of the big data method is reflected in its capability to quickly extract patterns from data and predict optimization results through modeling, thereby significantly reducing experimental time and labor intensity.
Firstly, the full factorial experimental method is an exhaustive approach that tests all possible combinations of factors, ensuring that every variable and its interactions are considered. Theoretically, it guarantees that no potential factor is overlooked. In practice, however, the number of experiments grows exponentially with the number of factors: with 10 factors each at 3 levels, the full factorial method would require 3^10 = 59,049 experiments. Such exponential growth results in lengthy experimental cycles that consume substantial time and resources, making it impossible to quickly obtain the optimal formulation; its rapidity is therefore poor.
The orthogonal experimental method reduces the number of experiments by designing a representative combination of experiments, thereby improving experimental efficiency. Although the orthogonal method can obtain relatively accurate results with fewer experiments, it still relies on manual selection of factor-level combinations. When dealing with multifactor, high-dimensional optimization problems, the design complexity and number of experiments for the orthogonal method remain relatively high, requiring substantial adjustments in the experimental plan. Therefore, although the orthogonal method is more rapid than the full factorial method, its rapidity cannot compare with that of big data methods when variable relationships are complex.
The empirical experimental method relies on the experimenter’s experience, optimizing formulations through continuous adjustments and observation of results. Its main limitation is the heavy dependence on personal judgment, with adjustments often being sporadic and unsystematic. The empirical method generally lacks a systematic experimental design, leading to additional experiments and repeated validations for each adjustment. When dealing with complex formulations, the empirical method has a longer experimental cycle, making it difficult to quickly achieve efficient optimization results, and it is prone to repetitive testing and inefficient exploration.
Compared with these traditional methods, big data approaches leverage powerful computational capabilities and automated algorithms to efficiently identify underlying patterns in data and perform real-time optimization. Unlike traditional methods that rely on manual experimental design and adjustments, big data methods can process and optimize data almost in real-time, significantly shortening the experimental cycle and reducing the time spent on repeated trials and validations, thereby markedly enhancing the speed of formulation optimization.

4.2. Comparison of the Accuracy of Different Guidance Formulations Obtained by Full-Factorial, Orthogonal, Empirical Experimental Methods, and Big Data Approaches

4.2.1. Comparison of Formulation Accuracy Achieved by Four Different Methods

The cleaning efficiency is measured by calculating the percentage of residue removed relative to the total mass of the residue and the steel sheet. It can be quantified using Equation (10).
After preparing the detergents according to the guidance schemes, they were assigned identification codes: the detergent obtained by the big data method was labeled A, the one from the full-factorial experimental method B, the orthogonal experimental method C, and the empirical experimental method D. Additionally, a control group E was established, representing the theoretical maximum cleaning efficiency assuming complete removal of the residue under ideal conditions.
Detergents A, B, C, and D were each used to clean steel sheets with the same mass and coated with an identical amount of residue from the Dalian area. For each detergent, six steel sheets were cleaned, and the cleaning rates obtained were averaged to represent the final cleaning efficiency.
Accuracy can be evaluated by the accuracy rate, which is defined as the ratio of the cleaning rate to the theoretical maximum cleaning rate. This metric reflects how closely the cleaning performance of the four detergent groups approaches that of the control group E. The accuracy rate can be quantified using Equation (15).
\alpha_i = \frac{y_i}{y_E} \quad (15)
In Equation (15), y_i denotes the cleaning rate of the detergent formulated according to the guideline formulation given by each experimental method, and y_E denotes the theoretical maximum cleaning rate of control group E.
The accuracy rate of each formulated cleaner was calculated separately, with group E set to 100%, as shown in Figure 5.
As can be seen in Figure 5, among the four detergent groups A, B, C, and D, the accuracy rate of the big data method was 87.5%, while the accuracy rates for the full-factorial, orthogonal, and empirical experimental methods were 90.7%, 84.3%, and 78.1%, respectively. On the one hand, excluding the big data method, the average accuracy rate of the other three experimental methods was 84.4%, so the big data method outperformed this average by 3.1 percentage points, a relative improvement of 3.67%. On the other hand, defining the shortfall between a method's accuracy rate and that of control group E as the under-accuracy rate, the under-accuracy rates of the full-factorial, orthogonal, and empirical methods were 9.3%, 15.7%, and 21.9%, respectively, averaging 15.6%, while that of the big data method was 12.5%, i.e., 3.1 percentage points (approximately 19.9%) lower.
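For transparency, these percentages can be reproduced from the individual accuracy rates:

\bar{\alpha}_{\mathrm{trad}} = \frac{90.7 + 84.3 + 78.1}{3} = 84.4\%, \qquad \frac{87.5 - 84.4}{84.4} \approx 3.67\%

\frac{9.3 + 15.7 + 21.9}{3} \approx 15.6\%, \qquad \frac{15.6 - 12.5}{15.6} \approx 19.9\%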
In summary, although the full factorial experimental method achieved the highest accuracy rate, the big data method closely followed with a comparable accuracy. The big data method, based on statistical principles, may introduce some errors, but these errors remain within an acceptable range. Therefore, the big data method demonstrates excellent accuracy and strong practical applicability.

4.2.2. Analysis of the Reasons for the High Accuracy of Cleaning Agents Formulated by Big Data Methods

The big data method exhibits a significant advantage in formulation accuracy compared with the full factorial, orthogonal, and empirical experimental methods. This superiority stems from its ability to leverage machine learning and data analytics to precisely model complex causal relationships, capturing nonlinear and high-dimensional interactions, thereby providing higher predictive accuracy. In contrast, traditional experimental methods, although offering some experimental data support, often suffer from limitations due to experimental design, assumptions, and human errors, failing to fully reflect all critical factors and their complex interrelations in the formulation [30].
The full factorial method theoretically covers all factor combinations and interactions through exhaustive experimentation. While it provides comprehensive variable coverage, its accuracy is practically constrained by feasibility and resource limitations. The method typically assumes factor independence or linear effects, but in practice, as the number of factors grows and the experiment count increases exponentially, many formulation performances depend on nonlinear and interaction effects that simple experimental designs cannot fully capture, limiting its final accuracy.
Orthogonal experimental design reduces the number of tests while maintaining representative factor coverage. However, like the full factorial method, it struggles to accurately model complex nonlinear dependencies and interactions among factors. When strong nonlinear dependencies exist, orthogonal methods may lack sufficient information to precisely describe these relations, impacting final formulation accuracy. Moreover, reliance on manually designed tables may omit important factor combinations, resulting in incomplete outcomes.
Empirical methods depend on experimenters' prior experience and intuition, typically optimizing based on accumulated data and historical knowledge. Their accuracy is limited by subjective judgment and the bounds of that experience. Experimenters may adjust factors based on known insights but often miss subtle interactions or overlooked variables due to the lack of systematic data analysis and modeling. This becomes especially problematic in complex, high-dimensional formulation scenarios, where empirical methods tend to perform poorly in terms of accuracy.
In contrast, big data methods use advanced algorithms such as machine learning and deep learning to automatically learn from large historical datasets, capturing complex nonlinear relationships and interactions. Their core strength lies in extracting deep insights from multidimensional data and generating precise predictive models. These models can handle highly complex and nonlinear formulation problems by optimizing parameters through large-scale data training, accurately predicting formulation performance under various conditions. Machine learning models self-adjust and iteratively improve predictive accuracy through each training cycle. This data-driven approach reveals intricate factor interactions and latent patterns, delivering more comprehensive and accurate formulation optimization results than traditional methods.
Therefore, the accuracy of big data methods arises from their powerful modeling capabilities in high-dimensional, multifactor environments. They effectively capture nonlinearities and complex interactions, avoiding errors introduced by simplified experimental designs, limited assumptions, and subjective judgments inherent in traditional methods. Through comprehensive data analysis and automated modeling, big data methods achieve more precise formulation predictions and significantly enhance the accuracy of optimization outcomes [31].

5. Conclusions

The cleaning efficiency of a detergent is influenced by numerous factors, which makes the analysis complex. A big data approach is proposed to address the limitations of previous methods in accuracy and rapidity. The main conclusions are as follows.
(1) Compared with the full-factorial, orthogonal, and empirical experimental methods, the big data method achieves good accuracy, with an average accuracy improvement of 3.67%, and quantifies the degree of influence of each variable on the cleaning rate. Taking citric acid cleaner as an example, the model equation of its cleaning rate is Y = −187.986 + 5.935X1 − 0.703X2 + 0.265X3 + 0.618X4 + 4.024X5 + 5.411X6, and an inverse calculation method is given.
(2) In terms of swiftness, the big data method shows a clear advantage of 82.5% over traditional methods. This is its most valuable and important advantage.
Every method inevitably has its own limitations, and this research is no exception; in particular, its application to the selection of cleaning agents for abandoned pipelines is still limited. The specific limitations and corresponding recommendations are as follows.
(1) Big data methods rely heavily on data quality. Incomplete or inconsistent data can reduce accuracy and reliability. In complex systems, poor model interpretability also limits practical decision-making, especially in high-risk fields such as petroleum engineering [32]. It is recommended to establish a systematic data management framework to ensure data accuracy, consistency, and integrity, thereby enhancing overall data quality. Additionally, integrating multiple models can improve the handling of complex variable relationships and better address specific screening and matching challenges [33].
(2) As the application of this study is still limited and validation is based only on laboratory experiments, field testing has not yet been conducted. Further on-site validation is needed to assess the model's reliability and practical applicability.

Author Contributions

In this study, R.L. provided project help and technical assistance; J.Z., data processing and article organization; L.S., technical assistance; L.J., data processing and idea organization; S.C., technical assistance; and L.Z., experimental and thesis guidance. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the North Pipeline Company of the National Pipeline Network Group, grant number GWHT20230030172. The APC was funded by Lihui Zheng.

Data Availability Statement

This study did not report any data.

Acknowledgments

We are grateful to the laboratory for providing a good experimental environment and advanced equipment, and to the members for their help in data collection, analysis, and research methodology. The help and support from many organizations and individuals during the research and thesis writing process are gratefully acknowledged.

Conflicts of Interest

Authors Rongguang Li, Ling Sun and Sixun Chen were employed by the company North Pipeline Company of the National Pipeline Network Group. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Wasim, M.; Djukic, M.B. External corrosion of oil and gas pipelines: A review of failure mechanisms and predictive preventions. J. Nat. Gas Sci. Eng. 2022, 100, 104467.
2. Haydu, R.; Brantley, W.E., III; Casenhiser, J. Industrial Chemical Cleaning Methods; Higgins, L.R., Wikoff, D.J., Eds.; McGraw-Hill Education: New York, NY, USA, 2008.
3. Turner-Walker, G. The nature of cleaning: Physical and chemical aspects of removing dirt, stains and corrosion. In Proceedings of the International Symposium on Cultural Heritage Conservation, Tainan, Taiwan, 6–8 November 2012.
4. Li, Y.-C.; Wang, D.; Wang, Y.; Sun, J.-X. Study on process conditions of oil sludge washing and petroleum retrievement by thermochemical washing. Environ. Pollut. Control 2008, 30, 39–42.
5. Jing, J.; Yu, J.; Yao, Y.; Jiang, D.; Huang, M. Replacement and Cleaning of Deserted Inshore Oil and Gas Pipelines. In ICPTT 2009: Advances and Experiences with Pipelines and Trenchless Technology for Water, Sewer, Gas, and Oil Applications; American Society of Civil Engineers: Reston, VA, USA, 2012; pp. 1349–1361.
6. Yang, F.; Chi, S.; Wen, Y.; Wang, H.; Li, J.; Zhu, F.; Li, C.; Tian, W. Preparation of Cleaning Agent for Treatment of Abandoned Oil Pipeline. In Proceedings of the E3S Web of Conferences, Yogyakarta, Indonesia, 7–8 September 2020; EDP Sciences: Les Ulis, France, 2020; p. 04027.
7. Rocha, N.O.; Khalil, C.N.; Leite, L.C.; Goja, A.M. Thermochemical process to remove sludge from storage tanks. In Proceedings of the SPE International Conference on Oilfield Chemistry, Houston, TX, USA, 28 February–2 March 2007; SPE: Richardson, TX, USA, 2007; SPE-105765-MS.
8. Liu, H.; Wang, X.; Zhai, Y.; Xu, T. Application and development of chemical heat washing technology in petroleum oily sludge treatment: A review. Separations 2024, 11, 26.
9. Hongbing, K.; Chengcheng, N.; Hu, J.; Dingxiang, G.; Changlou, D. Ultra-high temperature workover fluid with flexible rubber particles used in Shunbei Oilfield. Drill. Fluid Complet. Fluid 2021, 38, 525–530.
10. Wenting, G.; Xiaoqin, Z.; Zhibiao, C. Surfactant concentration optimization of weak alkali ASP flooding system for Class III reservoirs in Daqing Oilfield. Pet. Geol. Oilfield Dev. Daqing 2023, 42, 99–107.
11. Zheng, L.-H.; Yan, J.-N.; Chen, M.; Zhang, G.-Q. Optimization model for cost control of working fluids in oil and gas wells. Acta Pet. Sin. 2005, 26, 102.
12. Wang, J.; Zheng, L.; Li, B.; Deng, J. A novel method applied to optimize oil and gas well working fluids. Int. J. Eng. Tech. Res. 2016, 5, 2454–4698.
13. Huang, Z.; Pan, L.; Lu, H.; Zheng, L.; Li, D.; Hai, X. The reasons for sudden production drop by big data analysis in trial production for Well SHB-X in Shunbei oilfield. Pet. Reserv. Eval. Dev. 2019, 41, 341–347.
14. Wang, X.; Liu, H.; Wang, C.; Chen, B. Big data method for evaluating reservoir damage degree of fuzzy ball drilling fluid. Pet. Reserv. Eval. Dev. 2021, 11, 605–613.
15. Yang, L.; Li, Z.; Nie, Q.; Liang, Y.; Jiang, G. Study on effects of temperature and pressure on density of oil based drilling fluids and the mathematical model thereof. Drill. Fluid Complet. Fluid 2022, 39, 151–157.
16. Zhang, X.; Huang, Y.; Guo, M.; Li, Z.; Yuan, R.; Chai, X. Application of technology of drilling fluid density adjustment while drilling to buried hill formation of Caofeidian M structure. Mud Logging Eng. 2023, 34, 15–21.
17. Zhang, P.; Wang, X.; Feng, C.; Zheng, L.; Zhang, Y.; Sun, M. Evaluation method and application of gas–water two-phase unsteady inflow performance of coal reservoir based on multiple factors. Nat. Gas Geosci. 2023, 34, 1641–1651.
18. Dufera, A.G.; Liu, T.; Xu, J. Regression models of Pearson correlation coefficient. Stat. Theory Relat. Fields 2023, 7, 97–106.
19. Fedotova, O.; Teixeira, L.; Alvelos, H. Software effort estimation with multiple linear regression: Review and practical application. J. Inf. Sci. Eng. 2013, 29, 925–945.
20. Chen, R.; Paschalidis, I.C. A robust learning approach for regression models based on distributionally robust optimization. J. Mach. Learn. Res. 2018, 19, 1–48.
21. Li, H.; Liang, Y.; Xu, Q. Support vector machines and its applications in chemistry. Chemom. Intell. Lab. Syst. 2009, 95, 188–198.
22. Chakrabarti, A.; Ghosh, J.K. AIC, BIC and recent advances in model selection. Philos. Stat. 2011, 7, 583–605.
23. Ståhle, L.; Wold, S. Analysis of variance (ANOVA). Chemom. Intell. Lab. Syst. 1989, 6, 259–272.
24. Luo, C.; Jiang, Z.; Wang, C.; Hu, Z. Relative contribution ratio: A quantitative metrics for multi-parameter analysis. Cogent Math. 2015, 2, 1068979.
25. Hoffman, A.J. On simple linear programming problems. In Proceedings of Symposia in Pure Mathematics; American Mathematical Society: Providence, RI, USA, 1963; pp. 317–327.
26. Bertsekas, D.P. Nonlinear programming. J. Oper. Res. Soc. 1997, 48, 334.
27. Désidéri, J.-A. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Math. 2012, 350, 313–318.
28. Rockafellar, R.T. Lagrange multipliers and optimality. SIAM Rev. 1993, 35, 183–238.
29. Sáez, P.B.; Rittmann, B.E. Model-parameter estimation using least squares. Water Res. 1992, 26, 789–796.
30. Baig, M.I.; Shuib, L.; Yadegaridehkordi, E. Big data tools: Advantages and disadvantages. J. Soft Comput. Decis. Support Syst. 2019, 6, 14–20.
31. Zhang, H.; Zeng, Y.; Liao, L.; Wang, R.; Hou, X.; Feng, J.; Mulunjkar, A. How to land modern data science in petroleum engineering. In Proceedings of the SPE Asia Pacific Oil and Gas Conference and Exhibition, Online, 12–14 October 2021; SPE: Richardson, TX, USA, 2021; D031S025R003.
32. Fan, J.; Han, F.; Liu, H. Challenges of big data analysis. Natl. Sci. Rev. 2014, 1, 293–314.
33. Cai, L.; Zhu, Y. The challenges of data quality and data quality assessment in the big data era. Data Sci. J. 2015, 14, 2.
Figure 1. Technical route for big data screening methods for waste pipeline cleaning agents.
Figure 2. Pearson correlation coefficients and p-values between the cleaning efficiency of citric acid cleaning agent and the inorganic components of the residues.
Figure 3. Pearson correlation coefficients and p-values between the cleaning efficiency of citric acid cleaning agent and the organic components of the residues.
Figure 4. Time required to obtain guided formulations using the full-factorial experimental method (B), orthogonal experimental method (C), empirical experimental method (D), and big data method (A).
Figure 5. Comparison of the accuracy of cleaning agent formulations obtained by big data methods and traditional experimental methods.
Table 1. Comparison of the rapidity and accuracy of screening methods for oil and gas pipeline cleaners.

| Experimental Method | Advantage | Disadvantage |
| Full-factorial experimental method | Comprehensively analyzes the underlying laws of variable behavior, with high accuracy. | With m variables at n levels each, n^m experiments are required; the number of experiments is large and swiftness is poor. |
| Orthogonal experimental method | Swiftness improves when variables and levels are few, as orthogonality allows efficient selection of representative trials from the full set. | Swiftness degrades, and accuracy suffers, when the number of variables and levels is high. |
| Empirical experimental method | The number of experiments is often small, and swiftness is high. | Over-relies on previous data, models, and experience; lacks the accuracy to solve general problems. |
Table 2. Advantages and disadvantages of current common experimental methods for chemical system optimization.

| Experimental Method | Advantage | Disadvantage |
| Expert knowledge method | Based on theory and experience in the field, it can quickly identify key factors and saves time by avoiding unnecessary experiments. | Suitable for systems with clear mechanisms and expert knowledge, but weak in handling complex variables. |
| Single-factor (one-way) experimental method | Analyzes the role of each factor individually and identifies the main control variables. | Best for studies with few factors or clear mechanisms; cannot assess interactions or complex effects. |
Table 3. Inorganic content of residues from different regions and cleaning efficiency of citric acid cleaning agent.

| | Dalian | Jinzhou | Langfang | Qinhuangdao | Shenyang | Zhuozhou |
| Cleaning rate/% | 0.42 | 10.52 | 0.79 | 13.02 | −2.12 | 12.17 |
| Quartz/% | 20.65 | 22.25 | 20.35 | 32.6 | 20.15 | 21.85 |
| Potassium feldspar/% | 16.15 | 0 | 1.45 | 0 | 3.2 | 0 |
| Calcite/% | 3.1 | 0 | 0 | 0 | 1.95 | 0 |
| Plagioclase/% | 7.7 | 11.4 | 2.8 | 0 | 7.25 | 0 |
| Dolomite/% | 20.7 | 0 | 3.15 | 0 | 2.25 | 0 |
| Pyrite/% | 0 | 46.65 | 64.5 | 67.4 | 65.2 | 59.3 |
| Magnesite/% | 20.05 | 0 | 0 | 0 | 0 | 0 |
| Zeolite/% | 0 | 0 | 0 | 0 | 0 | 0 |
| Hematite/% | 7.85 | 0 | 0 | 0 | 0 | 0 |
| Barite/% | 0 | 0 | 3.05 | 0 | 0 | 0 |
| Hard gypsum/% | 0 | 19.7 | 2.3 | 0 | 0 | 18.85 |
| Clay minerals/% | 3.8 | 0 | 0 | 0 | 0 | 0 |
| Anatase/% | 0 | 0 | 2.4 | 0 | 0 | 0 |
Table 4. Organic matter content of residues from different regions and cleaning efficiency of citric acid cleaning agent.

| | Dalian | Jinzhou | Langfang | Qinhuangdao | Shenyang | Zhuozhou |
| n-octane/% | 1.57 | 0.99 | 1.57 | 1.52 | 1.85 | 0.86 |
| n-nonane/% | 2.16 | 1.27 | 1.91 | 1.92 | 2.72 | 1.22 |
| n-decane/% | 3.71 | 1.84 | 2.55 | 2.57 | 4.35 | 1.78 |
| nC11/% | 5.59 | 2.59 | 3.08 | 3.05 | 6.25 | 2.30 |
| nC12/% | 5.50 | 3.23 | 2.57 | 3.09 | 6.17 | 2.64 |
| nC13/% | 7.41 | 4.14 | 3.29 | 3.54 | 8.49 | 3.72 |
| nC14/% | 6.22 | 4.54 | 3.31 | 3.58 | 7.03 | 4.16 |
| nC15/% | 7.75 | 4.84 | 3.21 | 3.79 | 8.71 | 4.50 |
| nC16/% | 5.57 | 4.88 | 2.37 | 3.81 | 5.76 | 4.46 |
| nC17/% | 4.96 | 4.80 | 2.14 | 3.82 | 4.92 | 4.43 |
| nC18/% | 3.69 | 4.55 | 1.74 | 3.83 | 3.70 | 4.15 |
| nC19/% | 4.35 | 5.12 | 2.02 | 4.41 | 4.37 | 4.79 |
| nC20/% | 3.12 | 4.78 | 1.71 | 4.18 | 3.28 | 4.44 |
| nC21/% | 2.87 | 5.18 | 1.60 | 4.50 | 3.19 | 4.75 |
| nC22/% | 2.53 | 4.92 | 1.47 | 4.28 | 2.73 | 4.54 |
| nC23/% | 2.36 | 5.44 | 1.63 | 4.52 | 3.20 | 5.07 |
| nC24/% | 2.06 | 4.47 | 1.30 | 3.77 | 2.39 | 4.23 |
| nC25/% | 1.82 | 4.56 | 1.74 | 3.82 | 2.30 | 4.38 |
| nC26/% | 1.69 | 3.68 | 1.28 | 3.18 | 3.47 | 3.75 |
| nC27/% | 1.35 | 3.58 | 1.38 | 3.15 | 1.79 | 3.95 |
| nC28/% | 1.36 | 3.10 | 1.63 | 3.11 | 1.86 | 3.64 |
| nC29/% | 1.28 | 2.94 | 2.07 | 3.24 | 1.46 | 3.80 |
| nC30/% | 1.27 | 2.46 | 2.86 | 3.07 | 1.40 | 3.37 |
| nC31/% | 1.96 | 2.48 | 5.09 | 3.55 | 1.58 | 4.09 |
| nC32/% | 2.24 | 2.52 | 7.15 | 3.97 | 2.27 | 3.37 |
| nC33/% | 3.38 | 2.32 | 10.35 | 4.21 | 2.20 | 3.13 |
| nC34/% | 4.60 | 2.30 | 13.29 | 4.16 | 1.87 | 2.54 |
| nC35/% | 7.63 | 2.48 | 15.68 | 4.37 | 0.68 | 1.96 |
Table 5. Cleaning rate of citric acid cleaner with different formulation parameters.

| Cleaning Rate/% | Stirring Speed/(m³/h) | Contact Time/min | Cleaning Temperature/°C | Cleaning Agent Dosage/g |
| 0 | 120 | 10 | 20 | 3 |
| 6.15 | 240 | 15 | 30 | 4.5 |
| 7.26 | 360 | 20 | 40 | 6 |
| 3.88 | 60 | 25 | 50 | 7.5 |
| 0.35 | 180 | 30 | 60 | 9 |
| 92.55 | 300 | 35 | 70 | 10.5 |
| 91.26 | 300 | 35 | 70 | 12 |
| 25.86 | 180 | 30 | 60 | 13.5 |
Table 6. Comparison of errors in three regressions.

| | Ridge Regression | Stepwise Regression | Lasso Regression |
| MSE | 3.56 | 3.25 | 2.38 |
| RMSE | 1.887 | 1.803 | 1.543 |
| MAE | 1.954 | 1.896 | 1.755 |
| R² | 0.911 | 0.869 | 0.932 |