Symbolic regression based genetic approximations of the Colebrook equation for flow friction

Widely used in hydraulics, the Colebrook equation for flow friction relates implicitly to the input parameters; the Reynolds number, and the relative roughness of inner pipe surface, with the output unknown parameter; the flow friction factor. In this paper, a few explicit approximations to the Colebrook equation are generated using the ability of artificial intelligence to make inner patterns to connect input and output parameters in explicit way not knowing their nature or the physical law that connects them, but only knowing raw numbers. The fact that the used genetic programming tool does not know the structure of the Colebrook equation which is based on computationally expensive logarithmic law, is used to obtain better structure of the approximations which is less demanding for calculation but also enough accurate. All generated approximations are with low computational cost because they contain a limited number of logarithmic forms used although for normalization of input parameters or for acceleration, but they are also sufficiently accurate. The relative error regarding the friction factor in best case is up to 0.13% with only two logarithmic forms used. As the second logarithm can be accurately approximated by the Pade approximation, practically the same error is obtained also using only one logarithm.


Introduction
The Colebrook equation for flow friction is one of the most used formulas in hydraulics, which is a branch of civil engineering, that deals with the conveyance of liquids through pipes. It is also widely used in mechanical, petroleum and chemical engineering, etc., wherever flow through pipes occur. It is an empirical relation developed by Colebrook [1] based on his experiment with White [2]. The experiment dealt with flow of air/liquid through artificially roughened pipes; Equation (1): In Equation (1), λ is the Darcy flow friction factor, Re is the Reynolds number, and ε/D is the relative roughness of inner pipe surface (all three quantities are dimensionless).
In the Colebrook equation, the flow friction factor λ is implicitly given, λ = f (λ, Re, ε/D) where it can be expressed in an explicit way only approximately [3][4][5][6][7][8], λ ≈ f (Re, ε/D) or otherwise the original equation can be solved iteratively [9,10]. Today, it is important not only to have accurate but also computationally efficient approximations [11][12][13]. Here, we used the ability of artificial intelligence to connect input data; in our case the Reynolds number, Re and the relative roughness of inner pipe surface, ε/D with the output parameter; in our case the flow friction factor, λ not knowing the structure of the Colebrook equation [14][15][16][17]. We used the ability of artificial intelligence to connect input with output data to form patterns not knowing the nature of the data or the physical law that connects them (a similar approach is valid for other branches of hydraulics [18,19]). In that way, we tried to avoid the computationally expensive logarithmic law on which the Colebrook equation is based. As a final product we developed few low-cost but very accurate explicit approximations to the Colebrook equation.

Methods Used, Preparation of Data and Software Tool, Results, Structure of Approximations, Accuracy and Comparative Analysis
The main idea is to use the ability of artificial intelligence to connect input data sets; in our case the Reynolds number, Re and the relative roughness of inner pipe surface, ε/D with the output data set; in our case the flow friction factor, λ; {Re, ε/D}→{λ}, not knowing the physical law which connects input to output. Sign "→" practically represents the Colebrook equation; Equation (1), but the genetic programming tool is not aware of that fact. To prepare data to feed the genetic programming tool, we covered the whole practical domain of applicability of the Colebrook equation; which is for the Reynolds number, Re between 4000 and 10 8 (whole turbulent flow covered) and for the relative roughness of inner pipe surface, ε/D up to 0.05 (pipe covered from practically smooth to the very rough) [20] with a mesh which consists of 90 thousand intersection points {Re, ε/D} for which we calculated very accurately the flow friction factor, λ using the Colebrook equation; Equation (1); Figure 1a,b.
Water 2018, 10, x FOR PEER REVIEW 2 of 14 inner pipe surface, ε/D with the output parameter; in our case the flow friction factor, λ not knowing the structure of the Colebrook equation [14][15][16][17]. We used the ability of artificial intelligence to connect input with output data to form patterns not knowing the nature of the data or the physical law that connects them (a similar approach is valid for other branches of hydraulics [18,19]). In that way, we tried to avoid the computationally expensive logarithmic law on which the Colebrook equation is based. As a final product we developed few low-cost but very accurate explicit approximations to the Colebrook equation.

Methods Used, Preparation of Data and Software Tool, Results, Structure of Approximations, Accuracy and Comparative Analysis
The main idea is to use the ability of artificial intelligence to connect input data sets; in our case the Reynolds number, Re and the relative roughness of inner pipe surface, ε/D with the output data set; in our case the flow friction factor, λ; {Re, ε/D}→{λ}, not knowing the physical law which connects input to output. Sign "→" practically represents the Colebrook equation; Equation (1), but the genetic programming tool is not aware of that fact. To prepare data to feed the genetic programming tool, we covered the whole practical domain of applicability of the Colebrook equation; which is for the Reynolds number, Re between 4000 and 10 8 (whole turbulent flow covered) and for the relative roughness of inner pipe surface, ε/D up to 0.05 (pipe covered from practically smooth to the very rough) [20] with a mesh which consists of 90 thousand intersection points {Re, ε/D} for which we calculated very accurately the flow friction factor, λ using the Colebrook equation; Equation (1); Figure 1a,b.
(a) Having 90 thousand combinations; in order to test robustness of the symbolic regression algorithms we fed the genetic programming tool with 200 triplets {Re, ε/D}i, {λ}i (see Supplementary Material attached to this paper) hoping that it will connect {Re, ε/D}i→{λ}i accurately. The input sample was generated according to the uniform density function of each input variable. The lowdiscrepancy Sobol sequences were employed [21]. These so-called quasirandom sequences have useful properties. In contrary to random numbers, quasirandom numbers cover the space more quickly and evenly. Thus, they leave very few holes. We used [computer software] Eureqa by Nutonian, Inc., Boston, MA, as a genetic programming tool [22,23]. The symbolic regression approach adopted herein [24][25][26][27][28][29] is based upon genetic programming wherein a population of functions is allowed to breed and mutate with the genetic propagation into subsequent generations based on survival-of-the-fittest criteria [30]. The main goal of this study is to make accurate and computationally cheap explicit approximations of the Colebrook equation, where computationally cheap means to contain the least possible number of logarithmic functions and non-integer powers [31][32][33][34][35][36].
We can see that approximations found by Eureqa [computer software] have the form {R, K}→xsol, where the symbol R denotes the Reynolds number, K represents relative roughness and = 1 √ . All accurate models are computationally expensive, as they contain many logarithmic terms with different arguments. Thus, we can see that Eureqa itself requires human knowledge, in order to obtain an accurate but still a computationally cheap approximation of the Colebrook Having 90 thousand combinations; in order to test robustness of the symbolic regression algorithms we fed the genetic programming tool with 200 triplets {Re, ε/D} i , {λ} i (see Supplementary Material attached to this paper) hoping that it will connect {Re, ε/D} i →{λ} i accurately. The input sample was generated according to the uniform density function of each input variable. The low-discrepancy Sobol sequences were employed [21]. These so-called quasirandom sequences have useful properties. In contrary to random numbers, quasirandom numbers cover the space more quickly and evenly. Thus, they leave very few holes. We used [computer software] Eureqa by Nutonian, Inc., Boston, MA, as a genetic programming tool [22,23]. The symbolic regression approach adopted herein [24][25][26][27][28][29] is based upon genetic programming wherein a population of functions is allowed to breed and mutate with the genetic propagation into subsequent generations based on survival-of-the-fittest criteria [30]. The main goal of this study is to make accurate and computationally cheap explicit approximations of the Colebrook equation, where computationally cheap means to contain the least possible number of logarithmic functions and non-integer powers [31][32][33][34][35][36].
We can see that approximations found by Eureqa [computer software] have the form {R, K}→xsol, where the symbol R denotes the Reynolds number, K represents relative roughness and xsol = 1 √ λ . All accurate models are computationally expensive, as they contain many logarithmic terms with different arguments. Thus, we can see that Eureqa itself requires human knowledge, in order to obtain an accurate but still a computationally cheap approximation of the Colebrook equation. Consequently, we will combine several approaches in this paper: Eureqa [computer software] [22,23], the fixed-point iteration [9] and Padé approximation [10]. According to our numerical experiments, Eureqa seems to be useful especially for finding a computationally cheap rational approximation of the Colebrook solution, which serves as a good starting point for the fixed-point iteration method (acceleration). Finally, the Padé approximation is used as a cheap but very accurate approximation of the logarithm in the second and the successful iterations of the fixed-point method.

Input Parameters in Their Raw Form
Using the input parameters in their raw form {Re, ε/D} i →{λ} i , Eureqa, the used genetic programming tool gives a set of approximations in polynomial forms [12]. Knowing that logarithmic expressions and non-integer powers are expensive for computation, we hoped that we have fully accomplished our task. Unfortunately, Eureqa gives a number of not very accurate solutions and here we show Equation (2) with the relative error of λ 0 even up to 16.56% in respect to the accurate λ, where the relative error [5,26] is defined as (|λ accurate − λ|/λ accurate )·100%, where λ accurate is calculated in an iterative procedure using the original implicitly given Colebrook equation [6,9]; Equation (1), while λ is obtained through the presented approximations; Equations (2)-(6). In Equation (2), "↔" means related but not sufficiently accurate: On the other hand, we found that the accuracy can increase significantly using one fixed-point iterative cycle of acceleration [9]; Equation (2a), after which accuracy of λ 1 increases up to 0.98%.

Normalized Input Parameters
Unfortunately, in our case using the input parameters in their raw form the accuracy was not at a high level without acceleration, so having previous experience with the same problem where we used Artificial Neural Network [15,16] to simulate results, we normalized parameters a = log10(Re), b = −log10(ε/D), in order to avoid discrepancy in the scale which are in raw form 1000 < Re < 10 8 and ε/D << 1 and after normalization 3.5 < a < 8 and 1.3 < b < 6.5 (Eureqa, software used a as genetic

Normalized Input Parameters
Unfortunately, in our case using the input parameters in their raw form the accuracy was not at a high level without acceleration, so having previous experience with the same problem where we used Artificial Neural Network [15,16] to simulate results, we normalized parameters a = log 10 (Re), b = −log 10 (ε/D), in order to avoid discrepancy in the scale which are in raw form 1000 < Re < 10 8 and ε/D << 1 and after normalization 3.5 < a < 8 and 1.3 < b < 6.5 (Eureqa, software used a as genetic programming tool also suggested to us a data normalization process) [34][35][36]. The normalization gives relatively good results, and the genetic programming tool generated more accurate results without knowing that the logarithmic form of the Colebrook equation was originally used but only knowing the predicted input and output datasets; Figure 3:  (2)-(up), Equation (2a) after first step of fixed-point acceleration-(middle), and the second step of acceleration-(down); relative error up to 16.56%, up to 0.98% and up to 0.13% respectively.

Normalized Input Parameters
Unfortunately, in our case using the input parameters in their raw form the accuracy was not at a high level without acceleration, so having previous experience with the same problem where we used Artificial Neural Network [15,16] to simulate results, we normalized parameters a = log10(Re), b = −log10(ε/D), in order to avoid discrepancy in the scale which are in raw form 1000 < Re < 10 8 and ε/D << 1 and after normalization 3.5 < a < 8 and 1.3 < b < 6.5 (Eureqa, software used a as genetic programming tool also suggested to us a data normalization process) [34][35][36]. The normalization gives relatively good results, and the genetic programming tool generated more accurate results without knowing that the logarithmic form of the Colebrook equation was originally used but only knowing the predicted input and output datasets; Figure 3: Using our previous experience [15] with the training of the Artificial Neural Network where very good results were achieved through the normalization of parameters; a = log10(Re), b = −log10(ε/D), the genetic programming tool generated a dozen equations with different levels of accuracy and complexity, but fortunately none of them contain logarithms or non-integer power terms. Here, we present the four most successful explicit approximations; Equations (3)- (6). Adding one additional logarithmic form for acceleration using one additional fixed-point iterative step [9,45]; Equation (2a), the accuracy of the approximations increases significantly (about 10 times); Equations (3a)-(6a). Using our previous experience [15] with the training of the Artificial Neural Network where very good results were achieved through the normalization of parameters; a = log 10 (Re), b = −log 10 (ε/D), the genetic programming tool generated a dozen equations with different levels of accuracy and complexity, but fortunately none of them contain logarithms or non-integer power terms. Here, we present the four most successful explicit approximations; Equations (3)-(6). Adding one additional logarithmic form for acceleration using one additional fixed-point iterative step [9,45]; Equation (2a), the accuracy of the approximations increases significantly (about 10 times); Equations (3a)-(6a). Results are in the form {λ} 0 ↔{a = log 10 (Re), b = −log 10 (ε/D)} 0 , {λ} 1 ≈{log 10 (λ 0 )} 1 , {λ} 2 ≈{log 10 (log 10 (λ 0 ))} 2 , etc., where "↔" means related but not sufficiently accurate, while "≈" is reasonably accurate enough.

Discussion-Comparative Analysis
In general, the relative error of approximations of the Colebrook equation is not uniformly dispersed over the domain of applicability of the Colebrook equation [7,29], where the more complex is not always more accurate which can be noticed by comparing Equation (3) and Equation (4) and related Figures 4 and 5, respectively. According to Eureqa, the software package which generated the approximations, Equation (5) is three times more complex compared with Equation (3), while Equation (4) is 1.5 times more complex compared with Equation (3). Using genetic algorithms [25,29] or the Excel fitting tool [27], there is a possibility to increase the accuracy of the presented approximations by changing their parameters near the current value. If the structure of the approximations remains unchanged, the complexity remains also unchanged while accuracy can increase [4,12,13], or in other words, the structure of the equations remains unchanged while the values of the coefficient change in order to fit better values obtained through the Colebrook equation. In that way error from Figures 4-7 can decrease but also the distribution of the error over the domain of applicability changes [7,29].
To summarize our findings, we made Table 1, in which the maximum relative errors of approximations compared to the accurate λ are expressed as a function of complexity.

Discussion-Comparative Analysis
In general, the relative error of approximations of the Colebrook equation is not uniformly dispersed over the domain of applicability of the Colebrook equation [7,29], where the more complex is not always more accurate which can be noticed by comparing Equation (3) and Equation (4) and related Figures 4 and 5, respectively. According to Eureqa, the software package which generated the approximations, Equation (5) is three times more complex compared with Equation (3), while Equation (4) is 1.5 times more complex compared with Equation (3). Using genetic algorithms [25,29] or the Excel fitting tool [27], there is a possibility to increase the accuracy of the presented approximations by changing their parameters near the current value. If the structure of the approximations remains unchanged, the complexity remains also unchanged while accuracy can increase [4,12,13], or in other words, the structure of the equations remains unchanged while the values of the coefficient change in order to fit better values obtained through the Colebrook equation. In that way error from Figures 4-7 can decrease but also the distribution of the error over the domain of applicability changes [7,29].
To summarize our findings, we made Table 1, in which the maximum relative errors of approximations compared to the accurate λ are expressed as a function of complexity.

Discussion-Comparative Analysis
In general, the relative error of approximations of the Colebrook equation is not uniformly dispersed over the domain of applicability of the Colebrook equation [7,29], where the more complex is not always more accurate which can be noticed by comparing Equation (3) and Equation (4) and related Figures 4 and 5, respectively. According to Eureqa, the software package which generated the approximations, Equation (5) is three times more complex compared with Equation (3), while Equation (4) is 1.5 times more complex compared with Equation (3). Using genetic algorithms [25,29] or the Excel fitting tool [27], there is a possibility to increase the accuracy of the presented approximations by changing their parameters near the current value. If the structure of the approximations remains unchanged, the complexity remains also unchanged while accuracy can increase [4,12,13], or in other words, the structure of the equations remains unchanged while the values of the coefficient change in order to fit better values obtained through the Colebrook equation. In that way error from Figures 4-7 can decrease but also the distribution of the error over the domain of applicability changes [7,29].
To summarize our findings, we made Table 1, in which the maximum relative errors of approximations compared to the accurate λ are expressed as a function of complexity. High It is clear that polynomial approximation was accelerated through the fixed-point iterative procedure; Equation (2a) shows better performances compared with those with normalized input parameters; Equations (3a)-(6a). For example, Equation (2a) with two logarithmic functions (used for acceleration) gives a relative error of no more than λ 2 →0.13% compared with the approximately same error of Equation (6a) with three logarithmic forms (two for normalization and one for acceleration); Equation (6a)-λ 1 →0.17%. Also, the expression for λ 0 in case of Equation (2) is a polynomial while Equations (4)-(6) contain sinus trigonometric function [46]. After this careful analysis it can be concluded that it is better to use computationally expensive logarithmic functions for acceleration through Equation (2a) and not for normalization.

•
Moderately accurate: Our Equation (6) with the relative error up to 2% does contain only two logarithmic expressions used for normalization and no non-integer power, and its accuracy can be compared with approximations by Swamee and Jain [53] (2.18-1.75%), Manadili [54] (2-1.5%), Brkić [42,[55][56][57] (2-1.3%), Haland [58] (1.4-1.1%), etc., all with the same or higher complexity as Equation (6). Equation (2a) after the first step of acceleration with only one logarithmic function and with the relative error of up to 2.6% is even more efficient. • Low accuracy: Our accelerated Equation (3a) is very simple with the relative error up to 5.35% but with only one peak of high error (otherwise up to 3% as can be seen from Figure 4); it is more accurate compared with approximations by Round [59]

Possible Simplifications
As already noted, the main goal is to produce not only accurate, but also computationally low cost [11][12][13]29] explicit approximations of the Colebrook equation. Trigonometric functions are used in Equations (4)- (6) and in their accelerated versions Equations (4)-(6), which is not in common use related to the approximations of the Colebrook equation. These trigonometric functions can also have a higher computational cost. On the other hand, use of the Padé approximation sin(x) ≈ x·(60 − 7x 2 )/(60 + 3x 2 ) can potentially overwhelm the problem [64]. The Padé approximation x·(60 − 7x 2 )/(60 + 3x 2 ) within the domain of interest; −0.08821 < x < 1.18456; compared with sin(x) can introduce the relative error up to 0.068%. Within the same domain, Eureqa gives a number of approximations for sin(x), but we have chosen to present here one accurate but relatively simple; sin(x) ≈ x − x 2 /5350.6747 − x 3 /6.0171 + x 5 /127.4678 with the relative error up to 0.003%. This error of approximations for sin(x) also has impact on the final error of our approximations; Equations (4)- (6) and their accelerated pairs; Equations (4a)-(6a).
Also, the Colebrook equation can be transformed in the mathematically equivalent form Equation (7) that is more suitable for further Padé simplifications. The main idea is to use already computed parameter b= −log 10 (ε/D) and to use the Padé polynomial in the form ln (1 − θ), where θ is given by Equation (8). In Equation (7), both 2·log 10 (3.71) ≈ 1.1387478 and 2/ln(10) ≈ 0.8686 are constant, while b = log 10 (ε/D) is recycled as already evaluated during the normalization of input parameters.
Note that Equation (7) does not work for ε/D = 0, but Equation (1) does work; for practical application in the case of a smooth regime, both relations works; for example, even for very smooth surfaces as for ε/D = 10 −9 . In practice [20], pipes are never that smooth to reach the value ε/D = 0.
The idea to eliminate the remained logarithmic form from the accelerated equations using the Padé approximation [64] for ln(1 − θ) where θ is defined by Equation (8) in this case is not sustainable because of a wide domain of θ that contains a value between −30,394.85651 and −2.92065 × 10 −6 within the practical domain of the Colebrook equation; 4000 < Re < 10 8 and 0 < ε/D < 0.05.
Regarding the normalization of the input parameter a = log 10 (Re), we hoped also to use the fact that the practical domain of the Reynolds number Re, is from 4000 to 10 8 which has as a consequence log 10 (Re) ≈ log 10 (1 + Re), where for ln(1 + y) numerous approximations are available, but mostly for the argument z around 0. This is because some computer algebra systems and programming languages provide a special natural logarithm plus 1 function alternatively named to give more accurate results for values of x close to zero compared to using ln(1 + y) directly. In our case this is not of interest, knowing that for y = Re; y >> 1. Of course, to use only numbers between 1 and 10, we can use the rule ln(Re) = ln(z10 n ) = ln(z) + nln(10) where n = len(int(z)) and z = Re/10 n ; len is a function which calculates number of digits in a number while int is a function which gives a number down to the nearest integer; ln(10) = 2.30258509. A similar method can be used for the normalized parameter b = −log 10 (ε/D) where ε/D is between 0 and 0.05.
A fast but still reliable Padé approximation useful for the Colebrook equation has been recently introduced in [10]. The logarithm term log 10 (z) = ln(z) ln (10) , where argument z~1 is approximated by the rational function: In this case, only one logarithm is computed and is stored in the computer memory: log 10 (y 1 ), whereas log 10 (y i+1 ) is approximated by log 10 (y 1 ) and by Equation (9) as log 10 (y 1+1 ) = log 10 (y 1 ) − log 10 (z) (10) where z = y 1 y i+1 is defined in Equation (2b). When the Padé approximation (9) is applied to Equation (2a) and the starting point is estimated by Equation (2), the maximum relative error of Equation (10) is negligible: 5 × 10 −10 %. Thus, Equation (10) is very accurate when the iterative process defined by Equation (2a) is initiated by a good starting point (A comparison of iterative methods for solving the Colebrook equation is given in [65]). Consequently, the argument z of Equation (9) is very close to one in all cases within the domain of interest of the Colebrook equation [10].

Conclusions
Evaluation of hydraulic resistance, i.e., computation of flow friction factor, λ is one of the main tasks encountered in engineering practice wherever flow of fluid through closed conduits occur. Up to now, the empirical Colebrook Equation (1) which is implicitly given by a flow friction factor, λ is still accepted as an informal standard after 80 years. Obvious disadvantages of an implicit relationship inspired numerous efforts to derive as accurate possible explicit equivalent; λ ≈ f (Re, ε/D) to the original Colebrook equation; λ = f (Re, ε/D, λ). Nowadays, the requirement is not only to develop accurate, but also computationally efficient approximations to the Colebrook equation, which inspired us to use the ability of artificial intelligence to recognize patterns not knowing their true physical laws. Using genetic programming we developed a few accurate approximations to the Colebrook equation avoiding extensive use of computationally expensive logarithmic forms on which Colebrook's relation is based. The most accurate approximations presented here have a relative error of up to 0.13% with only two and three logarithmic forms used which makes it very balanced in the ratio between accuracy and computational efficiency. The polynomial expression Equation (2) accelerated through the two steps of fixed-point iterative procedure; Equation (2a) introduced the relative error up to 0.13% using only two logarithmic functions. Practically the same error is introduced, when only one logarithmic function is computed, whereas the second logarithm is approximated by the Padé approximation. On the other hand, Equation (6a) reaches approximately the same accuracy using three logarithmic functions (two for normalization of input parameters and one for fixed-point acceleration) and using also one sinus trigonometric function. Therefore, a polynomial function can be recommended because it is cheaper for computation and because acceleration through the fixed-point iterative procedure is more efficient compared with the normalization of input parameters.
Our genetic approximations are valid only for a turbulent regime, i.e., for a simulation of the Colebrook equation. A transition from a laminar to a turbulent regime is rapid and it is not covered by the Colebrook formula. Our previous work with artificial neural networks shows that this transition cannot be simulated easily using artificial intelligence techniques. We found that the Colebrook equation can be simulated extremely accurately by a neural network with 50 neurons in the hidden layer. However, with the transition from laminar to turbulent regime included, even such complex network with 50 neurons introduces a very high error in this critical zone even after long lasting and complex strategies of training. Similar findings are also valid for the genetic approach.