Risk Assessment in Energy Infrastructure Installations by Horizontal Directional Drilling Using Machine Learning

Krechowicz, Maria; Krechowicz, Adam

doi:10.3390/en14020289


by Maria Krechowicz ¹,* and Adam Krechowicz ²

¹ Faculty of Management and Computer Modelling, Kielce University of Technology, Al. 1000-lecia Państwa Polskiego 7, 25-314 Kielce, Poland

² Faculty of Electrical Engineering, Automatic Control and Computer Science, Kielce University of Technology, Al. 1000-lecia Państwa Polskiego 7, 25-314 Kielce, Poland

* Author to whom correspondence should be addressed.

Energies 2021, 14(2), 289; https://doi.org/10.3390/en14020289

Submission received: 2 December 2020 / Revised: 30 December 2020 / Accepted: 5 January 2021 / Published: 7 January 2021


Abstract:

A growing demand for new gas pipeline installations can currently be observed in Europe. A large number of them are installed using trenchless Horizontal Directional Drilling (HDD) technology. The aim of this work was to develop and compare new machine learning models dedicated to risk assessment in HDD projects. Data from 133 HDD projects from eight countries were gathered, profiled, and preprocessed. Three machine learning models, logistic regression, random forests, and an Artificial Neural Network (ANN), were developed to predict the overall HDD project outcome (failure-free installation or installation likely to fail) and the occurrence of identified unwanted events. The best performance in terms of recall and accuracy was achieved by the developed ANN model, which proved to be efficient, fast and robust in predicting risks in HDD projects. Applying machine learning in the proposed models made it possible to eliminate the involvement of a group of experts in the risk assessment process and therefore to significantly lower its costs. Future research may be oriented towards developing a comprehensive risk management system that enables dynamic risk assessment taking into account various combinations of risk mitigation actions.

Keywords:

risk assessment; pipeline installation; Horizontal Directional Drilling; energy infrastructure; machine learning

1. Introduction

Pro-ecological trends in the European Union’s energy policy are reflected in the increasing popularity of the idea of sustainable development. The related increase in the demand for energy from natural gas has resulted in a growing demand for the construction of new gas pipelines in Europe. According to Eurostat data, natural gas inland consumption in the European Union in 2019 increased by 4.2% compared with 2018, reaching a level not seen since 2010 [1]. According to Statistics Poland data, the consumption of natural gas in Poland (without taking into account the consumption for technological purposes of the gas sector) reached 628.5 PJ in 2017 [2] and increased to 660.3 PJ in 2018 [3]. Due to the intensive expansion of the gas pipeline network, the length of the active gas transmission network in Poland increased from 21,139.631 km in 2015 to 21,264.629 km in 2019 [4]. A large part of new gas pipelines is installed using a trenchless construction technique called Horizontal Directional Drilling (HDD). Its growing popularity is connected not only with lower installation cost and the possibility of steering around natural or man-made obstacles, but also with lower negative environmental impact. In the case of trenchless technologies, greenhouse gas emissions are lower due to shorter project durations, lower equipment requirements and a smaller footprint of the excavation area, compared with open-cut pipeline installation methods [5]. However, this technology, like every trenchless or open-cut construction technique, is associated with certain risk, which should be assessed. It is important to stress that risk assessment is imperative for each construction project.

1.1. Principles of HDD Technology

Horizontal Directional Drilling is currently one of the most popular trenchless construction techniques used to install underground utilities in congested urban environments, under various obstacles and in environmentally sensitive areas. Energy infrastructure pipes transporting gas and oil, heating pipes, casings for electrical and telecommunication cables, as well as water and sewage pipelines are commonly installed using this technology. Compared to the traditional open-cut technique, HDD usually offers advantages such as lower installation costs, reduced negative impact on the environment, reduced land use and a faster project timeline [6]. Since its first implementation in 1971, the technology has become more and more complex, and both the equipment and contractor capabilities have improved, making it possible to drill successfully in demanding geological conditions and to install pipelines of larger diameters and lengths. Currently, HDD is a multi-billion dollar annual industry worldwide. The typical HDD installation process consists of three phases: drilling a pilot hole along the desired directional path, enlarging the pilot hole to the desired diameter to accommodate the product pipe, and pulling the product pipe back. Steering around natural or anthropogenic obstacles is possible by changing the direction of the drilling head in the pilot hole drilling phase. More details about the HDD process were provided by Willoughby [7], Bennett and Ariaratnam [8], and Najafi [9].

1.2. Risk in HDD Technology

In HDD technology, a significant level of risk and uncertainty results from the variability of geotechnical conditions, the often limited access to specialized tools and machines deployed underground, the dynamic natural environment, technical problems, the human factor and the changing economic environment. It must be stressed that HDD project failure could lead not only to significant economic loss, but also to an increased environmental or social impact of construction, as well as to accidents on the building site or fatalities. Contractors, designers and owners engaged in HDD projects stress the need to carry out risk assessment before starting the investment, as a thoroughly estimated risk level is the entry point for the project feasibility study and cost estimation. Carefully conducted risk assessment makes it possible to avoid several significant economic and legal consequences of HDD failure, such as damaging adjacent existing underground utilities or ground infrastructure, costly HDD down-hole equipment, or the product pipe. It is important to stress that in complex and innovative construction projects, a properly conducted risk assessment at the project preparation stage supports the desired course of the project [10,11].

It is crucial to pay attention to a proper design of the HDD trajectory [12], as well as its optimization [13], taking into account adjacent existing urban infrastructure and the technical feasibility of the design. It is particularly dangerous when the designed HDD trajectory collides with existing elements of the underground or terrestrial urban infrastructure transporting oil, gas and power cables. The risk of striking working energy infrastructure is increased especially in congested cities and in cases where the localization of the existing elements of urban infrastructure was carelessly made or not done at all. In December 2019, an existing gas pipeline was damaged during trenchless drilling in Szczyrk (Poland). The gas escaped for 20 min; as a result, a fire occurred and a neighbouring tenement house collapsed, killing eight people. Such a situation could have been foreseen and, as a result, would probably have been avoided if a risk analysis had been carried out at the investment preparation stage. Such an analysis would reveal potential problems connected with the quality of the HDD design, unfavourable geological conditions and adjacent urban infrastructure that may arise during the investment. That is why a quality design of the HDD trajectory in relation to the on-site geological conditions, adjacent urban infrastructure and project specificity is important.

1.3. Contribution of the Proposed Approach

The aim of this paper is to contribute new risk prediction models for HDD technology and to compare their performance. Two alternative models, using random forests and artificial neural networks, were developed to predict the occurrence of the identified unwanted events in HDD technology (“occurred”/“did not occur”) and the overall HDD project outcome (failure-free installation or installation likely to fail). Additionally, a model using logistic regression was developed to predict risk for several unwanted events. Applying machine learning in the proposed models made it possible to eliminate expert involvement in the risk assessment process and to significantly lower the associated costs. The main contributions of this paper to the body of knowledge include: collecting essential data from HDD projects; identifying attributes relevant for risk analysis in HDD projects (dimensions of the data set); developing three machine learning models for predicting risk in HDD projects, which remove the drawbacks and limitations of the risk assessment models previously described in the literature; identifying the most important metrics for risk assessment in HDD projects; and selecting the best-performing model among the three proposed machine learning models.

The outline of this paper is as follows. Section 2 reviews the literature and the limitations of previous work. Section 3 presents the proposed approach, including data collection from HDD projects carried out in eight countries, data profiling and preprocessing, the development of three machine learning models, and the metrics used for their evaluation. The experimental results and their discussion, showing the main outcomes of the three proposed machine learning models and a comparative analysis of their results, are presented in Section 4 and Section 5. Section 6 summarizes the paper.

2. Review of Literature and Limitations of the Previous Work

The subject of risk identification in HDD projects has been discussed by several authors. The most important risks in HDD technology were described in [7,8,14,15,16]. Various risk assessment models for HDD projects have been developed in recent years by researchers and practitioners. Some important issues on this topic have been presented in the author’s previous works [17,18,19], in which an expert system with Fuzzy Fault analysis was applied for risk evaluation. In such an approach it was necessary to gather a group of experts who, after familiarizing themselves with the HDD project documentation and specificity, assessed each of 22 risk factors individually. One of the advantages of those models was the possibility of taking into account the specific and dynamic conditions in which the analysed HDD project is carried out. On the other hand, the involvement of an experienced group of experts was required, which was sometimes problematic due to the high costs associated with their participation and the deficit of qualified specialists on the market. Moreover, special care must be taken in selecting the experts, because years of experience in HDD projects of a particular size, expertise and practical skills are indispensable. In addition, it is vital to draw separate membership functions for each group of specialists, depicting how they understand a given linguistic term describing the possibility of an unwanted event occurring (e.g., medium risk). Ma et al. proposed a risk assessment model dedicated to MAXI HDD projects, in which the fuzzy comprehensive evaluation method and the analytical hierarchy process were used [20]. The combination of those two methods gave an improved theoretical basis for risk assessment of MAXI HDD projects. Five risk factors were identified: natural, technical, economic, environmental and management. They were dependent on 17 subfactors. This model also required inviting relevant experts to analyse each index and to evaluate the relative significance of the factors and subfactors. In [21] a model based on Failure Mode and Effects Analysis (FMEA) was developed for making a preliminary evaluation of the risk in HDD projects, especially those with a modest budget, in which a group of experts could not be involved in the risk assessment process. However, that model is based on a statistical approach and is not sufficiently accurate to assess the risk of larger and more comprehensive HDD projects. That is why it is necessary to develop a new risk assessment model for HDD projects in which the need to employ a group of HDD experts is limited. Machine learning offers the possibility of overcoming both the inconvenience of involving experts in the risk evaluation process and the inaccuracy of adapting statistical models to the dynamic and specific conditions on the construction site.

“Machine learning”, sometimes referred to as a branch of artificial intelligence, is a multidisciplinary term covering a set of soft computing techniques and algorithms that deal with complex natural systems and improve automatically through experience. Artificial neural networks, fuzzy logic, support vector machines, genetic algorithms, and hybrid systems are regarded as the most popular machine learning tools [22]. Machine learning can be applied to real-life construction industry problems to improve the quality of design, create a safer jobsite, assess and mitigate risk, increase the project’s lifecycle, and estimate a project’s profitability.

Because neural networks can compensate for the inherent uncertainties and imperfections present in geotechnical engineering, they can be successfully implemented in geotechnical engineering and building construction projects [23] and in trenchless technology. In [24], ANNs were used to predict surface heave caused by shallow subsurface utility installations carried out using Horizontal Directional Drilling, which is one of the important risk factors in HDD technology. ANNs have also been successfully used to predict the rate of penetration while drilling carbonate reservoirs [25]. Pollock et al. [26] used machine learning algorithms to improve the efficiency of directional drilling (optimizing the rate of penetration, reducing borehole tortuosity, lowering the number of personnel on board and improving consistency across operations). Bayesian networks (BN) and ANNs have been successfully used for risk assessment in trenchless construction projects applying tunnelling technology, e.g., risk assessment of road tunnels [27], risk analysis of the construction of the Porto Metro tunnel [28], risk assessment of damage to existing surface properties caused by tunnelling [29], safety risk assessment for metro construction projects [30], as well as evaluation of the jamming risk of shielded tunnel boring machines in adverse ground conditions such as squeezing grounds [31].

In [32] a risk assessment model for Box Jacking Technique of installing rectangular box culverts under existing facilities was proposed. In this approach, the influence of various parameters on surface settlement risk was determined for Box Jacking installations in sandy soil using artificial neural network and multiple linear regression analysis with finite-element modelling. It was found that soil cohesion, box culvert depth, and overcut size were the most important determinants of a surface settlement.

In [33] a new model for predicting the condition of un-inspected sanitary sewer pipes using a Gradient Boosting Tree was presented. The prediction model was built on thirteen independent variables and achieved 87% accuracy in predicting the condition of un-inspected sewer pipes. It made it possible to forecast the condition of sewer pipes that have not been inspected so far, and therefore to eliminate the costs associated with Closed-circuit television (CCTV) inspections and to overcome problems connected with inspecting only a limited portion of an entire sewer system. This model is especially helpful for utility companies and municipalities in forecasting the condition of sanitary sewer pipes, scheduling inspections, and making cost-effective decisions.

Machine learning was also successfully used for identification of the significant factors that impact the prediction of remaining useful life of water pipelines. In [34] Artificial Neural Networks and Adaptive Neuro–Fuzzy Models were applied to predict remaining useful life of water pipelines. The presented approach could be also adjusted to be useful for other types of pipelines, e.g., gas pipelines.

In [35] the concept of using artificial neural networks in the organizational and technological planning phase of engineering projects, particularly building works, was presented. Juszczyk and Leśnak [36] used combined Artificial Neural Networks to develop a model able to predict a construction site cost index.

The presented literature study showed that machine learning allows for effective prediction of various risks in trenchless technologies and the construction industry. This paper is in line with the trend of modern risk assessment using machine learning. It seems that machine learning is more commonly applied to predict the occurrence of a specific risk than to predict the risks of several unwanted events and the overall project outcome (failure-free installation or installation likely to fail). Applying machine learning in trenchless technologies and the construction industry supports cost-effective planning and risk prediction, eliminating several inconveniences related to the need to involve a group of experts or to conduct a series of inspections. The literature analysis showed that machine learning has been used for HDD technology once, to predict surface heave caused by shallow subsurface utility installations. However, no model was found in the literature in which machine learning was used for comprehensive risk assessment in HDD projects to predict the overall project outcome and the occurrence of the most important unwanted events.

3. Proposed Approach

Figure 1 outlines the proposed approach for predicting the overall HDD project outcome (failure-free installation, installation likely to fail), as well as the occurrence of the identified unwanted events (“occurred”/“did not occur”). The proposed approach includes 6 steps: data gathering, data profiling, data preprocessing, machine learning models development, models evaluation and comparison. In this paper, three popular machine learning methods were applied. For the simplest cases, where the correlation between features and unwanted events was high (above 0.85 or below −0.85), a logistic regression model was used. Then, for all cases, artificial neural network and random forest models were applied. All three models were evaluated with commonly applied metrics such as accuracy, precision, recall, $f_{1}$ score, and AUC score. A detailed description of the individual steps is provided in the subsections below.
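As an illustration of the threshold-based metrics named above, the following minimal sketch computes accuracy, precision, recall and the $f_{1}$ score from binary labels. The function name and the example labels are hypothetical and are not taken from the study's data; AUC is omitted because it requires predicted probabilities rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels.

    Labels are assumed to be 1 ("occurred") or 0 ("did not occur").
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for eight installations (not from the paper's data set).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

In a risk assessment setting, recall is of particular interest, because a missed unwanted event (a false negative) is usually more costly than a false alarm.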

3.1. Data Gathering

The data were obtained during the authors’ participation in HDD projects, as well as visits and discussions with HDD contractors. The database included data from HDD projects completed by HDD contractors in various countries (Poland, Mexico, Australia, Thailand, the Netherlands, Bulgaria, Saudi Arabia, and Russia). This made it possible to gather professional experience and feedback from HDD projects carried out in various countries and to avoid dependence on a single country and its specific geotechnical conditions, ultimately allowing the development of a model suitable for worldwide use. The data from 133 HDD projects (84 MINI, 9 MIDI, and 40 MAXI HDD) do not form a large data set, but due to the specificity of the HDD industry more data could not be obtained: data from individual projects are not widely available, a single complex or MAXI HDD installation can last for many months, and collecting data on 22 unwanted events and 145 installation attributes is very time consuming. However, the gathered data set turned out to be sufficient to develop and verify risk assessment models, as reflected in the obtained results.

Unwanted events in HDD installations were identified based on the analysis of surveys conducted in five different countries and are the same as described in the author’s previous work [17]. Table 1 shows the list of unwanted events and their symbols. The attributes of HDD installations were identified based on scenario analysis and the information obtained during brainstorming sessions and meetings with experienced HDD contractors, owners, as well as manufacturers of drill rigs, drill rods, steering systems, drilling fluids, and product pipes. Moreover, observations of various HDD installations in progress were also valuable for identifying the attributes. In addition to the basic attributes that characterize a given HDD installation, such as pipeline diameter, borehole length, maximal depth etc., the attributes also included detailed information about the installation, such as the number of test holes, the depth of the geotechnical tests carried out, parameters related to the experience and certification of the designer, driller, chief superintendent engineer and supervisor, as well as the most important risk mitigation actions planned to be used (e.g., drilling fluid additives, trial drilling, emergency procedures). Due to its length, Table A1 with the 145 attributes of HDD installations has been included in Appendix A.

3.2. Data Profiling

The factors differentiating the individual projects from which the data were derived are presented in Table 2. The analysed installations differed in terms of geographic area, specificity of area, geometric parameters of the drilling trajectory, installation size, pipe material, the type of installed utility, the type of steering system, ground conditions, season, and were carried out with the use of various machines and devices. Therefore, their geographic, technological, geotechnical and equipment diversity was clearly visible.

Table 3 shows the structure of the analysed data set in terms of the number of occurrences of particular unwanted events ($e_{1}$–$e_{22}$). The analysis of data in Table 3 shows that the events $e_{12}$, $e_{7}$, $e_{10}$, $e_{11}$, $e_{9}$, $e_{21}$ were the least frequent in the collected data set. This is consistent with the results of the survey carried out in the author’s previous work [16], according to which low frequency of occurrence was also obtained for these events. In the case of other events, the proportion between the number of installations, in which those events occurred and did not occur, is satisfactory.

Pearson’s correlation coefficient was applied to identify correlated dimensions, both between unwanted events and features and between the features themselves. Careful analysis of the correlated dimensions made it possible to find recurring regularities in the analysed HDD projects and to identify the causes of the correlations.
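The correlation screening described here can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy columns are hypothetical, and the 0.85 cut-off is the one quoted in Section 3 for selecting events suitable for logistic regression.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / (spread_x * spread_y)


def strongly_correlated(features, event, threshold=0.85):
    """Return the features whose |r| with the event exceeds the threshold."""
    result = {}
    for name, column in features.items():
        r = pearson(column, event)
        if abs(r) > threshold:
            result[name] = r
    return result


# Hypothetical toy data: six installations, one event, two binary features.
event = [0, 0, 1, 1, 0, 1]
features = {
    "feature_a": [0, 0, 1, 1, 0, 1],  # moves together with the event
    "feature_b": [1, 0, 1, 0, 1, 0],  # only weakly related to the event
}
selected = strongly_correlated(features, event)
```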

Principal Component Analysis (PCA) was used to find patterns in the high-dimensional data in order to argue that failure-free installations can be distinguished from failed ones. It is a data analysis tool that works on the whole data set (a matrix of all dimensions and all data samples). It allows visualizing the dominant data patterns and is widely used for data simplification, dimensionality reduction, outlier detection and classification.
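To make the idea concrete, the sketch below extracts only the first principal component of a small data matrix using power iteration on the covariance matrix. It is an assumption-laden illustration: the paper does not state how PCA was computed, the function name and the toy data are hypothetical, and a production analysis would normally rely on a numerical library.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the dominant principal component.

    rows: list of samples, each a list of numeric feature values.
    Power iteration is used; it assumes the start vector is not
    orthogonal to the dominant eigenvector of the covariance matrix.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # all-zero covariance: the data have no variance
        v = [x / norm for x in w]

    # Projection of every centred sample onto the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical toy data lying exactly on the line y = 2x.
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(toy)
```

Plotting the scores of the first two components for all installations is one common way to check visually whether failure-free and failed installations form separable groups.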

3.3. Data Preprocessing

The data has been preprocessed. Due to the difficulties in obtaining data, it was not possible to obtain information on all features for all cases. In the case where the number of shifts was unknown, one shift was adopted as this is the most common in HDD drilling. In the cases of HDD installations where the percentage of clay–sized particles and plasticity index were not tested (e.g., no geotechnical tests), the values of 50% and 50 were assumed for these parameters, as they were the maximum, most pessimistic values in the data set. In the case of HDD projects for which a preliminary risk assessment was not carried out, the value of the parameter “risk level” (in a scale of 1–5) was not known. In such cases the maximum risk level of 5 was assumed, i.e., the maximum, most pessimistic values in the data set.
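The imputation rules above can be expressed as a small lookup of default values. The field names below are hypothetical placeholders (the paper identifies attributes by symbols listed in Table A1); the default values are the ones stated in the text.

```python
# Defaults stated in the text; the keys are hypothetical field names.
IMPUTATION_DEFAULTS = {
    "number_of_shifts": 1,      # most common value in HDD drilling
    "clay_percent": 50.0,       # maximum (most pessimistic) value in the data
    "plasticity_index": 50.0,   # maximum (most pessimistic) value in the data
    "risk_level": 5,            # maximum on the 1-5 scale
}


def impute(record):
    """Return a copy of the record with missing values (None) filled in."""
    filled = dict(record)
    for field, default in IMPUTATION_DEFAULTS.items():
        if filled.get(field) is None:
            filled[field] = default
    return filled


# Hypothetical installation without geotechnical tests or risk assessment.
raw = {"number_of_shifts": 2, "clay_percent": None,
       "plasticity_index": None, "risk_level": None}
clean = impute(raw)
```

Filling gaps with the most pessimistic value is a conservative choice: an installation for which no tests were made is treated as if the tests had returned the worst observed result.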

Because the values of certain parameters (pipeline diameter, bore length, maximum depth, percentage of clay-sized particles, plasticity index, number of working hours) had a large dispersion, the z-score technique was used to make the model results independent of large absolute values. The z-score is a popular standardization technique that is widely used in machine learning applications. It indicates how far the analysed value is from the mean, measured in standard deviations. For example, for $x_{2}$, the z-score is defined by the formula:

$$z(x_{2}) = \frac{x_{2} - \bar{x}_{2}}{\hat{x}_{2}} \qquad (1)$$

where:

$z(x_{2})$: value of $x_{2}$ after applying the z-score
$\bar{x}_{2}$: mean of $x_{2}$
$\hat{x}_{2}$: standard deviation of $x_{2}$
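Equation (1) can be applied to a whole column of values as in the following minimal sketch. The function name and the example values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say whether the population or the sample estimate was used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a column of values according to Equation (1)."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


# Hypothetical values of one attribute for five installations.
column = [100.0, 200.0, 300.0, 400.0, 500.0]
standardized = z_scores(column)
```

After standardization the column has zero mean and unit standard deviation, so attributes measured in very different units contribute on a comparable scale.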

To prepare the data for use with machine learning models, the categorical values were converted to the one-hot vector representation, as shown in Figure 2. The primary purpose of this transformation was to encode the categorical values in an appropriate numerical form. Therefore, an individual dimension was introduced for each categorical value. In the case of the $x_{1}$ attribute, three dimensions were introduced depending on the installation size: $x_{1} = MINI$, $x_{1} = MIDI$ and $x_{1} = MAXI$. For example, if the analysed installation was of the MINI size, the dimension $x_{1} = MINI$ takes the value 1, and the remaining dimensions take the value 0. Vectorization was performed in the same way for attributes such as steering system, pipe material, obstacle being crossed, and the type of the area. That is why the overall number of data dimensions is larger than the actual number of installation attributes.
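The encoding of the installation size can be sketched as follows. The helper name and the `attribute=value` key format are illustrative assumptions; only the three categories MINI, MIDI and MAXI come from the text.

```python
def one_hot(attribute, value, categories):
    """Encode one categorical value as a dictionary of 0/1 dimensions."""
    if value not in categories:
        raise ValueError(f"unknown category {value!r} for {attribute}")
    return {f"{attribute}={category}": int(value == category)
            for category in categories}


INSTALLATION_SIZES = ("MINI", "MIDI", "MAXI")

# A MINI installation sets exactly one of the three new dimensions to 1.
encoded = one_hot("x1", "MINI", INSTALLATION_SIZES)
```

Each categorical attribute therefore contributes as many dimensions as it has distinct values, which explains why the number of dimensions exceeds the number of attributes.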

3.4. Logistic Regression Model

Linear regression allows making a very simple prediction in situations where there is a linear correlation between the installation attributes and the expected outcome. In this work a regression model was applied to events that show a significant correlation with installation dimensions. The output of linear regression may take values in the range $(-\infty, +\infty)$, while a probability lies in the range $[0, 1]$. Logistic regression was therefore used to convert the linear value to a probability. The resulting probability needs to be mapped to the final binary outcome (failure-free or likely to fail) using a threshold value ($t$). For the logistic regression model, the value of $t$ was experimentally selected as 0.5.
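The two steps described above, squashing a linear score into a probability and thresholding it at $t = 0.5$, can be sketched as follows. The weights, the bias and the convention that a probability equal to the threshold counts as "likely to fail" are hypothetical choices made for illustration; fitting the weights to data is outside the scope of this sketch.

```python
from math import exp


def sigmoid(score):
    """Map a real-valued linear score to a probability in (0, 1)."""
    if score >= 0:
        return 1.0 / (1.0 + exp(-score))
    e = exp(score)  # numerically stable branch for negative scores
    return e / (1.0 + e)


def predict(features, weights, bias, t=0.5):
    """Return (probability, label); label 1 means "likely to fail"."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(score)
    return probability, int(probability >= t)


# Hypothetical one-feature model: weight 2.0, bias -1.0.
p_high, label_high = predict([1.0], [2.0], -1.0)  # linear score = 1.0
p_low, label_low = predict([0.0], [2.0], -1.0)    # linear score = -1.0
```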

3.5. Random Forest Model

A decision tree is a classifier that tries to determine the importance of the analysed features and their impact on the classification results. However, decision trees have several problems, such as sensitivity to the form of the training data (e.g., a change in the order of the data may lead to different results). Moreover, subsequent branches of a decision tree carry an increasing degree of uncertainty, because they are created on the basis of ever smaller data subsets [37]. Random forests are often applied to mitigate such problems. A random forest combines the results of multiple trees, each created on a randomly selected subset of the training data. In this work, random forests were prepared separately for each of the 21 unwanted events and for the result of the entire HDD installation. The maximum tree depth was experimentally chosen to be 4, as the best results were obtained at this depth. The goal of each tree was to find a division of the data set, on the basis of the given features, that yields the most uniform subsets (e.g., the vast majority of the data in one subset are installations in which the analysed event occurred, or the vast majority of the data in another subset are installations in which the analysed event did not occur). This made it possible to detect dependencies present in the training set, even if the connection was accidental. Figure 3 and Figure 4 show how decisions are made based on an exemplary decision tree that was used.
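The two ingredients that distinguish a random forest from a single tree, training each tree on a random subset of the data and combining the trees' outputs, can be sketched as below. This is a simplified illustration under stated assumptions: sampling with replacement (bootstrap) and hard majority voting are common choices but are not confirmed by the text, and the function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) samples with replacement (assumed sampling scheme)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Combine per-tree predictions into one label by majority voting."""
    return Counter(votes).most_common(1)[0][0]


# Each tree would be trained on its own random sample of the installations.
rng = random.Random(0)
installations = list(range(10))  # stand-ins for ten training installations
sample_for_one_tree = bootstrap_sample(installations, rng)

# Hypothetical predictions of five trees for one installation
# (1 = event "occurred", 0 = "did not occur").
forest_prediction = majority_vote([1, 0, 1, 1, 0])
```

Because every tree sees a slightly different sample, an accidental regularity picked up by one tree is less likely to dominate the combined result.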

Figure 3 shows an example tree developed for the event $e_{19}$. It is only an example, because a random forest can contain many trees generated for the same event. When analysing the tree structure it can be seen that the parameter $x_{140}$, “the exchange rate in the case of carrying out works abroad and paying for materials and equipment in foreign currency”, divides the input data set into two subsets. When the exchange rate was ordinary (=0), the event $e_{19}$ occurred in three installations and did not occur in 75 installations. When the exchange rate was high (=1), the event $e_{19}$ occurred in 18 installations. Analysing the left branch, it can be seen that when Occupational Health and Safety procedures for the ballasting system ($x_{65}$) were not prepared (=0), the event $e_{19}$ did not occur in 75 installations and occurred in 2 installations. When they were prepared, the event $e_{19}$ occurred once, so the event $e_{19}$ was classified by the tree as “occurred”. When the drilling rig was not equipped with a full automation system ($x_{105} = 0$), the event $e_{19}$ occurred twice and did not occur in 20 projects. If the drilling rig was equipped with a full automation system, the event $e_{19}$ did not occur in any of the 55 analysed projects. When the drilling depth ($x_{4}$) exceeded 2.9 m, the event $e_{19}$ did not occur in 11 projects; if the depth of 2.9 m was not exceeded, it occurred twice and did not occur in 9 installations, so the event $e_{19}$ was classified by the tree as “did not occur”. Analysing the right branch, it can be seen that when the number of site investigation methods ($x_{71}$) was 0 (e.g., in a rural area), the event $e_{19}$ did not occur in the nine analysed installations. If it was at least 1, the event occurred 18 times and did not occur once. When the applied materials were not certified ($x_{118}$), the event $e_{19}$ occurred in 18 installations, so the event $e_{19}$ was classified by the tree as “occurred”. If they were certified, it did not occur, so the event $e_{19}$ was classified by the tree as “did not occur”. The presented numerical values refer to the training data sample.

Figure 4 shows an example tree developed for the event

e_{22}

. When analysing the tree structure it can be seen that the parameter

x_{140}

“the exchange rate in the case of carrying out works abroad and paying for materials and equipment in foreign currency” divides the input data set into two subsets. In the case where the exchange rate was ordinary (=0), the event

e_{22}

occurred in 10 installations and did not occur in 68 installations. In the case when exchange rates were high (=1) it occurred in 27 installations and did not occur in 1 installation. Analysing the left branch it can be seen that in the case when the drilling rig was equipped with protection system against failures (

x_{40} = 1

), the event

e_{22}

did not occur in 48 installations and occurred in 4 installations, so the event

e_{22}

was classified by the tree as “did not occur”. If the rig was not equipped with it (

x_{40} = 0

), the event

e_{22}

did not occur in 20 installations and occurred in six installations. If the cheapest contractor was chosen (

x_{133} = 1

), the event

e_{22}

occurred in all analysed six installations, so the event

e_{22}

was classified by the tree as “occurred”. If the cheapest contractor was not chosen (

x_{133} = 0

), the event

e_{22}

did not occur in all analysed 20 installations, so the event

e_{22}

was classified by the tree as “did not occur”. Analysing the right branch it can be seen that in the case when geotechnical investigations were carried out at least to the maximum depth of the drilling (

x_{73} = 1

), the event

e_{22}

did not occur in one installation and occurred in 27 installations. If geotechnical investigations were not carried out at least to the max depth of the drilling (

x_{73} = 0

), the event

e_{22}

occurred in all analysed 24 installations, so the event

e_{22}

was classified by the tree as “occurred”. When gyro steering system was used (

x_{15} = GYRO

) the event

e_{22}

did not occur in 1 installation, so the event

e_{22}

was classified by the tree as “did not occur”. If gyro system was not applied (

x_{15} \neq GYRO

), the event

e_{22}

occurred in all three analysed installations, so the event

e_{22}

was classified by the tree as “occurred”. The presented numerical values refer to the training data sample.
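The decision path of the example tree for e22 can be written out as plain rules. The following is a minimal sketch, assuming only the splits and leaf labels stated in the description above; the function name `classify_e22` and its argument names are illustrative and do not come from the authors' implementation. It covers this single example tree, not the whole forest.

```python
def classify_e22(x140: int, x40: int, x133: int, x73: int, x15: str) -> str:
    """Follow the example tree for event e22 as described in the text.

    x140: exchange rate (0 = ordinary, 1 = high)
    x40:  rig has a protection system against failures (1) or not (0)
    x133: the cheapest contractor was chosen (1) or not (0)
    x73:  geotechnical investigations reached the max drilling depth (1) or not (0)
    x15:  steering system type, e.g. "GYRO"
    """
    if x140 == 0:  # ordinary exchange rate
        if x40 == 1:  # protected rig: leaf labelled "did not occur"
            return "did not occur"
        # unprotected rig: the contractor selection decides
        return "occurred" if x133 == 1 else "did not occur"
    # high exchange rate
    if x73 == 0:  # investigations did not reach the maximum depth
        return "occurred"
    # investigations reached the maximum depth: the steering system decides
    return "did not occur" if x15 == "GYRO" else "occurred"


print(classify_e22(x140=0, x40=0, x133=1, x73=1, x15="walkover"))
```

A random forest combines many such trees, each trained on a different sample of the data, and takes a vote over their outputs.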

3.6. ANN Model

Neural networks were inspired by the biological information processing that takes place in the brain. The proposed ANN automatically learns to classify HDD projects in terms of the occurrence of unwanted events. It requires a sufficient number of training data samples (HDD installations) to generalize the patterns occurring in them. An ANN consists of several layers. The basic building block of a network is a neuron, which performs elementary information processing. The network consists of a series of neurons connected into successive layers and additional auxiliary layers.

Figure 5 presents the artificial neural network architecture that was developed to classify HDD projects. The input layer contains 178 neurons, which corresponds to the number of dimensions of an HDD installation after applying one–hot vectorization to its attributes. The next layer is dense, which means that each of its nodes is connected to every node of the previous layer. The exponential linear unit was used as the activation function for neurons in this layer. To prevent overfitting, the next layer is a dropout unit, which ignores part of the information from the neurons and consequently allows better generalization. The next layer is dense and consists of 23 neurons (output layer), each responsible for an individual event and equipped with a sigmoid activation function. Due to the binary nature of the event (“occurred”/“did not occur”), the output is passed to a classification threshold function, as in the case of logistic regression. The structure of the network and its hyper-parameters, such as the number of neurons in the hidden layer, the dropout value, and the classification threshold, were selected experimentally. Model training was carried out using the first-order gradient-based stochastic optimization algorithm based on adaptive estimates of lower-order moments, commonly known as “Adam”. The learning was aimed at minimizing the binary cross-entropy function.
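The inference pass of such a network can be sketched in a few lines of plain Python. This is an illustration only: the input and output sizes (178 and 23) come from the text, while the hidden layer size, the threshold of 0.5, the random weights and all function names are assumptions, since the paper states that these hyper-parameters were selected experimentally without listing them here. Training with Adam and binary cross-entropy is not shown.

```python
import math
import random

N_INPUTS = 178   # from the text: one-hot encoded installation attributes
N_HIDDEN = 64    # assumption: the actual value was selected experimentally
N_OUTPUTS = 23   # from the text: one output neuron per predicted event
THRESHOLD = 0.5  # assumption: the actual value was selected experimentally


def elu(z: float, alpha: float = 1.0) -> float:
    """Exponential linear unit activation."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


def sigmoid(z: float) -> float:
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: every output neuron sees every input."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Small random weights as placeholders for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(x, hidden, output, threshold=THRESHOLD):
    """Inference pass; dropout is inactive at inference, so it is omitted."""
    h = dense(x, hidden[0], hidden[1], elu)
    probabilities = dense(h, output[0], output[1], sigmoid)
    # 1 = "occurred", 0 = "did not occur", one flag per event
    return [1 if p >= threshold else 0 for p in probabilities]


rng = random.Random(0)
hidden_layer = init_layer(N_INPUTS, N_HIDDEN, rng)
output_layer = init_layer(N_HIDDEN, N_OUTPUTS, rng)
installation = [rng.choice([0.0, 1.0]) for _ in range(N_INPUTS)]
flags = predict(installation, hidden_layer, output_layer)
print(len(flags))
```

Because every output neuron has its own sigmoid, the events are predicted independently of each other, which fits a setting where several unwanted events can occur in the same installation.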

3.7. Models Evaluation

Models evaluation was conducted in such a way that it validates the quality of the binary classification of HDD installations (“occurred”/“did not occur”) in relation to the selected risk events. Table 4 presents the typical confusion matrix that was a basis for calculations of the metrics.

True Positive (TP) depicts the number of correctly predicted HDD installations for which a certain risk event occurred. False positive (FP) represents the number of HDD installations for which a certain risk event was predicted but it did not occur. False Negative (FN) depicts the number of HDD installations for which a certain risk event occurred although it was not predicted. True Negative (TN) represents the number of HDD installations for which a certain risk event was not predicted and did not occur. In this work recall was defined as the ratio of the number of HDD projects for which a certain event was correctly predicted as “occurred” to the total number of the projects in which a certain event really occurred:

\mathrm{recall} = \frac{TP}{TP + FN}

(2)

It describes the ability of the system to properly classify occurrence of a certain event in HDD projects, but does not consider the number of projects in which a certain event did not occur. Precision was calculated as the ratio of the number of HDD projects for which a certain event was correctly predicted as “occurred” to the total number of the HDD projects in which a certain event was predicted as “occurred”:

\mathrm{precision} = \frac{TP}{TP + FP}

(3)

It describes the ability of the system to correctly predict the occurrence of a certain event, but does not include cases that are classified as “did not occur”. Accuracy was defined as the ratio of the number of HDD installations in which a certain event was correctly predicted by the system, to the total number of HDD projects:

\mathrm{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}

(4)

f_{1}

is defined as a harmonic mean of precision and recall for a certain risk event being analysed.

f_{1} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

(5)
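Equations (2)–(5) translate directly into code. The sketch below is illustrative: the function name `classification_metrics` and the confusion-matrix counts in the usage line are hypothetical and are not taken from the paper's tables.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute recall, precision, accuracy and f1 from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # An undefined ratio (zero denominator) is reported as NaN.
        return numerator / denominator if denominator else float("nan")

    recall = ratio(tp, tp + fn)                      # Equation (2)
    precision = ratio(tp, tp + fp)                   # Equation (3)
    accuracy = ratio(tp + tn, tp + fp + fn + tn)     # Equation (4)
    f1 = ratio(2 * precision * recall, precision + recall)  # Equation (5)
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts for a single event in a test set of 40 installations.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=28)
print(metrics)
```

Returning NaN rather than zero for an undefined ratio keeps cases such as an event that never occurs in the test set visibly distinct from a genuinely poor score.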

The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate over the full range of possible classification thresholds (t). The AUC score is the area under this curve. It depicts the correctness of the classification regardless of the adopted threshold.
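As an illustration of how the ROC curve and the AUC score arise from sweeping the threshold t, the sketch below builds the curve from predicted probabilities and integrates it with the trapezoidal rule. The function names and the sample labels and scores are hypothetical; the sketch assumes that both classes are present in the labels.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs for all thresholds."""
    positives = sum(y_true)
    negatives = len(y_true) - positives  # both counts are assumed to be non-zero
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score,
        # so that tied scores are handled as one threshold step.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = "occurred") and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
probabilities = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(labels, probabilities))
print(round(area, 3))
```

An AUC of 1 means that some threshold separates the two classes perfectly, while 0.5 corresponds to a ranking no better than chance.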

4. Experimental Results

4.1. Correlations

Pearson’s correlation coefficient was applied to identify correlated dimensions. Correlated dimensions (between unwanted events and features, as well as between features themselves) and their Pearson’s correlation coefficients are presented in Table 5. Careful analysis of the correlated dimensions made it possible to find recurring regularities in the analysed HDD projects and to identify the causes of the correlations. In good HDD contracts attention was paid to the engagement of both a certified supervisor (

x_{103} = 1

) and superintendent engineer (

x_{101} = 1

), as suggested in [38]. When a mud motor was applied, contractors were aware of the need to carry out periodical inspections (

x_{43} = 1

) and knew the good practice of changing its elastomeric elements after each downhole trip (

x_{44} = 1

). Bore hole collapse often occurred (

e_{15} = 1

) if the drilling crossed any sand layer with homogeneous grain size distribution (

x_{81} = 1

), as it was stated in [39]. Improper calculations for the investment often occurred (

e_{22} = 1

) when the cheapest contractor was chosen without paying particular attention to quality of the services offered (

x_{133} = 1

). If the HDD designer did not have suitable knowledge and experience (

x_{7} = 0

), problems related to improper calculations of loads and stresses during the installations (

e_{1} = 1

), as well as omitting to consider the allowable bending radius of the drill or product pipe, occurred (

e_{2} = 1

). Problems with supply and quality usually did not happen (

e_{19} = 0

) if a supplier was certified (

x_{116} = 1

) and all materials had adequate corresponding quality certificates (

x_{118} = 1

). Preliminary estimated risk level for the HDD project (

x_{136}

) increased with a decrease in the no. of site investigation methods applied (

x_{71}

), no. of working shifts (

x_{106}

), with proximity of backfills which could act as a drainage (

x_{121}

), if works were carried out in spring (

x_{128}

) or winter (

x_{131}

) which posed many risks (floods, low temperatures, strong winds), if no additives to the drilling fluids reducing collapse risks were used (

x_{144}

) and if other steering systems than gyroscope were used (

x_{15} \neq GYRO

). Some dimensions were randomly correlated.

Due to the small amount of data, the correlated dimensions were not eliminated. There was no need to optimize the calculation speed due to the small group of data.
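For reference, Pearson’s correlation coefficient between two dimensions can be computed as in the sketch below. The sample vectors are hypothetical binary values for a feature and an event (for instance, whether the cheapest contractor was chosen and whether improper calculations occurred); they are not data from the analysed projects, and the function name `pearson` is illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant dimension
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical binary feature and event values for eight installations.
feature = [1, 1, 1, 0, 0, 0, 0, 0]
event = [1, 1, 0, 0, 0, 0, 0, 1]
r = pearson(feature, event)
print(round(r, 3))
```

For two binary dimensions this value coincides with the phi coefficient, so the same function covers feature–event and feature–feature pairs.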

4.2. PCA

Figure 6 presents the results of applying the PCA method to the whole data set. It illustrates the three dominant PCA components. The analysis showed that failure–free installations were clustered close to each other, which indicates that the two groups of installations differ in the data set. This made it possible to discriminate between those two subsets using machine learning models. As a result of the analysis, four significant outliers were identified. The main reason these installations differed from the others was a very large borehole diameter or borehole length. It should be noted that long and large-diameter HDD installations are usually more complex and technologically complicated than MINI and MIDI HDD installations.
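The idea behind PCA can be illustrated with a small, dependency-free sketch that finds the first principal component by power iteration on the covariance matrix. This is not the authors' implementation: the function names and the toy data are assumptions, only the dominant component is computed (further components would require deflation or a full eigendecomposition), and the fixed start vector is assumed not to be orthogonal to the dominant direction.

```python
import math


def center(rows):
    """Subtract the column means from every row."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already centred rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def dominant_component(cov, iterations=200):
    """Unit vector along the direction of largest variance (power iteration)."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            break  # no variance in the data
        v = [c / norm for c in w]
    return v


def project(rows, component):
    """Coordinates of the centred rows along one component."""
    return [sum(a * b for a, b in zip(row, component)) for row in rows]


# Toy data: two perfectly correlated dimensions.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centred = center(data)
pc1 = dominant_component(covariance(centred))
scores = project(centred, pc1)
print(pc1)
```

Plotting the scores along the first few components is what reveals clusters and outliers of the kind described above.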

4.3. Logistic Regression

Because the calculated correlation coefficients showed that some events are closely correlated with some features, it was possible to estimate the occurrence of some unwanted events using logistic regression. Table 6 shows the evaluation of the logistic regression model using recall, precision, accuracy, f1 and AUC score.

The proposed approach using logistic regression shows quite good recall, precision, accuracy,

f_{1}

and AUC score values in a test dataset. For the events

e_{9}

,

e_{10}

,

e_{11}

,

e_{19}

a full compliance with the test set was achieved. The evaluation results for

e_{15}

and

e_{5}

are poor. For the rest it is satisfactory. The poor recall values for

e_{15}

and

e_{5}

indicate that the system properly classifies those events when they did not occur and poorly when they occurred. It indicates that for these events one should look for other, more effective methods of classification.

4.4. Random Forests Model

Table 7 shows the evaluation of a random forests model using recall, precision, accuracy,

f_{1}

and AUC score.

The proposed approach using random forests shows quite good recall, precision, accuracy, f1 and AUC score values. For the events

e_{7}

,

e_{9}

,

e_{10}

,

e_{11}

,

e_{16}

,

e_{19}

,

e_{22}

and the final outcome of the HDD project (depicted as OK) full compliance with the test set was achieved. The event

e_{12}

occurred only twice in the input file, which made it impossible to train the forest correctly. Additionally, the test set did not contain any case in which this event occurred. It should be added that the frequency of this event occurrence assessed thanks to the survey described in the author’s previous work [16] was only 2% in the analysed 5940 HDD projects. The worst results were obtained for the event

e_{21}

. This is because this event is related to weather conditions, which is a specific parameter, so the random forest was unable to learn the correct prediction. It should be added that the frequency of this event’s occurrence, assessed in the survey described in the author’s previous work [16], was 7% (for severe weather conditions) and 2% (for flood) in the analysed 5940 HDD projects. For the events

e_{1}

,

e_{2}

,

e_{4}

,

e_{3}

,

e_{6}

recall ranging from 0.600 to 0.667 was obtained, which leaves room for improvement. For the remaining events, satisfactory compliance with the test set was obtained at the level of 0.750–1.000. The presented model evaluation results show that the proposed method is satisfactory, but could be significantly improved using more advanced classification methods.

4.5. ANN Model

Figure 7 shows the learning history of the proposed ANN. The chart shows that the loss decreases as the number of epochs increases, while the applied metrics (precision, recall, accuracy, and AUC score) improve systematically. Ultimately, the number of epochs was set to 100 to prevent overfitting the network.

Table 8 presents performance results of the proposed ANN model in a test dataset.

The proposed approach using ANN shows very good recall, precision, accuracy,

f_{1}

and AUC score values. For the events

e_{1}

,

e_{2}

,

e_{7}

,

e_{9}

,

e_{10}

,

e_{11}

and the final outcome of the HDD project (depicted as OK) full compliance with the test set was achieved. The event

e_{12}

occurred only twice in the input file, which made it impossible to train the network correctly, similarly to the case of random forests. Additionally, the test set did not contain any case in which this event occurred. The worst results were produced for the event

e_{21}

. This is because this event is related to weather conditions, which is a specific parameter, so the network was unable to learn the correct prediction. For the event

e_{3}

, recall of 0.667 was obtained, and for

e_{16}

, precision of 0.667 was obtained. For the events

e_{4}

,

e_{14}

and

e_{20}

recall around 0.8 was achieved, but in those cases high accuracy was obtained (≥0.926). This means that in these cases the system is worse at detecting that a given event occurred than that it did not. For the remaining events, very high compliance with the test set was obtained at the level from 0.857 to 1.000. The presented model evaluation results show that the proposed method is effective in predicting the overall HDD project outcome, as well as the occurrence of all 21 identified sub–risks (“occurred”/“did not occur”).

In Figure 8, Figure 9, Figure 10 and Figure 11, ROC curves for the chosen unwanted events are presented. For the final outcome of the project (depicted as OK), it can be seen that a very good result was achieved, because regardless of the selected classification threshold, the proposed ANN model correctly classifies HDD installations. Figure 9, Figure 10 and Figure 11 show the most interesting ROC curves, where the AUC score is less than 1, which means that the classification depends to some extent on the selected classification threshold. Despite the fact that the curves presented in Figure 9, Figure 10 and Figure 11 indicate the poorest results among all obtained, they are satisfactory, because obtained AUC scores for them are more than 0.9. The worst results were obtained for event

e_{21}

, which was related to unfavourable weather conditions. The fault that occurs in this plot (in Figure 11) shows that in certain situations the proposed model may inaccurately classify the event

e_{21}

as “occurred”.

5. Discussion of the Results

In the case of risk assessment, it is most important to minimize the cases where a particular event was not predicted as “occurred” but actually occurred. Such situations belong to the “false negative” group in Table 4. By contrast, if a risk assessment system predicts that a given event will occur and in fact it does not, the consequences, such as introducing risk mitigation strategies, are less serious than in the case of not predicting this event. Not predicting the occurrence of a given event (which is in fact likely to occur) may lead to not introducing any risk mitigation strategies, finally resulting in the actual occurrence of this event in the HDD project. Therefore, from the point of view of risk assessment in HDD projects, the most important indicators are those that take into account the “false negative” group, thus recall and accuracy should be analysed first.
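A short worked example may help to show why recall deserves attention alongside accuracy. The counts below are hypothetical and are not taken from the paper's results: two imagined models, A and B, are evaluated on the same 30 installations, of which 5 experienced the event.

```latex
% Hypothetical counts, not results from the paper.
% Model A misses events:        TP = 2, FN = 3, FP = 0, TN = 25
% Model B raises false alarms:  TP = 5, FN = 0, FP = 3, TN = 22
\mathrm{accuracy}_{A} = \frac{2 + 25}{30} = 0.900, \qquad
\mathrm{recall}_{A} = \frac{2}{2 + 3} = 0.400

\mathrm{accuracy}_{B} = \frac{5 + 22}{30} = 0.900, \qquad
\mathrm{recall}_{B} = \frac{5}{5 + 0} = 1.000
```

Both models reach the same accuracy, yet model A leaves three occurrences unmitigated, whereas model B only triggers three unnecessary mitigation actions. Recall separates the two cases; accuracy alone does not.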

To further discuss the achieved results and to compare the effectiveness of the proposed models predicting unwanted events’ occurrence in HDD projects, the authors carried out a comparative analysis regarding the predicting performance between the proposed models. Figure 12, Figure 13, Figure 14, Figure 15 and Figure 16 clearly present the comparison of the results of three applied machine learning models in terms of recall, precision, accuracy,

f_{1}

and AUC score. Analysing Figure 12, Figure 13, Figure 14, Figure 15 and Figure 16, it can be concluded that the best prediction method for HDD projects is the proposed ANN model. Random forests are second, and logistic regression, with which only eight events could be predicted, is third. The proposed ANN model outperforms the rest of the models in terms of recall for seven events’ prediction, precision for one event’s prediction, accuracy for six events’ prediction,

f_{1}

for seven events’ prediction and AUC score for 14 events’ prediction. The proposed ANN model turned out to produce the best results (or no better result was obtained by other methods) in terms of recall for the prediction of 21 events (

e_{1}

,

e_{2}

,

e_{3}

,

e_{4}

,

e_{5}

,

e_{6}

,

e_{7}

,

e_{8}

,

e_{9}

,

e_{10}

,

e_{11}

,

e_{13}

,

e_{15}

,

e_{16}

,

e_{17}

,

e_{18}

,

e_{19}

,

e_{20}

,

e_{21}

,

e_{22}

and a final outcome of the HDD project depicted as OK). The developed ANN model outperformed the rest of the models (or no better result was obtained with other methods) in terms of accuracy for the prediction of 19 events (

e_{1}

,

e_{2}

,

e_{3}

,

e_{4}

,

e_{5}

,

e_{6}

,

e_{7}

,

e_{8}

,

e_{9}

,

e_{10}

,

e_{11}

,

e_{13}

,

e_{14}

,

e_{15}

,

e_{17}

,

e_{18}

,

e_{20}

,

e_{21}

, and a final outcome of the HDD project depicted as OK). It allowed predicting all analysed events with accuracy greater than or equal to 0.926. It also allowed predicting 72.72% of the assessed events with recall greater than or equal to 0.875.

Analysing the performance of three proposed machine learning models for predictions of particular unwanted events’ occurrence, the following conclusions can be drawn. For the events

e_{1}

,

e_{2}

,

e_{4}

and

e_{17}

, the best results for all metrics were obtained for the proposed ANN model (only in terms of precision random forests gave equally satisfactory results). For the final result of the project depicted as OK the best results for all metrics were obtained for both the proposed random forest model and the ANN. For the events

e_{9}

,

e_{10}

,

e_{11}

, the same results were obtained for all the proposed three models in terms of all metrics. For the event

e_{8}

the best results were obtained for the proposed random forest model and ANN in terms of recall, and in terms of precision all the proposed three models gave the same results. For the event

e_{19}

the best results were obtained for all three methods in terms of recall and precision. For the events

e_{5}

and

e_{6}

the best results in terms of recall were obtained for the proposed ANN model, and in terms of precision, for the proposed random forest model. This means that for these events the proposed model of random forests is better at predicting whether the positive identification of an event was actually true. For the event

e_{22}

, the same results were obtained in terms of recall for all three proposed models, while in terms of precision, the logistic regression model was the best, and in terms of accuracy, the random forest model was the best. For the remaining events, the best results were obtained for the ANN model and random forests, with different priorities for individual metrics.

In the previous and only work found in the literature on the application of machine learning in HDD technology [24], a surface heave prediction model was proposed. It can be considered as a risk assessment model for one unwanted event in HDD technology. The models proposed in this paper are novel, as they enable efficient, fast and robust risk predictions for 21 most important unwanted events in HDD technology and the overall project outcome.

6. Conclusions

This study proposes three new models for predicting risks in HDD projects. To develop those models, data from 133 HDD projects from eight countries were gathered, profiled and preprocessed. Three models based on the following machine learning methods were developed and their performance was assessed: logistic regression, random forests and Artificial Neural Network. The developed ANN model demonstrates strong performance in predicting the HDD project outcome, as well as the occurrence of 21 identified unwanted events, despite the relatively small dataset available for learning. It outperforms random forests. The proposed logistic regression model could be applied to properly predict only 8 events.

The results show that the proposed ANN model proved to be the most efficient, fastest, and most robust in predicting risks in HDD projects. Moreover, the running time of the proposed ANN model architecture is much shorter than the time needed to carry out a traditional risk assessment, in which a group of HDD experts must be involved. The proposed approach is an accurate prediction model, as it makes efficient predictions of unwanted events’ occurrence, showing minor deviations between the real and the predicted values. Since the results of risk assessment are not only critical when assessing the project feasibility and making the project costing, but also a starting point for introducing the risk management strategy, this model becomes very useful to accurately determine risk levels for important unwanted events. It is expected that the practical application of the proposed model will lead to the quality improvement of the installed energy transmission infrastructure and a reduction in the number of unsuccessful installations of gas, oil and electricity pipelines. Moreover, it can contribute to avoiding strikes on the existing elements of underground infrastructure of cities, which may lead to fatal accidents (e.g., hitting a gas pipeline or oil pipeline).

This paper is the first one to propose machine learning models dedicated to assessing the overall risk of an HDD project, as well as the occurrence of 21 unwanted events in this technology. The main contributions to the body of knowledge include: collecting essential data from 133 HDD projects from eight countries, identifying 145 attributes relevant for the risk analysis in HDD projects (dimensions of the data set), developing three machine learning models for predicting risk in HDD projects, which allow removing the drawbacks and limitations of the risk assessment models previously described in the literature, identifying recall and accuracy as the most important metrics for risk assessment in HDD projects, and selecting the ANN model as the one with the best performance among the three proposed models.

Additionally, thanks to applying machine learning in the proposed models, the need to engage a group of HDD experts was eliminated. It also contributes to the reduction of the imperfections, with which the traditional risk assessment expert systems have struggled, such as: being based on the opinion of individual experts and their knowledge (not always properly matched to the project size and specificity), lack of required experts’ experience and knowledge, limited project budget not allowing employment of well qualified experts, difficulties in engaging quality industry specialists. All in all, in this work widely available models were developed, for which the costs and problems connected with involving experts are not a barrier. This work is helpful in making the right decision about starting the HDD project. Moreover, it supports creating realistic projects delivery and performance options by HDD owners, engineers and contractors.

Future research is oriented towards integrating the proposed approach into a comprehensive holistic risk management system. Such a system will additionally include the risk response options for individual unwanted events. It will enable a dynamic risk assessment taking into account various combinations of planned risk responses. A similar approach is also planned to be used in the future for risk assessment in various trenchless construction methods and choosing that one with the lowest risk.

Author Contributions

Conceptualization, M.K.; methodology, M.K.; software, A.K.; data gathering: M.K.; data profiling M.K.; data preprocessing M.K.; models’ development: M.K. and A.K.; models’ evaluation: M.K. and A.K.; writing—original draft preparation: M.K. and A.K.; supervision: M.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Nomenclature

HDD	Horizontal Directional Drilling
ANN	Artificial Neural Network
BN	Bayesian network
FMEA	Failure Mode and Effects Analysis
CCTV	Closed-circuit television
AUC	Area Under the Curve
$e_{i}$	i-th unwanted event
$x_{i}$	i-th attribute of HDD installation
OK	the overall project result
$f_{1}$	harmonic mean of precision and recall for a certain risk event
PCA	Principal Component Analysis
TP	True Positive
FP	False Positive
FN	False Negative
TN	True Negative
ROC	Receiver Operating Characteristics
GYRO	optical gyro steering system
t	threshold
$z (x_{i})$	value of $x_{i}$ after applying z–score
$\bar{x_{i}}$	mean
$\hat{x_{i}}$	standard deviation

Appendix A

Table A1. Dimensions of the data set.

Symbol	Features
$x_{1}$	Installation size (MINI, MIDI, MAXI)
$x_{2}$	Pipeline diameter (mm)
$x_{3}$	Bore length (m)
$x_{4}$	Maximum depth (m)
$x_{5}$	Pipeline material
$x_{6}$	Crossed obstacle
$x_{7}$	Does the designer have min 3 years of experience in HDD projects in a certain size (MINI, MIDI, MAXI)?
$x_{8}$	Does the designer have positive references from similar projects (project size, ground condition, natural environment specificity, season)
$x_{9}$	Was the correctness of the calculations of the designer checked using the appropriate computer program (e.g., Horizon, HDD Designer, D–Geo Pipeline)
$x_{10}$	Urban area
$x_{11}$	Post–industrial area
$x_{12}$	Was an assessment of the expected geological conditions used to determine the most appropriate coating for the pipe?
$x_{13}$	Were erosion and corrosion protection coatings designed for steel pipes?
$x_{14}$	In a case of HDPE pipes: was crack and gouge allowance considered?
$x_{15}$	Steering system type (gyro, wireline, walkover)
$x_{16}$	Was the identification of interferences carried out?
$x_{17}$	If yes, did it reveal any interferences?
$x_{18}$	If yes, were interferences temporarily disabled?
$x_{19}$	The wireline length (m)
$x_{20}$	Are spiders planned to be applied in case of the long wireline systems?
$x_{21}$	If yes, is distance between spiders max. 150 m?
$x_{22}$	Is quality wireline coating designed e.g., XHHW (Cross–Linked High Heat Water Resistant Insulated Wire)?
$x_{23}$	If drilling in rock formations—are there any tools chosen to damp transmitters vibrations?
$x_{24}$	Is there any drilling planned in abrasive soils, rocks or cobbles, in which there is large amount of heat transfer from the drill head to the transmitter housing?
$x_{25}$	Is there drilling planned in gravel and the grounds containing boulders, where steering problems or unresponsive steering may occur?
$x_{26}$	The drilling depth in the case of applying the walkover systems (m)
$x_{27}$	Does the manufacturer of steering system have quality certificate ISO 9001 or adequate given by a third party?
$x_{28}$	In the case of applying the walkover system was the time of the drilling and the battery capacity considered in plans?
$x_{29}$	In the case of applying the walkover system is it a problematic crossing of big rivers, rivers with a strong current, highway or railway crossings where is usually a problematic need that the receiver should be positioned directly over the transmitter?
$x_{30}$	Were stress limits defined based on the type and material of the drill pipe, establishing tension and torque limits, defining drilling radii and deviations?
$x_{31}$	Were geotechnical investigations carried out at the planning stage taken into consideration in determining the type of equipment needed?
$x_{32}$	Presence of salty water or acidic soil
$x_{33}$	Is excessive wear anticipated after analysis of the geotechnical conditions (difference with the strength of soil layers or rock, resulting in applying the force only to the part of the reamer)?
$x_{34}$	Were the drilling tools repaired previously?
$x_{35}$	Does the manufacturer of drill tools have quality certificate ISO 9001 or API?
$x_{36}$	Are OHS procedures for drill tool failure prepared?
$x_{37}$	Are the periodical drill rig inspections carried out according to schedule?
$x_{38}$	Was the drill rig previously repaired?
$x_{39}$	If yes, were original parts used for reparation?
$x_{40}$	Does the drill rig have a protection system against failures (e.g., the automatic supervision during standard operation)?
$x_{41}$	Does the manufacturer of drill rig have quality certificate ISO 9001 or adequate given by a third party and is it in conformity with National Machine Guidelines derived from European Machine Guidelines?
$x_{42}$	Are OHS procedures for the case of the drill rig breakdown prepared?
$x_{43}$	Are the mud motor periodical inspections carried out according to schedule?
$x_{44}$	Will the mud motor components that have elastomeric elements be new?
$x_{45}$	Is high solids or sand content in the drilling fluid expected?
$x_{46}$	Was the mud motor previously repaired?
$x_{47}$	If yes, were original spare parts used for mud motor reparation?
$x_{48}$	Does the manufacturer of mud motor have quality certificate ISO 9001 or adequate given by a third party and is mud motor in conformity with National Machine Guidelines derived from European Machine Guidelines?
$x_{49}$	Are OHS procedures for the case of the mud cleaning system breakdown prepared?
$x_{50}$	Was the mud cleaning system previously repaired?
$x_{51}$	If yes, were original parts used for mud cleaning system reparation?
$x_{52}$	Does the manufacturer of mud cleaning system have quality certificate ISO 9001 or adequate given by a third party?
$x_{53}$	Are OHS procedures for the case of the mud cleaning system breakdown prepared?
$x_{54}$	Are the periodical inspections of roller blocks carried out according to schedule?
$x_{55}$	Was any of the planned to use roller block repaired and were original spare parts used? (no, original, not original)
$x_{56}$	Are OHS procedures for the case of the roller blocks breakdown prepared?
$x_{57}$	Does the manufacturer of roller blocks have quality certificate ISO 9001 or adequate given by a third party?
$x_{58}$	Are the periodical inspections of roller cradles carried out according to schedule?
$x_{59}$	Does the manufacturer of roller cradles have quality certificate ISO 9001 or adequate given by a third party?
$x_{60}$	Are OHS procedures for the case of the roller cradles breakdown prepared?
$x_{61}$	Are the periodical inspections of side cranes carried out according to schedule?
$x_{62}$	Does the manufacturer of side cranes have quality certificate ISO 9001 or adequate given by a third party?
$x_{63}$	Are OHS procedures for the case of the side cranes breakdown prepared?
$x_{64}$	Are the periodical inspections of the ballasting system carried out according to schedule?
$x_{65}$	Are OHS procedures for the case of the ballasting system breakdown prepared?
$x_{66}$	Does the manufacturer of ballasting system have quality certificate ISO 9001 or adequate given by a third party?
$x_{67}$	The size of the previously realized installation (1–3 pts., 1–MINI, 2–MIDI, 3–MAXI)
$x_{68}$	The complexity and challenges connected with previously realized installation (0–typical installation, 1–challenging length, diameter for the contractor or challenging grounds)
$x_{69}$	Were there any delays indicated in the references of the contractor from similar projects that have been carried out so far?
$x_{70}$	Does the geotechnical surveying company have references from similar projects (project size, ground condition, natural environment specificity)?
$x_{71}$	In the case of urban or post–industrial areas: the number of site investigation methods used
$x_{72}$	No. of test holes
$x_{73}$	Are geotechnical investigations carried out at least to the max depth of the drilling?
$x_{74}$	Are the geotechnical tests only archival or prepared for the purposes of another project?
$x_{75}$	Were literature research, historical data, interviews with residents carried out?
$x_{76}$	Is an experienced geotechnician (with certificates and references from similar projects) employed to properly interpret the results of the geotechnical survey?
$x_{77}$	Is a trial drilling planned (for MAXI and complex HDD)?
$x_{78}$	For urban and post–industrial areas in which underground infrastructure was identified: is exposing and monitoring of the existing underground infrastructure located close to the planned alignment planned?
$x_{79}$	Are emergency procedures for a utility strike prepared?
$x_{80}$	For urban or post–industrial areas: are plans showing the location of underground utilities available, and was the inspector asked whether all changes in urban infrastructure were put on the map?
$x_{81}$	Is there any sand layer with homogeneous grain size distribution that will be crossed?
$x_{82}$	Is there any layer that consists of pure sands, gravel or loose rock?
$x_{83}$	Does the ground contain oversize materials (cobbles and boulders), heavy, large grains that gravitationally fall to the bore hole bottom?
$x_{84}$	If yes, how many? (0–not many, 1–many, 2–much)
$x_{85}$	Percentage of clay sized particles (smaller than 0.075 mm) (for clay and silt layers) for the layer of max plasticity index
$x_{86}$	Soil plasticity index of a soil sample (for clays and silts)
$x_{87}$	Are there considerable elevation differences between the entry and exit points or points along the alignment?
$x_{88}$	Is there any area situated along the alignment with a depth of cover less than 12 m, or 8.5 borehole diameters, or 2.5 borehole diameters under rivers?
$x_{89}$	Is there an area with significant changes in density or composition of ground conditions that will be drilled through?
$x_{90}$	Will any section be drilled through clean, coarse–grained, permeable soils (e.g., sands or gravels containing less than 12% fines, or fractured rock)?
$x_{91}$	Is there any area where the HDD alignment is close to existing utilities located in backfills, which were filled with trench backfill materials, which could act as a drainage for the drilling fluid?
$x_{92}$	Is strong groundwater inflow indicated in geotechnical survey?
$x_{93}$	Were drilling fluid pressure calculations carried out?
$x_{94}$	If yes, do they indicate drilling fluid seepage?
$x_{95}$	In the case when the 1st section is problematic—is a casing pipe designed to protect the first hole section?
$x_{96}$	Does the driller have min 3 years of professional experience with drill rigs of a designed pulling force?
$x_{97}$	Is the driller certified by a third party for the designed drill rig force (e.g., Drilling Contractors Association, International Society for Drilling Contractors)?
$x_{98}$	Does the contractor’s company have references from similar projects (size, ground conditions, specificity)?
$x_{99}$	No. of working hours
$x_{100}$	Does the chief superintendent engineer have min 3 years of professional experience?
$x_{101}$	Is the chief superintendent engineer certified by a third party for drilling operations with the designed pulling force?
$x_{102}$	Does the supervisor have min 3 years of professional experience in drilling operations with the designed pulling force?
$x_{103}$	Does the supervisor have a certificate given by a third party for drilling operations with the designed pulling force?
$x_{104}$	Is augmented reality planned to be used to increase the drill rig operator’s awareness of underground utilities?
$x_{105}$	Are drilling rigs with full automation of the process planned to be used?
$x_{106}$	No. of shifts
$x_{107}$	Is the pressure test planned before product pipe installation?
$x_{108}$	Is a strain gauge or a load cell planned to be used to measure the stress that the pipe is subjected to during the pullback together with establishing limits?
$x_{109}$	If there is a mud service: is it certified and does it have references from similar projects?
$x_{110}$	Are the Occupational Health and Safety certificates of workers valid?
$x_{111}$	Are OHS procedures planned for the case of accidents caused by improper employee training program, improper operation and maintenance of machines, or improper supervision?
$x_{112}$	Will water for drilling fluid preparation be tested?
$x_{113}$	Will the mixture of drilling fluid and ground be tested?
$x_{114}$	Is pressure module planned to be used?
$x_{115}$	Does the supplier have references from similar projects?
$x_{116}$	Has the supplier introduced ISO 9001 or 14001 Quality Management?
$x_{117}$	Does the contractor have positive previous experience with cooperation with suppliers?
$x_{118}$	Do all of the materials have quality certificates (bentonite, additives to drilling fluid, pipe)?
$x_{119}$	Do all machines and equipment have a corresponding conformity declaration and the associated CE–symbol and safety certificates?
$x_{120}$	Do all materials have valid expiration dates?
$x_{121}$	Is the building site situated close to one of the following areas: environmentally sensitive areas such as wetlands, river banks, intermittent drainage channels, endangered plants, a wildlife habitat or a sensitive habitat; a housing estate subject to special noise requirements; or a contaminated area?
$x_{122}$	In the case of urban or post–industrial areas: was photographic documentation of existing elements of terrestrial infrastructure made?
$x_{123}$	If urban, close to housing estates, or environmentally sensitive areas: are the machines planned to be used equipped with a noise reduction system?
$x_{124}$	Were all the required permits gained?
$x_{125}$	Is there any underground and terrestrial infrastructure or natural habitat that could be damaged and lead to legal claims?
$x_{126}$	Are low temperatures or strong winds, heavy rainfalls or snowfalls expected?
$x_{127}$	Is the project planned in close proximity to rivers where the risk is increased by possible flooding or ice melting in this season?
$x_{128}$	Season–spring
$x_{129}$	Season–summer
$x_{130}$	Season–autumn
$x_{131}$	Season–winter
$x_{132}$	Are OHS procedures for the case of severe weather conditions including evacuation plans prepared?
$x_{133}$	Was the cheapest contractor chosen?
$x_{134}$	Was the quality of works and engineering creativity taken into account when choosing contractors?
$x_{135}$	Was a preliminary risk assessment carried out for this HDD project?
$x_{136}$	If yes, what is the risk level for this project? (1–5 pts.)
$x_{137}$	Was a risk pool included in the project budget?
$x_{138}$	Is the project carried out abroad, or/and are equipment, materials, or salary of HDD crew paid in foreign currency, if yes:
$x_{139}$	the inflation rate (in the country abroad) (0–N/A, ordinary, 1–high)
$x_{140}$	the exchange rate in the case of carrying out works abroad and paying for materials and equipment in foreign currency (0–ordinary, 1–high)
$x_{141}$	has the contractor taken a loan?
$x_{142}$	the interest rate in the case of a loan, (0–N/A, ordinary, 1–high)
$x_{143}$	the inflation rate for contracts carried out in the origin country (0–N/A, ordinary, 1–high)
$x_{144}$	Are there any drilling fluid additives preventing bore hole collapse planned to be used?
$x_{145}$	Are there any drilling fluid additives preventing ground swelling planned to be used?
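
The four season attributes $x_{128}$–$x_{131}$ in the table above are mutually exclusive binary columns, i.e., a one–hot encoding of a single categorical variable. A minimal sketch of that encoding is given below. It assumes the raw season is recorded as a string; the function name `one_hot_season` and the dictionary layout are illustrative and are not taken from the authors' implementation.

```python
# Order follows the attribute numbering in the table: x_128 .. x_131.
SEASONS = ["spring", "summer", "autumn", "winter"]


def one_hot_season(season: str) -> dict[str, int]:
    """Return the four binary season attributes for one project."""
    season = season.lower()
    if season not in SEASONS:
        raise ValueError(f"unknown season: {season!r}")
    # Exactly one of the four attributes is set to 1, the rest to 0.
    return {f"x_{128 + i}": int(name == season) for i, name in enumerate(SEASONS)}


encoded = one_hot_season("Autumn")
print(encoded)  # {'x_128': 0, 'x_129': 0, 'x_130': 1, 'x_131': 0}
```

Splitting the season into four columns avoids implying an order between categories, which a single integer code (1 to 4) would do.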

Figure 1. The proposed approach for predicting the overall HDD project outcome and the occurrence of the identified unwanted events.

Figure 2. One–hot encoding for the $x_{1}$ attribute.

Figure 3. An example tree developed for the event $e_{19}$.

Figure 4. An example tree developed for the event $e_{22}$.

Figure 5. ANN architecture.

Figure 6. The results of PCA analysis for the whole data set.

Figure 7. The history of ANN learning procedure.

Figure 8. ROC curve for the overall project outcome depicted as “OK”.

Figure 9. ROC curve for the event $e_{5}$.

Figure 10. ROC curve for the event $e_{15}$.

Figure 11. ROC curve for the event $e_{21}$.

Figure 12. Comparison of the results of the three applied machine learning methods in terms of recall.

Figure 13. Comparison of the results of the three applied machine learning methods in terms of accuracy.

Figure 14. Comparison of the results of the three applied machine learning methods in terms of precision.

Figure 15. Comparison of the results of the three applied machine learning methods in terms of $f_{1}$.

Figure 16. Comparison of the results of the three applied machine learning methods in terms of AUC score.

Table 1. The list of unwanted events and their symbols.

Symbol	Unwanted Events
$e_{1}$	Incorrect calculations of loads and stresses for the installed pipeline
$e_{2}$	Failure to consider the allowable bend–radius of drill pipes or the installed product pipe
$e_{3}$	Incorrect choice of the external product pipe coating
$e_{4}$	Problems with steering and communications with the drill rig
$e_{5}$	Drill tool breakdown caused by the material’s fatigue
$e_{6}$	Drill rig failure
$e_{7}$	Mud motor failure
$e_{8}$	Mud cleaning system failure
$e_{9}$	Roller blocks breakdown
$e_{10}$	Roller cradles breakdown
$e_{11}$	Side cranes failure
$e_{12}$	Ballasting system failure
$e_{13}$	Downtime in the installation due to lack of required tools and machines
$e_{14}$	Unexpected natural or anthropogenic underground obstacles
$e_{15}$	Borehole collapse
$e_{16}$	Swelling of the ground leading to the drilling pipe or product pipe blockage in the borehole
$e_{17}$	Drilling fluid runoff
$e_{18}$	Contractor’s mistake
$e_{19}$	Quality or supply issues
$e_{20}$	Problems with permissions or legal issues
$e_{21}$	Unfavourable weather conditions
$e_{22}$	Improper cost calculations for the project
OK	The overall project result

Table 2. The factors differentiating the individual projects from which the data were derived and their parameters.

Differentiating Factor	Parameters
Country	Poland, Mexico, Australia, Thailand, the Netherlands, Bulgaria, Saudi Arabia, Russia, Greece
Continents	Europe, Asia, Australia, South America
The type of the area	urban, post–industrial, rural, environmentally sensitive
Maximum depth
Installation size	MINI (84), MIDI (9), MAXI (40)
Pipe diameter (mm)	100–1400
The total length (m)	18–3048
Obstacle being crossed	river, channel, landfill, railway embankment, railroad tracks, water reservoir, street, highway, harbour floor, wetlands, sea, lack of obstacle
Pipe material	steel (105 installations), polyethylene (27 installations), flexwell (1 installation)
The type of installed utility	pressure sewer, gravity sewer, gas pipeline, oil pipeline, water pipeline, energy cables, telecommunications cables, fiber optic cables
Steering system	walkover (19), wireline (49), Optical Gyro (65)
Ground conditions	sands, clay, silt, bedrock, cobbles, boulders, gravels, anthropogenic land, etc.
Season	Spring, Summer, Autumn, Winter
Optional possibility of applying tools and machines	mud motor, mud cleaning system, ballasting system, roller blocks, roller cradles, side cranes

Table 3. The structure of the analysed data set in terms of the number of occurrences of particular unwanted events.

No. of HDD Projects	Event	$e_{1}$	$e_{2}$	$e_{3}$	$e_{4}$	$e_{5}$	$e_{6}$	$e_{7}$	$e_{8}$	$e_{9}$	$e_{10}$	$e_{11}$	$e_{12}$	$e_{13}$	$e_{14}$	$e_{15}$	$e_{16}$	$e_{17}$	$e_{18}$	$e_{19}$	$e_{20}$	$e_{21}$	$e_{22}$
Did not occur (0)	78	110	110	108	103	95	98	122	96	113	115	114	131	94	86	78	105	76	90	109	105	119	94
Occurred (1)	55	23	23	25	30	38	35	11	37	20	18	19	2	39	47	55	28	57	43	24	28	14	39

Table 4. The summarization of a confusion matrix.

		Event Occurrence
		True	False
Event predicted	True	True positive (TP)	False positive (FP)
Event predicted	False	False negative (FN)	True negative (TN)
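
The recall, precision, accuracy and $f_{1}$ values reported in the performance tables follow from the four cells of this confusion matrix through their standard definitions. The sketch below shows the arithmetic. The counts are hypothetical: they were chosen for illustration so that a 27–project test set gives values consistent with the rounded $e_{5}$ row of Table 6, but the source does not report the underlying counts, and the function name is an assumption.

```python
def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning NaN when the metric is undefined (zero denominator)."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Standard metrics derived from the confusion-matrix cells TP, FP, FN, TN."""
    recall = _ratio(tp, tp + fn)        # share of real events that were predicted
    precision = _ratio(tp, tp + fp)     # share of predicted events that were real
    accuracy = _ratio(tp + tn, tp + fp + fn + tn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    return {"recall": recall, "precision": precision, "accuracy": accuracy, "f1": f1}


# Hypothetical counts (not from the source): 27 test projects in total.
metrics = classification_metrics(tp=6, fp=1, fn=2, tn=18)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For risk assessment, recall is the most safety–relevant of the four, because a false negative means an unwanted event that was not anticipated.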

Table 5. Chosen correlated dimensions and their Pearson’s correlation coefficients.

Dimension	Correlated Dimension	Pearson’s Correlation Coefficient
$e_{15}$	$x_{81}$ (Sand layer with homogeneous grain size distribution)	0.85
$e_{22}$	$x_{133}$ (Cheapest contractor chosen)	0.95
$e_{2}$	$e_{1}$	0.89
$e_{5}$	$x_{138}$ (Project abroad)	0.85
$e_{8}$	$x_{22}$ (Quality cable coating)	0.86
$e_{9}$	$x_{34}$ (Tool repaired)	0.94
$e_{9}$	$x_{117}$ (Positive previous experience with a supplier)	−0.86
$e_{9}$	$e_{10}$	0.94
$e_{9}$	$e_{11}$	0.97
$e_{10}$	$x_{34}$ (Tool repaired)	1.0
$e_{10}$	$x_{117}$ (Previous positive experience with a supplier)	−0.91
$e_{10}$	$x_{118}$ (Materials quality certificate)	−0.87
$e_{10}$	$e_{11}$	0.97
$e_{11}$	$x_{34}$ (Tool repaired)	0.97
$e_{11}$	$x_{117}$ (Previous positive experience with a supplier)	−0.88
$e_{11}$	$e_{19}$	0.87
$e_{19}$	$x_{116}$ (Certified supplier)	−0.88
$e_{19}$	$x_{117}$ (Previous positive experience with a supplier)	−0.87
$e_{15}$	$x_{118}$ (Materials quality certificate)	−0.87
$x_{103}$ (Supervisor certified)	$x_{101}$ (Chief superintendent engineer certified)	0.91
$x_{48}$ (Mud motor certificate)	$x_{43}$ (Mud motor inspections)	0.88
$x_{48}$ (Mud motor certificate)	$x_{44}$ (New elastomeric elements)	0.95
$x_{49}$ (Mud motor OHS)	$x_{43}$ (Mud motor inspection)	0.90
$x_{136}$ (Risk level)	$x_{71}$ (No. of site investigation methods)	−0.94
$x_{136}$ (Risk level)	$x_{121}$ (Proximity to existing utilities–acting as a drainage)	−0.88
$x_{136}$ (Risk level)	$x_{106}$ (No. of shifts)	−0.91
$x_{136}$ (Risk level)	$x_{128}$ (Season spring)	−0.87
$x_{136}$ (Risk level)	$x_{131}$ (Season winter)	−0.87
$x_{136}$ (Risk level)	$x_{144}$ (Drilling fluid additives against collapse)	−0.93
$x_{136}$ (Risk level)	$x_{6}$ = Rail crossing (Crossed obstacle)	−0.86
$x_{136}$ (Risk level)	$x_{15}$ = GYRO (Steering system type)	−0.89
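
Each coefficient in this table is a Pearson correlation between two columns of the data set. As a minimal sketch of how such a value is obtained, the function below computes the coefficient from its definition (covariance divided by the product of the standard deviations). The two binary vectors are invented for illustration and are not taken from the 133 projects; the variable names are assumptions.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical data: 1 = yes / occurred, 0 = no / did not occur.
tool_repaired = [1, 1, 0, 0, 1, 0, 0, 1]
breakdown = [1, 1, 0, 0, 1, 0, 0, 0]

r = pearson(tool_repaired, breakdown)
print(f"r = {r:.3f}")
```

A coefficient close to 1 or −1 between an attribute and an event, as in several rows above, indicates that the attribute alone nearly determines the event within this data set; it does not by itself establish causation.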

Table 6. Performance results for the logistic regression model testing.

Event	Recall	Precision	Accuracy	$f_{1}$	AUC
$e_{5}$	0.750	0.857	0.889	0.800	0.849
$e_{8}$	0.889	1.000	0.963	0.941	0.944
$e_{9}$	1.000	1.000	1.000	1.000	1.000
$e_{10}$	1.000	1.000	1.000	1.000	1.000
$e_{11}$	1.000	1.000	1.000	1.000	1.000
$e_{15}$	0.600	1.000	0.852	0.750	0.800
$e_{19}$	1.000	1.000	1.000	1.000	1.000
$e_{22}$	0.857	1.000	0.963	0.923	0.929

Table 7. Performance results for the developed random forests model in a test dataset.

Event	Recall	Precision	Accuracy	$f_{1}$	AUC
$e_{1}$	0.600	1.000	0.926	0.750	0.800
$e_{2}$	0.600	1.000	0.926	0.750	0.800
$e_{3}$	0.667	1.000	0.926	0.800	0.833
$e_{4}$	0.600	1.000	0.926	0.750	0.800
$e_{5}$	0.750	1.000	0.926	0.857	0.875
$e_{6}$	0.625	1.000	0.889	0.769	0.813
$e_{7}$	1.000	1.000	1.000	1.000	1.000
$e_{8}$	0.778	1.000	0.926	0.875	0.889
$e_{9}$	1.000	1.000	1.000	1.000	1.000
$e_{10}$	1.000	1.000	1.000	1.000	1.000
$e_{11}$	1.000	1.000	1.000	1.000	1.000
$e_{12}$	—	—	—	—	—
$e_{13}$	0.857	0.857	0.926	0.857	0.904
$e_{14}$	0.889	0.889	0.926	0.889	0.917
$e_{15}$	0.900	0.900	0.926	0.900	0.921
$e_{16}$	1.000	1.000	1.000	1.000	1.000
$e_{17}$	0.818	0.900	0.889	0.857	0.878
$e_{18}$	1.000	0.778	0.926	0.875	0.950
$e_{19}$	1.000	1.000	1.000	1.000	1.000
$e_{20}$	0.600	1.000	0.926	0.750	0.800
$e_{21}$	0.333	1.000	0.926	0.500	0.667
$e_{22}$	1.000	1.000	1.000	1.000	1.000
OK	1.000	1.000	1.000	1.000	1.000

Table 8. Performance results of the proposed ANN model in a test dataset.

Event	Recall	Precision	Accuracy	$f_{1}$	AUC
$e_{1}$	1.000	1.000	1.000	1.000	1.000
$e_{2}$	1.000	1.000	1.000	1.000	1.000
$e_{3}$	0.667	1.000	0.926	0.800	1.000
$e_{4}$	0.800	1.000	0.963	0.889	0.982
$e_{5}$	0.875	0.875	0.926	0.875	0.967
$e_{6}$	0.875	0.875	0.926	0.875	0.980
$e_{7}$	1.000	1.000	1.000	1.000	1.000
$e_{8}$	0.889	1.000	0.963	0.941	1.000
$e_{9}$	1.000	1.000	1.000	1.000	1.000
$e_{10}$	1.000	1.000	1.000	1.000	1.000
$e_{11}$	1.000	1.000	1.000	1.000	1.000
$e_{12}$	—	—	—	—	—
$e_{13}$	0.857	0.857	0.926	0.857	0.971
$e_{14}$	0.778	1.000	0.926	0.875	0.975
$e_{15}$	0.900	0.900	0.926	0.900	0.965
$e_{16}$	1.000	0.667	0.926	0.800	1.000
$e_{17}$	0.909	0.909	0.926	0.909	0.994
$e_{18}$	1.000	0.778	0.926	0.875	0.993
$e_{19}$	1.000	0.800	0.963	0.889	1.000
$e_{20}$	0.800	1.000	0.963	0.889	0.982
$e_{21}$	0.333	1.000	0.926	0.500	0.944
$e_{22}$	1.000	0.875	0.963	0.933	0.993
OK	1.000	1.000	1.000	1.000	1.000
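
To make the comparison between Tables 7 and 8 concrete, the sketch below transcribes the recall columns of both tables and summarizes them. The unweighted mean over events is a simple summary chosen here for illustration, not a statistic reported by the authors; $e_{12}$ is omitted because neither table reports values for it.

```python
from statistics import mean

# Events with reported values: e1..e22 without e12, plus the overall outcome "OK".
EVENTS = [f"e{i}" for i in range(1, 23) if i != 12] + ["OK"]

# Recall column of Table 7 (random forests), in the order of EVENTS.
RF_RECALL = [0.600, 0.600, 0.667, 0.600, 0.750, 0.625, 1.000, 0.778, 1.000,
             1.000, 1.000, 0.857, 0.889, 0.900, 1.000, 0.818, 1.000, 1.000,
             0.600, 0.333, 1.000, 1.000]

# Recall column of Table 8 (ANN), in the order of EVENTS.
ANN_RECALL = [1.000, 1.000, 0.667, 0.800, 0.875, 0.875, 1.000, 0.889, 1.000,
              1.000, 1.000, 0.857, 0.778, 0.900, 1.000, 0.909, 1.000, 1.000,
              0.800, 0.333, 1.000, 1.000]

rf_mean = mean(RF_RECALL)
ann_mean = mean(ANN_RECALL)

# Events on which one model has strictly higher recall than the other.
ann_better = [e for e, rf, ann in zip(EVENTS, RF_RECALL, ANN_RECALL) if ann > rf]
rf_better = [e for e, rf, ann in zip(EVENTS, RF_RECALL, ANN_RECALL) if rf > ann]

print(f"mean recall: random forests {rf_mean:.3f}, ANN {ann_mean:.3f}")
print("ANN higher on:", ann_better)
print("random forests higher on:", rf_better)
```

Both models share the weakest result, a recall of 0.333 for $e_{21}$ (unfavourable weather conditions).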

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
