Prediction of risk delay in construction projects using a hybrid artificial intelligence model

Abstract: Project delays are a major problem faced by the construction sector owing to the complexity and uncertainty of construction activities. Artificial Intelligence (AI) models have demonstrated their capacity to solve dynamic, uncertain and complex tasks. The aim of this study is to develop a hybrid artificial intelligence model, an integrative Random Forest classifier with Genetic Algorithm optimization (RF-GA), for delay problem prediction. First, the sources and factors of delay problems are identified. A questionnaire is adopted to quantify the impact of delay sources on project performance. The developed hybrid model is trained using data collected from previous construction projects. The proposed RF-GA is validated against the classical version of the RF model using statistical performance measures. The hybrid RF-GA model performed well in terms of accuracy, kappa and classification error, attaining 91.67%, 87% and 8.33%, respectively. Overall, the proposed methodology provides a robust and reliable technique for project delay prediction that contributes to construction project management, monitoring and sustainability.


Research Background
The construction sector has a crucial role in improving the economics of developed countries [1]. The success of this sector is measured by time, cost and quality performance of construction projects. Prediction of construction durations represents a problem for both researchers and project managers. The construction process is subject to many factors and unpredicted variables that result from many sources. These sources prevent the completion of projects within the specified time and lead to a delay risk in the construction process [2,3].
Delay risk is considered one of the major challenges faced by construction firms [3,4]. Delay can be defined as an action or event that extends the time required to complete the project beyond that specified in the contract [3]. Project delay has an adverse effect on project performance, leading to cost overruns and reduced productivity. Its effects extend to the owner, consultant and contractor in the form of litigation, dispute and arbitration [5]. Delays are caused by many sources and partnerships in infrastructure projects [26]. The method performed with good accuracy in the prediction process for public and private projects. Heravi and Eslamdoost (2015) investigated the potential of an ANN model for the prediction of labor productivity in construction projects [27]. The results showed that the ANN model provided better modeling of labor productivity. Gerassis et al. (2016) applied Bayesian networks to analyze the causes of accidents in embankment construction [28]. The study revealed that this method provided an accurate identification of embankment stability in civil engineering projects. The related literature indicates that the application of AI models is still a new methodology in construction management research and delay risk prediction [29]. Few studies have used AI models in risk prediction and classification. Asadi et al. (2015) used a decision tree and a Naive Bayes model based on a questionnaire survey to predict delay in construction logistics. The decision tree achieved a higher accuracy (79.41%) than the Naive Bayes model (73.52%) [30]. Naji et al. (2018) used a Bayesian decision tree model to predict the impact of contract changes on the time and quality performance of construction projects [31]. The model performed with good accuracy in the prediction process and led to an improvement in project performance. Gondia et al. (2020) utilized Naive Bayes and decision tree models to predict the delay risk in construction projects. The study demonstrated the power of AI models in delay risk prediction and in improving risk management strategies [11]. Based on the studies reported in the literature, the current research aims to provide a reliable methodology for delay risk prediction that contributes to the baseline knowledge of construction management. Because standalone AI models have limitations in tuning their internal parameters for an optimal learning process [32], the current study integrates a nature-inspired optimization algorithm, the Genetic Algorithm (GA), with a Random Forest (RF) model. The GA optimization approach has been demonstrated to be a reliable technique for tuning AI models in multiple engineering applications and was therefore selected for the current study [33][34][35].

Research Objectives
In the current study, the authors aim to explore and develop an effective tool to predict delay risk problems based on delay sources using previous construction projects data. The main contribution of the current investigation is to provide an accurate methodology that can assist in the prediction of future durations and monitor risk levels, based on these projects. This work can enhance a proactive approach in risk management. To achieve this aim, sources and factors of delay risk are extracted from the literature, and then related data to the delay risk problems are collected from previous construction projects. A questionnaire survey is adopted to measure the impact of various sources on the delay level in construction projects. Based on the complex nature of the construction process and the associated uncertainties of the delay sources, a hybrid model based on the integration of the Random Forest and Genetic Algorithm (RF-GA) is developed in order to analyze the data of completed previous projects. The performance of the developed model is studied statistically and discussed comprehensively. The potential of the proposed RF-GA model is validated against the classic RF model.

Random Forest Model
The RF model was first developed by Breiman (2001) based on the combination of decision tree classifiers [36]. Each tree provides a prediction for the class label, and the algorithm selects the class that receives the most votes. Random Forest is a very popular tool that uses the bootstrapping method to draw training samples and construct multiple random trees [37]. The algorithm gained significant importance because it is invariant under scaling and robust to the inclusion of irrelevant features [38]. Several studies examined the application of Random Forest in engineering applications and demonstrated its feasibility in prediction processes [26,34,35]. Under the bootstrapping method, the training data for each tree are selected randomly and independently, and the data that are not selected are named "out-of-bag" [41]. During this process, the out-of-bag data are permuted and the prediction error is measured to estimate the importance of input variables [41,42]. In the RF algorithm, a large number of trees limits overfitting, and the choice of the right type of random variables leads to accurate classification. Random Forests contain several parameters that need to be optimized, such as the number of trees, minimum gain and maximum tree depth. In this study, these parameters were optimized by a genetic algorithm.
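The bootstrapping and voting steps described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the helper names `bootstrap_indices` and `majority_vote` and the five tree votes are hypothetical, and only the sample size of 28 training projects is taken from the text.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    # Samples never drawn for this tree form its out-of-bag set.
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag

def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
in_bag, oob = bootstrap_indices(28, rng)  # 28 training projects, as in this study

# Hypothetical votes from five trees for a single project.
votes = ["<50%", ">100%", ">100%", "50%-100%", ">100%"]
print(len(in_bag), len(oob), majority_vote(votes))
```

Because sampling is done with replacement, some projects appear several times in a tree's bag while others are left out, and those left-out projects are what the out-of-bag error estimate relies on.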

The Hybrid RF-GA Model
The genetic algorithm is a popular technique, based on natural selection, for solving optimization problems in complex systems [43]. To solve a problem with a GA, a population of random solutions is generated, and the fitter solutions are then selected to develop the model. New solutions are produced by selection, crossover and mutation. Each solution is represented by a chromosome, typically a string of bits. Each position in the string is called a gene, and the values that a gene can take are named alleles. GA has been widely applied in different fields, such as image processing, pattern recognition and control systems [44]. In construction management research, GA has been used for resource optimization [45], project scheduling [46,47], time and cost optimization in construction projects [48] and dispute classification [49]. In the present study, a GA is used to optimize the parameters of the RF model, including the number of trees, minimum gain and maximum tree depth. The hybrid RF-GA model is illustrated in Figure 1.
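The selection, crossover and mutation loop can be sketched in Python as follows. This is an illustrative sketch under several assumptions: genes are stored as real or integer values rather than bit strings, the search ranges in `BOUNDS` are invented for illustration, and `toy_fitness` is a hypothetical stand-in for the validation accuracy of an RF model trained with the candidate parameters.

```python
import random

# Illustrative search ranges (assumptions, not the values used in the study).
BOUNDS = {"n_trees": (10, 200), "min_gain": (0.0, 0.5), "max_depth": (2, 20)}

def random_gene(name, rng):
    lo, hi = BOUNDS[name]
    return rng.randint(lo, hi) if isinstance(lo, int) else rng.uniform(lo, hi)

def random_chromosome(rng):
    return {name: random_gene(name, rng) for name in BOUNDS}

def toy_fitness(c):
    """Hypothetical stand-in for the accuracy of an RF trained with parameters c."""
    return -(abs(c["n_trees"] - 100) / 100
             + abs(c["min_gain"] - 0.1)
             + abs(c["max_depth"] - 10) / 10)

def tournament(population, fitness, rng, k=3):
    # Selection: the fittest of k randomly drawn candidates becomes a parent.
    return max(rng.sample(population, k), key=fitness)

def crossover(a, b, rng):
    # Uniform crossover: each gene is copied from one of the two parents.
    return {name: (a if rng.random() < 0.5 else b)[name] for name in BOUNDS}

def mutate(c, rng, rate=0.2):
    # Mutation: each gene is resampled within its range with probability `rate`.
    return {name: random_gene(name, rng) if rng.random() < rate else value
            for name, value in c.items()}

def run_ga(fitness, generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        elite = max(population, key=fitness)  # elitism: keep the best solution
        children = []
        for _ in range(pop_size - 1):
            parent_a = tournament(population, fitness, rng)
            parent_b = tournament(population, fitness, rng)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = [elite] + children
    return max(population, key=fitness)

best = run_ga(toy_fitness)
print(best)
```

In a real RF-GA setup, `toy_fitness` would be replaced by a function that trains a Random Forest with the candidate parameters and returns its accuracy on held-out data.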

Identification of Delay Sources and Factors in Construction Projects
The most important factors that affect delays in construction projects were identified from a literature survey, and these factors were categorized into different sources. These sources included owner, designer, contractor, project, material, equipment, labor and external factors. To obtain more information on the delay problems and their factors in construction projects, interviews were held with 15 experts in construction work [11]. These interviews confirmed the identified sources and factors and their relevance to the construction industry. Based on the interviews and the literature review, the most important delay factors and their sources were identified as shown in Table 1.

Data Collection
The compiled data included 40 completed projects that had different degrees of time overrun. These projects were executed in Diyala city, Iraq. The collected data included historical documents of previous projects that were investigated to extract the measure of risk delay in construction projects. These documents included contract documents, specifications, change orders records and schedule baselines. To complete data collection, a questionnaire survey was arranged and constructed. Each questionnaire form contained a construction project and another nine variables.
The first variable represented the delay level, and the other eight variables referred to the risk delay sources in the construction project. Each risk source was given scores depending on two scales. The first scale was the probability of risk to occur in the construction project and the second related to the impact of sources on the delay of the construction project, as shown in Table 2. The overall risk impact was evaluated by multiplying the two scales [3,43,44].
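As a small worked example of the risk scoring, the Python sketch below multiplies a probability rating by an impact rating. The numeric coding of the five levels as 1 to 5 and the ratings themselves are assumptions made for illustration; the actual scales are those given in Table 2.

```python
# Assumed numeric coding of the five rating levels (1 = very low, 5 = very high).
LIKERT = {"very low": 1, "low": 2, "medium": 3, "high": 4, "very high": 5}

def risk_score(probability, impact):
    """Overall risk impact = probability rating x impact rating."""
    return LIKERT[probability] * LIKERT[impact]

# Hypothetical ratings for three delay sources in one project.
ratings = {
    "owner": ("high", "very high"),
    "contractor": ("medium", "high"),
    "material": ("low", "low"),
}
scores = {source: risk_score(p, i) for source, (p, i) in ratings.items()}
print(scores)  # {'owner': 20, 'contractor': 12, 'material': 4}
```

Under this coding the score for a source ranges from 1 to 25.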
The probability and impact of the variables were measured using a five-point Likert scale ranging from very low to very high. The input variables were classified as: very low, low, medium, high and very high. The output variable (delay level) was classified into three classes, which reduced the bias during the execution of the artificial intelligence model. Delay level was categorized as: <50% delay, 50%-100% delay and >100% delay. The questionnaire was first subjected to a pilot study to measure its reliability, investigate problems and identify the items that were more confusing than the others. The authors selected 40 parties for the pilot study, as the recommended size of such a study ranges between 30 and 50 parties [52]. To confirm the questionnaire reliability, Cronbach's alpha was adopted; in this study the value of the alpha coefficient is 91.8%, which confirms the reliability of the questionnaire.
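Cronbach's alpha can be computed with the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The Python sketch below applies it to a small set of hypothetical responses; it does not use the pilot-study data.

```python
from statistics import variance

def cronbach_alpha(responses):
    """responses: one list of item scores per respondent (all the same length)."""
    k = len(responses[0])                    # number of questionnaire items
    items = list(zip(*responses))            # transpose to per-item columns
    item_var = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var / total_var)

# Hypothetical answers of five respondents to three items (1-5 scale).
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 5],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Values close to 1 indicate that the items vary together across respondents, which is the sense in which the reported 91.8% supports the reliability of the questionnaire.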

Model Development Procedure
The questionnaire was distributed to 300 experts who worked on the collected projects. The experts belonged to different parties, including the client, the engineer and other experts on these projects. The collected projects were divided into two sets: 70% of the total projects (28 projects) were used for the training phase and 30% (12 projects) were used for the testing phase.
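A 70/30 division of 40 projects can be sketched in Python as below. The text does not state how projects were assigned to each set, so the random shuffle, the seed and the function name `split_projects` are assumptions for illustration.

```python
import random

def split_projects(project_ids, train_fraction=0.7, seed=42):
    """Shuffle the project IDs and split them into training and testing sets."""
    rng = random.Random(seed)
    shuffled = project_ids[:]
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

train, test = split_projects(list(range(1, 41)))  # 40 projects, IDs 1..40
print(len(train), len(test))  # 28 12
```

With only 12 testing projects, each misclassified project changes the accuracy by about 8.33 percentage points, which is worth keeping in mind when reading the reported results.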

Model Performance Measures
The performance of the predicted model was evaluated using class performance and overall performance measures. Class performance was measured by precision, sensitivity and specificity [53,54]. The overall performance was evaluated by accuracy, classification error and the kappa statistic. The kappa coefficient (k) measures the quality of a classification based on inter-classifier agreement [55,56]. The performance measures are computed from the following quantities: TP is the number of positive instances that are correctly recognized by the algorithm; FP is the number of negative instances that are incorrectly classified as positive; TN is the number of negative instances that are correctly predicted by the algorithm; FN is the number of positive instances that are incorrectly classified as negative; Po is the observed agreement between raters; and Pe is the probability of chance agreement.
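The equations themselves are not reproduced in this text, so the Python sketch below uses the standard definitions: precision = TP/(TP+FP), sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), accuracy = (TP+TN)/(TP+TN+FP+FN), classification error = 1 - accuracy, and kappa = (Po - Pe)/(1 - Pe). The confusion matrix is hypothetical (it is not one of the paper's tables) and is laid out with rows as predicted classes and columns as actual classes.

```python
def class_metrics(cm, k):
    """One-vs-rest metrics for class k; cm[i][j] = count predicted i, actual j."""
    n = len(cm)
    total = sum(map(sum, cm))
    tp = cm[k][k]
    fp = sum(cm[k]) - tp                        # predicted k, actually another class
    fn = sum(cm[i][k] for i in range(n)) - tp   # actually k, predicted another class
    tn = total - tp - fp - fn
    # Note: a class that is never predicted would give a zero denominator here.
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def overall_metrics(cm):
    n = len(cm)
    total = sum(map(sum, cm))
    po = sum(cm[i][i] for i in range(n)) / total             # observed agreement
    pe = sum(sum(cm[i]) * sum(row[i] for row in cm)          # chance agreement
             for i in range(n)) / total ** 2
    return {"accuracy": po, "error": 1 - po, "kappa": (po - pe) / (1 - pe)}

# Hypothetical 12-project test result with 11 correct predictions.
cm = [
    [4, 1, 0],
    [0, 3, 0],
    [0, 0, 4],
]
print(class_metrics(cm, 0))
print(overall_metrics(cm))
```

For this hypothetical matrix the accuracy is 11/12 (about 91.67%) and kappa is 0.875, which shows how a single misclassification among 12 test projects translates into the overall measures.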

Results and Discussion
Analysis of the collected data from 40 projects was conducted to identify the sources of delay problems effectively. The properties of the compiled data and the distribution of delay sources among the construction projects are presented in Figures 3 and 4. Figure 3 shows that the counts of projects with a <50% delay, 50%-100% delay and >100% delay were 10 (25%), 14 (35%) and 16 (40%), respectively. The largest share of projects belongs to the class of >100% delay. Figure 4 demonstrates the distribution of delay sources among each class of delay problem, which was obtained from the historical records, pilot study and distributed questionnaire. These outcomes revealed that the contractor, owner, designer, project and external factors have a higher impact than the other delay sources. Owner, designer, contractor and project represent the internal risk sources that have an impact on project delay. The impact of external factors can be explained by the special circumstances experienced in the studied region (Iraq), which severely affected the construction industry. These conditions have an enormous impact on project stakeholders and project performance, and they resulted in the stumbling and failure of many projects in the construction sector. On the other hand, the application of a robust predictive model can contribute to estimating an accurate duration in construction projects and analyzing delay risk sources that arise from the complex and dynamic nature of the construction sector.
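The three delay classes and the reported class shares can be checked with a short Python sketch. The function name `delay_class` is hypothetical, and the handling of the boundary values (exactly 50% and exactly 100% are placed in the middle class) is an assumption, since the text does not specify it; the counts are those reported for Figure 3.

```python
def delay_class(delay_percent):
    """Map a time overrun (percent of planned duration) to a delay class."""
    if delay_percent < 50:
        return "<50%"
    if delay_percent <= 100:   # assumed: 50 and 100 fall in the middle class
        return "50%-100%"
    return ">100%"

# Project counts per class as reported for the 40 collected projects.
counts = {"<50%": 10, "50%-100%": 14, ">100%": 16}
total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}
print(shares)  # {'<50%': 25.0, '50%-100%': 35.0, '>100%': 40.0}
```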
The statistical performance of the proposed hybrid RF-GA model on the training and testing datasets was evaluated against the classical Random Forest classifier using the model performance measures. The performance measures were computed from the confusion matrix of each classifier, which summarizes the performance of a classification model. The confusion matrices of the RF and RF-GA models are displayed in Tables 3 and 4. The columns in the confusion matrix represent the actual classification within each class, while the rows correspond to the predicted class. The correct predictions are located on the diagonal of the matrix. The confusion matrix of a high-performance model contains large numbers on its diagonal and zeros outside the diagonal. Precision, sensitivity, specificity, accuracy, classification error and kappa statistics were computed for the hybrid RF-GA and RF models during the training and testing phases and are presented in Tables 5 and 6. With regard to these measures, the RF-GA model exhibited a good performance in the prediction of delay in the construction sector. In the training phase, the minimum values of precision, sensitivity and specificity achieved by RF-GA were 87.5, 90 and 95.2, respectively. The lowest values for RF in terms of precision, sensitivity and specificity were 87.5, 83.33 and 94.44, respectively. Based on the comparison between the two classifiers, it can be concluded that the RF-GA model outperformed the classical RF model in both the training and testing phases. Tables 5 and 6 revealed the superiority of the RF-GA classifier in terms of accuracy, classification error and kappa statistics. This can be explained by the integration of the nature-inspired optimization algorithm (i.e., GA), which provided reliable hyperparameter optimization and thus a reliable learning process. The RF-GA model also provided higher values of precision, sensitivity and specificity in comparison with the RF model.
The RF-GA classifier showed a strong performance in terms of overall and class measure indices. These results can be explained by the ability of the genetic algorithm to solve optimization problems using the chromosome approach, and its capacity to handle multiple candidate solutions simultaneously [57]. It is also useful to validate the current results against those reported in the literature. As reported in Table 7, the RF-GA model demonstrated superior prediction performance in comparison with previously established studies. The capacity of the RF-GA model was compared with the best reported outcomes, and the RF-GA model exceeded all of them. To summarize, a proactive management approach involves the identification of new delay risk sources and the monitoring of the sources that arise during the project lifecycle. As a result, a reliable and robust analysis tool that is able to mimic and comprehend the dynamic input variables is highly needed for this purpose. Hence, based on the established methodology of the current research, the potential of the RF-GA model to be modified and set up for project duration prediction through the project lifecycle was evidenced. The RF-GA model was successfully developed for the investigated dynamic project delay risk prediction.

Conclusions
In the present study, an analysis tool capable of predicting the delay level in construction projects based on delay sources was proposed. To meet this goal, two approaches were adopted. First, delay sources and factors were collected from a literature review and confirmed through expert interviews. Data related to delay levels were compiled from 40 construction projects located in Diyala city, Iraq. The collected data included historical records of previous projects, and a questionnaire was prepared and distributed to 300 experts to extract information about delay sources in construction projects. Risk sources were measured by computing the probability and the impact of each source. The data and the distribution of delay sources among the collected projects were analyzed in order to better understand delay factors in the construction sector. Second, a hybrid RF-GA model was developed to deal with the complex and dynamic nature of data in the construction sector. The RF-GA model was evaluated by performance measure indices and compared with the classical RF model. Based on the analysis results, RF-GA performed better than the RF model. The RF-GA model achieved accuracy, classification error and kappa values of 91.67%, 8.33% and 87%, respectively. These results reflect the ability of the model to handle the nonlinearity and complexity of data in the construction sector. The results also revealed the capability of the genetic algorithm in solving problems with multiple solutions.