Artificial Neural Network Model to Predict Final Construction Contract Duration

Forecasting the final construction contract duration at an early stage plays a vital role in the progress of a project. An inaccurate project duration prediction may lead to the project's benefits being lost. Predicting the duration precisely is essential but difficult because several different factors affect it. This paper contributes a model to predict the final construction contract duration (FCCD) in the early stages based on a few parameters that are common to any contract (contract cost, contract duration, and sector). The paper developed an Artificial Neural Network (ANN) model based on data from 135 Saudi construction projects. The model development has three stages. The first stage was standardization and augmentation using Zavadskas and Turskis' logarithmic method and Pasini's method. The second and third stages were the first and second analyses of the ANN models, respectively. The first analysis aimed to promote (refine) the data used and pass them to the second analysis to develop the ANN model. The ANN models were compared with three linear regression (LR) models (LR1, LR2, and LR3) and other models in the literature. The results revealed that the ANN model provides reasonable accuracy, with an average mean absolute percentage error (MAPE) of 12.22%, which is lower than LR3's MAPE by 27.03 percentage points. The accuracy of the ANN model is similar to that of earned value management (EVM) in a previous study. This paper supports research that deals with relatively little data and integrates them into a neural network. The ANN model assists stakeholders in making appropriate decisions for the project during the pre-tendering phase by predicting the actual contract duration based on the contract cost (CC), contract duration (CD), and project sector.


Introduction
The construction industry contributes significantly to countries' economic progress and to the overall Gross Domestic Product (GDP), and it is expected to expand. Delays in construction projects have become a widespread issue due to the complexity of the construction industry. Although the sector has a positive impact on the economy and has seen technological improvements [1,2], construction delays have a wide range of social and economic repercussions. These delays negatively impact the social, environmental, and financial triple bottom line of sustainability [3]. Delays can lead to schedule and cost overruns, decreased contractor earnings, additional losses of the owner's capital due to an extended construction phase, mistrust between the owner and contractor, legal battles involving many parties, and outright project abandonment. Gebrehiwet and Luo [4] noted that cost overruns, contract cancellation, arbitration, and litigation are some of the crucial effects of a delay. According to Khattri et al. [5], a delay can result in disagreements, cost overruns, time overruns, abandonment, negotiation, legal action, litigation, and complete desertion. Numerous studies have been carried out over the years to address this significant issue, especially to identify the underlying factors that increase the probability of a construction delay and its adverse effects.

Literature Review
Studies on forecasting the construction contract duration fall into three types of models. The first type predicts the final duration based on the causes of the project delay. The second type is based on historical project data, such as earned value management (EVM). The third type depends on the characteristics of a contract (artificial neural network, regression, and hybrid models) or on information from a previously completed contract that had the same conditions as the contract in question (case-based reasoning, CBR).

Forecast Model Based on the Causes of the Time Delays
In this section, the researchers tried to find the significant factors that affected the time delay and integrated them as input data to forecast the final construction contract duration. In a general project, Al-Gahtani et al. [14] used ten previously discovered criteria to construct a simulation forecasting model for the delay duration in Saudi projects using system dynamics. In order to consider the ten factors that influence project delay, they carried out a systematic, integrated approach using the DEMATEL methodology and system dynamics (SD). This work solved the challenge of methodically creating a causal loop diagram inside the SD modeling process using the DEMATEL technique. Next, consistency and extreme conditions were tested on the generated SD model. Then, it was implemented and validated using three case studies in KSA by contrasting each case study's real and fitted progress curves. In addition, Ajayi and Chinda [2] developed a model to examine the impact of the factors on the final construction project time. The model combined two mathematical decision-making techniques, DEMATEL and SD modeling. The simulation findings highlight the significance of avoiding design errors at the project's beginning (or preconstruction stage) to reduce project delays.
For highway projects, Pewdum et al. [15] developed models to forecast the final cost and duration of a highway construction project while it was still in the planning stages. Before designing the forecasting models, project data were gathered and examined to determine the variables influencing the project's ultimate budget and duration. The research for these models was based on the ANN. Han et al. [16] examined the influence of the non-value-adding effort generated from design errors and changes in design on the time delay of the project using system dynamics.
In order to facilitate reliable project delay risk analysis and forecasting using objective data sources, Gondia et al. [17] refined and built machine learning (ML) algorithms (decision tree and naive Bayesian algorithms). The relevant delay risk sources and components were first identified. A multivariate data set of past project timeliness and delay-inducing risk sources was assembled. Exploratory data analysis was then used to reveal the system's intricacy and interconnectivity. In order to anticipate the extent of project delays, the two appropriate algorithms were selected and trained using the data set. These models used decision trees and naive Bayesian classification algorithms. Finally, cross-validation tests were performed on both models to assess their predictive abilities. The models were then compared using performance metrics pertinent to ML.

Forecast Model Based on Characteristics of a Contract or Project
Although the earned value management (EVM) approach is a successful project oversight and management strategy in terms of foretelling the cost performance index and other cost indicators, the technique may require more improvements to be more effective at estimating the project's completion time [18,19]. Vanhoucke and Vandevoorde [19] assumed that project activities and precedence relations were known to predict the final contract duration (FCCD). Urgilés et al. [20] examined the adequate EVM and value schedule to forecast the final duration of hydroelectric power generation projects. Sackey et al. [21] also developed a new method based on the EVM to forecast the final construction contract duration (FCCD). They used the actual time spent on each activity. One of the challenges faced by the users of the management method in predicting the actual duration of the contract is that the method requires historical data for the project. In other words, EVM also needs accurate information from a project, such as its cost, earned value, and planned value, at any given time, and it may not be possible to predict it at an early stage of the contract.
On the other hand, the case-based reasoning method is mainly used to forecast the construction project cost. However, Jin et al. [22] established a CBR model that can correctly predict the FCCD at the planning stage.

Regression and ANN Models
Several studies utilized regression and ANN models to estimate the FCCD. For example, Skitmore and Ng [23] developed a regression model, assessed by cross-validation, using data from 93 Australian building projects. The model parameters were project type, sector, and contractor selection. Thomas and Thomas [24] developed a regression model to predict the building project duration based on 51 historical records. The model parameters were the area of the building, estimated duration, and estimated cost. The model did not consider the project sector and cannot be used for other types of projects, such as electrical or mechanical projects. The artificial neural network method has proved more advanced and has performed better than the regression model [25]. The ANN model developed in [26] forecasts the duration of building projects. The input data included the number of floors, foundation type, activities, contractor class, client class, and floor area. The mean absolute error was 25.9%. Moreover, Gab-Allah et al. [27] established an ANN model for predicting the building project duration. The parameters were the type of client, construction quality, project location, the total height of the project, client coordination with contractor staff, contract type (unit price or lump sum), contractor selection method, and quality of project documentation. The maximum error of the model was 20%.
The previous models required specific information that varies from one project to another, such as the models developed by Al-Gahtani et al. [14], Ajayi and Chinda [2], and Pewdum et al. [15], or contract data that become available only during the construction stage, such as the model of Sackey et al. [21]. Although the CBR method has proven its effectiveness in predicting the contract duration, it requires a previously completed project whose characteristics and operational conditions are similar to those of the project in question, which may be difficult to find. The regression and ANN models reviewed above were developed for building project duration and cannot be generalized to other project types. Therefore, there is a need for a predictive ANN model that can be used for different projects and is based on common and available data. This paper contributes a model to predict the FCCD in the early stages based on a few parameters that are common to any contract: contract cost (CC), contract duration (CD), and sector. However, ANN models need extensive data to provide a reliable and adequate forecast. In this paper, the available data set was relatively small (135 projects), making it challenging to handle with an ANN model. This issue was addressed with two methods: standardization (Zavadskas and Turskis' logarithmic method) and data augmentation using the method introduced by [28]. The resulting data were then used to develop the ANN model. The input data consist of contract cost, contract duration, and project sector. This study presents a prediction model of reasonable accuracy based on KSA project data. The analysis approach used to create the model can also be applied to projects from other regions.

Artificial Neural Network (ANN) Model
This section provides a brief introduction to ANN model structures. The Artificial Neural Network (ANN) is an ML method inspired by biological neural networks in the human brain. In situations where the actual process is complex and the nature of each phenomenon involved is uncertain, it is one of the best tools for value prediction [29]. Civil engineering is one of the areas that benefits from ML. Applications include earthquake engineering, structural health monitoring, damage identification and detection, and structural design. The ML-based tool VULMA is an example of such an application; it automates the estimation of a seismic vulnerability score for building structures [30,31].
An ANN builds a model that approximates an arbitrary function to estimate the outcome [29]. The structure of the ANN consists of three kinds of layers: an input layer, one or more hidden layers, and an output layer, as shown in Figure 1. The main aim of the hidden layer is to extract the most relevant patterns from the inputs and pass them on to the next layer for further analysis. Each hidden neuron applies two functions in sequence, S and a, as shown in Figure 2. The hidden layer also accelerates and improves the efficiency of the network by retaining only the most essential information from the inputs and discarding redundant information. The benefit of using an ANN in the Statistical Package for the Social Sciences (SPSS) software is its simplicity and its ability to handle small amounts of data, such as those in this article. Moreover, the IBM SPSS Statistics 20 program provides a clear illustration of the ANN model, showing the strength of the relationships among neurons and the bias values. Additionally, SPSS makes it simple to choose the percentages of data used for training and testing, and it reports the relative errors of the two processes together with the predicted values. The program also provides the weights of the connections among the neurons and the output computed by the ANN model. The activation function, in turn, allows the model to capture nonlinear relationships between the inputs and converts the input into a more usable output. The types of activation functions include the hyperbolic tangent function (Equation (1)) and the sigmoid function (Equation (2)).
Appl. Sci. 2023, 13, x FOR PEER REVIEW
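The neuron mechanism described above (a summation function S followed by an activation function a) can be illustrated with a minimal Python sketch. The function names and the numeric inputs are hypothetical and chosen only for illustration; the hyperbolic tangent and sigmoid are written in their standard textbook forms, which are assumed to correspond to Equations (1) and (2).

```python
import math


def tanh_activation(s: float) -> float:
    """Hyperbolic tangent: maps any real s into the interval (-1, 1)."""
    return math.tanh(s)


def sigmoid_activation(s: float) -> float:
    """Logistic sigmoid: maps any real s into the interval (0, 1).

    Written in a numerically stable form so that large |s| does not overflow.
    """
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    z = math.exp(s)
    return z / (1.0 + z)


def neuron_output(inputs, weights, bias, activation=tanh_activation):
    """One hidden neuron: S = sum(w_j * x_j) + bias, then a = activation(S)."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(s)


# Illustrative values only (not taken from the paper's data).
example = neuron_output([0.5, -1.0], [0.8, 0.2], bias=0.1)
```

In a trained network the weights and bias would come from the fitting procedure rather than being set by hand.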

Methodology
The methodology is mainly comprised of three stages: data initialization, ANN development, and an evaluation model. The data initialization consists of data collection, standardization, and augmentation, while the ANN development includes the first and second analyses. The first analysis stage represented normalizing and maximizing data using Zavadskas and Turskis' logarithmic method and the method introduced by [32]. These methods overcome the issue of relatively small data. Then, the accuracy of the developed ANN model was determined using the mean absolute percentage error (MAPE). The second analysis stage is developing the ANN models on the modified data generated from the first stage, as shown in Figure 3. The ANN models were evaluated by comparing the results with several other past models in the literature.

Data Initialization
The section mainly aimed to collect and prepare data to make them suitable for use in artificial neural networks. It can be achieved by performing three steps: collecting, standardizing, and augmenting the standardized data. Gebrehiwet and Luo [4] and Khattri et al. [5] pointed out that the time delay had an influence on the cost and time overrun, which were reflected in the contract by CC and CD, respectively. Therefore, the collected data parameters were CD, CC, FCCD, contract sector (public, semi-public, private), and contract type (building, electric, mechanic).

Data Collection
The information from previously finished projects was required to create models forecasting the final construction duration. A survey was created and sent to numerous organizations to gather information from the finished building projects in KSA (Appendix A).
The projects' scheduled start and finish dates were compared with the actual dates to determine the planned and actual project durations. The data were collected and reviewed, and invalid records were eliminated. Models for predicting the actual construction duration were developed and validated using data from 135 projects completed in KSA. The frequencies of public, semi-public, and private projects were 80 (59.26%), 49 (36.30%), and 6 (4.44%), respectively. The CC ranged from 18,200 SAR to 650,000,000 SAR. The CD varied from 0.47 months to 138.30 months, while the FCCD varied from 0.37 months to 146.00 months. The frequencies of building, electric, and mechanic projects were 66 (48.89%), 14 (10.37%), and 55 (40.74%), respectively.

Sample Size Examination
The adequacy of the sample size of 135 projects can be examined by calculating the minimum sample size under the normal distribution using Equation (3), which is based on the confidence level (95%), the probability value (p), which is set at 0.5, and the confidence interval (C), which should be less than 0.2 [33].

where Z is the value corresponding to the 95% confidence level, equal to 1.96. By setting C and p to 0.10 and 0.50, respectively, the minimum sample size was 49 projects, which is less than the number of collected projects (135). Therefore, the collected data were adequate.

Data Parameters Examination
This section examines the impact of CC, CD, project sector, and project type on the FCCD by performing a correlation test between the FCCD and the other parameters. The results of the test are shown in Table 1. The CD, CC, and project sector correlated with the FCCD, with Pearson coefficients of 0.784, 0.424, and 0.520, respectively. However, the project type did not correlate with the FCCD: the p-value of the test was 0.666 (more than 0.05) and the Pearson coefficient was 0.037 (very weak correlation). Therefore, the parameters considered as the ANN model's input layer were CD, CC, and project sector (public, semi-public, and private).
Anysz et al. [28] stated that different approaches to standardizing input data for an ANN result in different values of the accuracy metrics. They examined several normalization methods (vector, Manhattan, maximum, Weitendorf's linear, Peldschus' nonlinear, Zavadskas and Turskis' logarithmic, and Jüttler-Korth linear) and concluded that Zavadskas and Turskis' logarithmic method gives very small errors between the actual and computed outputs. Therefore, this method was used to standardize the 135 data sets. The standardization formula is shown in Equation (4).
where $\bar{x}_i$ is the standardized variable, $x_i$ is the original variable, and $n$ is the total number of data sets (135). The variable $x_i$ was set as CC, CD, or FCCD. However, the sector variable cannot be standardized in this way because it is a nominal data type. Therefore, the sector components (public, semi-public, and private) were entered in the ANN as factors, named PUB, SPUB, and PRI, respectively. For each project, the value of each factor was set to zero or one according to the collected data.
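As an illustration of the data preparation described above, the following Python sketch applies a logarithmic standardization to a scale variable and builds the 0/1 sector factors. It assumes the commonly cited form of Zavadskas and Turskis' logarithmic normalization, x̄_i = ln(x_i) / ln(x_1 · x_2 · ... · x_n), which is taken here to correspond to Equation (4). The function names are hypothetical, and the sample values are not the paper's data.

```python
import math

SECTORS = ("PUB", "SPUB", "PRI")


def log_standardize(values):
    """Logarithmic normalization (assumed form): ln(x_i) / ln(prod_k x_k).

    The logarithm of the product is computed as a sum of logarithms so that
    large values (for example, contract costs) do not overflow.
    All values must be strictly positive.
    """
    logs = [math.log(v) for v in values]
    total = sum(logs)
    if total == 0:
        raise ValueError("ln of the product is zero; normalization undefined")
    return [lg / total for lg in logs]


def sector_factors(sector):
    """Encode a nominal sector as three 0/1 factors (PUB, SPUB, PRI)."""
    if sector not in SECTORS:
        raise ValueError(f"unknown sector: {sector!r}")
    return {name: int(name == sector) for name in SECTORS}


# Illustrative values only.
standardized = log_standardize([math.e, math.e ** 2, math.e ** 3])
encoded = sector_factors("SPUB")
```

With this form the standardized values of one variable sum to one, which gives a quick sanity check on the transformation.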

Data Augmentation
ANN methods require large amounts of data to obtain a reliable forecasting model. However, the data set here was relatively small and needed to be augmented. Pasini [28] introduced an "all frame" method to overcome small-data issues. The method makes the most of a small data set by dividing it into N subsets. Here, the total data were divided into ten subsets. One subset is considered testing data, indicated as a blue element in Figure 4, while the remaining subsets represent training data, shown as white elements in Figure 4. The ten training data sets (ten groups) were generated by varying the position of the testing subset, as shown in Figure 4. Ten ANN models with the same ANN structure were developed, one for each training data set.
Figure 4. The ten training data sets (groups).
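The grouping described above can be sketched in Python as follows. This is a minimal illustration, assuming that the subsets are contiguous blocks of project indices and that each group uses a different subset as testing data; the function name `make_groups` is hypothetical, and any shuffling or ordering used in the original study is not known here.

```python
def make_groups(n_samples, n_subsets=10):
    """Split indices 0..n_samples-1 into n_subsets contiguous subsets and
    return one (train_indices, test_indices) pair per subset."""
    base, extra = divmod(n_samples, n_subsets)
    subsets, start = [], 0
    for k in range(n_subsets):
        # The first `extra` subsets receive one additional element.
        size = base + (1 if k < extra else 0)
        subsets.append(list(range(start, start + size)))
        start += size

    groups = []
    for k, test in enumerate(subsets):
        # Training data are all subsets except the k-th one.
        train = [i for j, s in enumerate(subsets) if j != k for i in s]
        groups.append((train, test))
    return groups


# 135 projects, ten groups: each group holds out a different subset.
groups = make_groups(135)
```

Each project appears in the testing data of exactly one group and in the training data of the other nine.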

ANN Model Development
The ANN model was developed by performing two analyses (first and second), as shown in the following section.

First Analysis
The first analysis aims to detect the data that show significant differences between the observed and computed FCCD. This was accomplished by running the ANN models and then promoting (refining) the data, as detailed in the following sections.

Running ANN Models
As described in the overview of the main ANN components, the input layer consists of two kinds of data: scale data ($\overline{CC}_i$ and $\overline{CD}_i$) and factor data (PUB, SPUB, and PRI). $\overline{CC}_i$ and $\overline{CD}_i$ represent the standardized contract cost and standardized contract duration, respectively. The number of hidden layers was set to two, the maximum available in the IBM SPSS program. The number of neurons in each hidden layer was set to (2m + 1) [34], where m is the number of input-layer neurons, which was five. Therefore, the number of neurons per hidden layer was eleven. Because the hyperbolic tangent activation function performs better than the sigmoid activation function [35], the hyperbolic tangent was used as the activation function. The output layer has one neuron ($\overline{FCCD}_i$). The ANN model's structure is shown in Figure 5.
Ten groups were considered for the ANN model, one per training data set; therefore, ten ANN models were developed. Each ANN model was run several times to monitor the relative error (RE), which should settle close to a constant value to avoid overfitting [36,37]. The RE depends on the observed and computed $\overline{FCCD}_i$. Table 2 shows a portion of the input and output data used in ANN model 2.
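The network structure described above (five inputs, two hidden layers of 2m + 1 = 11 neurons with hyperbolic tangent activation, and one output neuron) can be sketched in plain Python. This is only an illustration of the layer sizes and the forward pass: the weights are random placeholders rather than trained values, the identity output activation is an assumption, and all function names and input values are hypothetical. The study itself used IBM SPSS, not this code.

```python
import math
import random


def hidden_size(m):
    """Number of neurons per hidden layer from the (2m + 1) rule."""
    return 2 * m + 1


def init_layer(n_in, n_out, rng):
    """Random weights and biases; placeholders for trained parameters."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    return weights, biases


def forward(x, layers):
    """Forward pass: tanh on hidden layers, identity on the output (assumed)."""
    a = list(x)
    last = len(layers) - 1
    for idx, (weights, biases) in enumerate(layers):
        # Weighted sum S for every neuron in the layer.
        s = [sum(w * v for w, v in zip(row, a)) + b for row, b in zip(weights, biases)]
        a = s if idx == last else [math.tanh(v) for v in s]
    return a[0]


rng = random.Random(0)
m = 5                      # standardized CC, standardized CD, PUB, SPUB, PRI
h = hidden_size(m)         # 11 neurons per hidden layer
layers = [init_layer(m, h, rng), init_layer(h, h, rng), init_layer(h, 1, rng)]

# Illustrative input: two standardized scale values and a public-sector project.
x = [0.0071, 0.0065, 1, 0, 0]
y = forward(x, layers)     # untrained output, standing in for the standardized FCCD
```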

Enhance the Used Data Accuracy
Some abnormal data have a detrimental effect on the accuracy of ANN models. Tools for checking the quality of the data used include the residual error and the absolute percentage error (APE). This paper identified abnormal data as those with an APE of more than 35%. These data were deleted from the training data sets to generate the modified training data sets used in the second analysis. The APE can be computed using Equation (5).
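The filtering step described above can be sketched in Python. The sketch assumes the standard definition of the absolute percentage error, APE = |observed − computed| / |observed| × 100, which is taken here to correspond to Equation (5). The function names and the sample records are hypothetical.

```python
def ape(observed, computed):
    """Absolute percentage error, in percent (standard definition, assumed)."""
    return abs(observed - computed) / abs(observed) * 100.0


def drop_abnormal(records, threshold=35.0):
    """Keep only (observed, computed) pairs whose APE does not exceed the threshold."""
    return [r for r in records if ape(*r) <= threshold]


# Illustrative (observed, computed) FCCD pairs in months; not the paper's data.
records = [(10.0, 12.0), (20.0, 30.0), (8.0, 7.5)]
kept = drop_abnormal(records)  # the 50% APE record is removed
```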

Second Analysis
The ten ANN models, with the same structure as in the first analysis, were run several times on the modified training data sets (ten modified data groups).

Evaluation Model
The results of the ANN models run on the modified data were used to measure the accuracy of each model with the mean absolute percentage error (MAPE). It can be computed using Equation (6), where $n_m$ is the number of modified data used in the ANN model. After that, the average of the ten MAPEs was computed. In addition, each model was tested with its test data to check its validity for new data by measuring the MAPE for each model.

Results and Discussions
Figure 6a,b shows the frequency distribution of the APE for the first and second analyses, respectively. Although the data with an APE greater than 35% were deleted after the first analysis, some data still had an APE greater than 35% in the second analysis. In addition, promoting the data had a positive effect not only on the data with high error (APE > 35%) but also on the data with low error, as shown in Figure 6b. In the second analysis, the frequency decreased as the APE value increased. In the first analysis, for some data the difference between the observed and computed FCCD was greater than the observed FCCD itself, as shown in Figure 6a. This difference means that the data may contain a significant number of abnormal data.
Figure 6. Frequency of APE of the first and second analyses for the ten ANN models: (a) first analysis; (b) second analysis.
Figure 5 shows the MAPE of the ten ANN models for the first and second analyses. The MAPE ranged from 27.5% to 32% for the first analysis and from 9.68% to 15.84% for the second analysis. The percentage of removed data, that is, data with a significant difference between the observed and computed FCCD (APE > 35%), was considerable, with minimum and maximum values of 28% and 42%, respectively, as shown in Figure 7. Although the percentage of removed data was high, the average MAPE decreased from 29.9% for the first analysis to 12.22% for the second analysis, which is close to the MAPE of high-accuracy models (MAPE ≤ 10%), as illustrated by [25,32,38].
Figure 8 shows the cumulative frequency of the APE. The share of data with an APE less than 10% varied from 48% to 65% among the ten ANN models, and the share with an APE less than 20% ranged from 75% to 81%. The spread of the cumulative frequency among the models decreased with increasing APE.
The MAPE results of the ten models for the testing data are shown in Figure 9. The MAPE ranges from 3.47% (ANN-8) to 26.91% (ANN-3). The average MAPE of the ten models was 14.92%, less than the allowable standard value of 20% [27].
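The MAPE values reported above aggregate the per-project APE within each model, and the ten per-model values are then averaged. A minimal Python sketch of both steps follows; it assumes the standard definition MAPE = (100 / n) · Σ |observed − computed| / |observed|, and the function names and sample numbers are hypothetical rather than taken from the paper.

```python
def mape(observed, computed):
    """Mean absolute percentage error, in percent, over paired values."""
    if len(observed) != len(computed) or not observed:
        raise ValueError("observed and computed must be non-empty and equal in length")
    total = sum(abs(o - c) / abs(o) for o, c in zip(observed, computed))
    return total / len(observed) * 100.0


def average(values):
    """Arithmetic mean, used here to average the per-model MAPE values."""
    return sum(values) / len(values)


# Illustrative observed and computed durations (months) for one model.
model_mape = mape([10.0, 20.0, 8.0], [12.0, 18.0, 7.5])

# Illustrative per-model MAPE values for several models, then their average.
overall = average([model_mape, 10.0, 14.0])
```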
The MAPE of three linear regression models (LR1, LR2, and LR3) was compared with the average MAPE of the ANN models in the first and second analyses. LR1 is a linear regression performed on the 135 data sets without any transformation. LR2 and LR3 are linear regressions on the 135 data sets transformed by the square root and logarithmic functions, respectively. The equations for LR1, LR2, and LR3 are shown in Table 3. The results indicate that LR1 gives a high MAPE of 143%, which indicates a low-accuracy model. The LR2 and LR3 models gave MAPEs of 43.6% and 39.23%, respectively, as shown in Figure 10. These results agree with those of [23,39], who stated that a regression model on data transformed by the logarithmic function is more accurate than one using the other functions. However, all three linear regressions had low accuracy, given their high MAPE values, as shown in Figure 10. The average MAPE for the first and second analyses was 29.9% and 12.22%, respectively, which is lower than that of LR3 by 9.3% and 27.03%, respectively. The paper's contribution is to increase the accuracy of the ANN model in predicting the FCCD from relatively small data by using three data preparation methods (standardizing, augmenting, and promoting the data).
Table 3. Description of the different regression models.

Model
Regression Formula
FCD = exp(−5.898 + 1.227 ln CD + 0.085 ln CC + 7.532 Sector)

Figure 9. MAPE of ANN models for test data.

To study the performance of the FCCD prediction, ANN-5 was considered because its MAPE for the training and testing stages of the second analysis is close to the average among the models. Figure 11 shows the variation of the FCCD with CD for the public, semi-public, and private sectors. The FCCD is overestimated for the public and semi-public sectors and underestimated for the private sector. These results were compared with a study by Skitmore and Ng [23] of residential Australian construction projects, in which the contract type is lump sum, similar to Saudi construction. The Skitmore and Ng curve lay between the public and semi-public curves for CD greater than ten months and was close to the semi-public curve for CD less than ten months.
The accuracy of the ANN models was compared with different studies. The CBR model developed by Jin et al. [22] considered geometry, building information, foundation system, subsoil condition, and roof type. Its average APE was 5.74%, which is smaller than the MAPE of the ANN model. The CBR model was used for building projects, while the ANN model used projects of several types, such as building, highway, electrical, and mechanical projects. Therefore, the margin of error may be increased. The EVM developed by [21] provides a MAPE of 12.96%, which is close to the MAPE of the ANN model in this paper (12.22%). On the other hand, the MAPE of the ANN model developed by Pewdum et al. [15] was 6.2% on average, which is less than the MAPE of the ANN developed here. It is assumed that the input parameters of the Pewdum et al. ANN model represent the working start date, CD, % actual completion, evaluating date, and % planned completion. These parameters are closely related to the FCCD. However, some of them are not available in the pre-tendering phase.

Conclusions
Early-stage estimation of the final construction contract duration is crucial for the progress and success of a project. Data from 135 Saudi projects were used to create the ANN models in this paper. The development of the ANN model consisted of three phases. The first phase was to collect the data, process them using Zavadskas and Turskis' logarithmic standardization method, augment them using the method introduced by [33], and generate ten training data sets. In the second phase (the first analysis), ten ANN models were developed to detect the data with a large APE. These data were then deleted to obtain the ten modified training data sets used in the third phase (the second analysis). The main findings revealed that the average MAPE of the ANN models in the second analysis was 12.22%, indicating high-to-good model accuracy. In addition, the ANN model performs better than the LR models, even the linear model fitted to logarithmically transformed data. The paper supports research that works with comparatively little data and integrates it into a neural network.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The raw data supporting the findings of this paper are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.
Appendix A Table A1.