1. Introduction
Infrastructure construction is essential for improving quality of life; however, the planning of such projects faces significant challenges due to discrepancies in time and cost relative to the established baseline [
1,
2]. The construction sector is inherently exposed to risks and unforeseen events. Therefore, meticulous planning is required, incorporating a range of contingencies to prevent delays in project delivery. Failure to do so may result in additional costs and even project suspension [
3]. Addressing time and cost discrepancies in infrastructure construction therefore requires robust planning strategies that anticipate risks and integrate contingencies. This proactive approach is essential to ensure timely delivery, cost control, and project continuity.
Infrastructure projects frequently experience significant cost and time overruns across different regions and over time. Several studies have reported the magnitude of these overruns. For instance, in developing countries such as Ethiopia, cost overruns in infrastructure projects range from 2% to 248%, with an average of 35% [
4]. In the case of educational buildings in Ghana, there are notable differences between the actual completion cost and the initial contract award cost, with a mean cost overrun of 23.7%, ranging from 0.2% to 95.9% [
5]. Similarly, public infrastructure projects in Portugal show an average cost overrun of 19% [
6]. In the United Kingdom, time overruns were reported at 48.5%, while cost overruns reached 41.2% [
7]. In the Philippines, a study of 85 road projects found a mean cost overrun of 5.4% [
8]. In Tanzania, road projects showed an average cost overrun of 44% [
9]. In Colombia, researchers analyzed rural road construction projects and identified a mean cost deviation of 8% and a mean time deviation of 19% [
10]. Other studies focusing on frequency found that in the United States, 55% of projects experienced cost overruns [
11]. In Ghana, a study found that 70% of projects faced delays [
12]. Another study also highlighted that nine of ten projects typically experience cost overruns, underscoring the global nature of the issue. Notable examples include the Channel Tunnel in the UK (80% overrun), the Great Belt Link in Denmark (54%), and several Korean megaprojects with an average cost increase of 122.4% [
13]. An analysis of 662 energy infrastructure projects across 83 countries shows that more than three-fifths of the projects experienced cost overruns [
14]. These findings underscore the global and persistent nature of cost and time overruns in infrastructure projects, highlighting the urgent need for improved planning, risk management, and accountability mechanisms across regions.
Several methodologies have been employed to analyze the issue, many of which have been replicated across multiple studies. Notably, a significant portion of these approaches rely on stakeholder perceptions and expert opinions [
15,
16]. The repeated use of these methods, along with the reliance on stakeholder opinions as the primary source of information, has become a common practice across various research frameworks. Based on stakeholder opinions, a ranking of the causes is created and/or the responsible parties are analyzed [
17,
18,
19,
20]. However, this approach may present issues of subjectivity or even conceal or distort reality due to potential conflicts of interest. It could also lead to perception-related problems, given that responses depend on the quality of the questions and the stakeholder’s willingness to respond assertively [
21].
Therefore, other approaches have been implemented, such as the analysis of empirical information from project documentation. For international projects, studies have identified several root causes of poor performance, including complex stakeholder environments, inadequate planning and coordination, cultural and institutional differences between donors and host countries, and a lack of routine maintenance and follow-up mechanisms [22]. Another study, based on a data source for highway construction projects, used regression analysis and Principal Component Analysis (PCA). The research results indicate a correlation between the reciprocal of project budget size and the percentage of cost overrun [
3]. A more recent study presents a robust quantitative analysis based on a large dataset of 1091 public transport infrastructure projects in Portugal, spanning from 1980 to 2012. The methodology builds on existing approaches but introduces a more ambitious and innovative framework by incorporating exogenous determinants—namely political, legal and regulatory, and economic factors. The analysis reveals that these exogenous variables, previously undervalued in the literature, play a significant role in explaining cost deviations. The study employs endogenous models, including OLS, GLM, Tobit, and Probit [
23]. Using machine learning techniques, a study introduces a methodology for predicting cost overrun levels in public-sector construction projects. In recent years, Machine Learning (ML) techniques have been used to uncover complex and non-linear relationships between variables across various fields of study, rank the importance of variables in complex problems, and reduce computational costs [
24,
25,
26,
27,
28,
29,
30,
31]. The recent literature highlights the versatility of machine learning (ML) across diverse domains. For instance, one study presents advanced traffic modeling techniques, including macroscopic models enhanced by ML to better understand congestion dynamics [
32]. Similarly, another study applies ML methods—such as bag-of-words and SVM classifiers—for automatic detection and tracking of cells in biology [
33]. Although these studies address different fields, they demonstrate how ML can uncover complex patterns and support data-driven decision-making.
Likewise, machine learning techniques have been increasingly used in recent years to address various problems related to construction management [
34,
35]. By applying a stacking ensemble learning approach—recognized for its robustness in classification tasks—Williams and Gong [
36] proposed a model that integrates multiple classifiers to enhance prediction accuracy. The ensemble achieved an average accuracy of 61.41% across five runs, outperforming individual models such as RIDOR and K-Star. This methodology enables the early identification of high-risk projects during the bidding phase, thereby supporting more informed decision-making prior to contract finalization. Hamdan et al. [
37] also used machine learning techniques to predict cost overruns in construction projects in Jordan. Exploring the effectiveness of these techniques in forecasting cost overruns in public construction projects, the study applied 15 regression-based ML algorithms to a dataset of 836 projects in Jordan. Among the tested models, CatBoost achieved the highest predictive performance. The most influential features in predicting cost overruns were change orders (41.16%), excessive quantities (21.86%), and budgeted costs (20.96%). These findings demonstrate the potential of ML models to support proactive cost management and decision-making in construction project planning and execution.
Regarding the causes of cost and time overruns, researchers analyzed 73 articles and confirmed that this issue is global in nature. The study reveals that the project execution phase is the primary stage where cost and time deviations occur, highlighting the critical role of project management professionals. A total of 73 causes were identified and grouped into six categories: design, project management, owner management, resources, external factors, and government and society [
16]. A comprehensive study aggregated findings from a selection of 40 journal articles and identified 173 distinct causes of cost and time overruns. These were grouped into categories such as communication, financial problems, management, materials, organizational, project-related, psychological factors, and weather conditions. Additionally, after reviewing 405 articles, researchers identified seven critical drivers of cost overruns in the global construction industry: planning and scheduling issues, project estimation inaccuracies, design inefficiencies, adverse weather conditions, scope definition challenges, contractual ambiguities, and unforeseeable site conditions [
38]. Focusing specifically on cost overruns, an analysis of 48 journal articles identified 79 causes. The top causes include: design problems and incomplete designs, inaccurate estimation, poor planning, adverse weather, poor communication, stakeholder skills, experience and competence, financial problems/poor financial management, price fluctuations, contract management issues, and ground/soil conditions [
39]. The extensive body of research confirms that cost and time overruns in construction projects are driven by a wide range of factors. These causes span technical, managerial, financial, environmental, and human dimensions, underscoring the need for comprehensive and proactive project management strategies to mitigate their impact.
The previously presented arguments demonstrate that infrastructure construction faces multiple deficiencies that require effective mitigation measures. Over time, deviations in timelines and costs have been observed across various types of projects and regions. Infrastructure development is essential for a country’s progress and for achieving the Sustainable Development Goals—particularly Goal 4, “Quality Education,” which aims to leave no one behind and promote access to education [
40]. In this context, educational infrastructure projects play a fundamental role, especially in developing countries. Researchers around the world have identified various factors contributing to poor performance in the construction of educational facilities. In Nigeria, for example, the key causes of project abandonment in public tertiary institutions have been studied to enhance the delivery of teaching and research infrastructure [
41]. In South Africa, researchers have found that delays in educational infrastructure projects are contributing to the existing backlog in the delivery of basic education facilities. Based on their findings, they propose that a one-stop center would improve coordination among government systems and agencies, helping to prevent delays in preconstruction project planning [
42]. As mitigation measures, data-driven models have been proposed to address the causes of cost overruns in educational infrastructure, using structural equation modeling (PLS-SEM). These models offer practical insights for policymakers and project managers to reduce cost overruns and enhance project performance in the education sector. Specifically, bid evaluation and project planning improve efficiency and reduce design-related issues, while project initiation and contractor selection help mitigate claim management issues, estimation and scheduling problems, and contract management inefficiencies [
43]. In this context, although numerous studies have addressed time and cost deviations in infrastructure projects, most have focused on sectors such as transportation and relied on expert-based methodologies. In the specific case of educational infrastructure—particularly in developing countries—there is a lack of empirical research using open data to identify patterns and causes of deviations. This study aims to fill that gap through a quantitative approach based on Colombia’s public datasets. The research problem centers on identifying the factors that explain time and cost deviations in educational infrastructure projects. The central hypothesis posits that variables associated with different stages of the project cycle significantly influence these deviations.
The present study will focus on analyzing the open data available on educational infrastructure in Colombia, with the aim of identifying the factors that generate time and cost deviations in educational infrastructure projects. This analysis will allow for the comparison of results with international studies and the proposal of concrete measures to mitigate these impacts in projects that are essential for the country. The study followed the CRISP-DM methodology [
44], structuring the data science project through the following phases: First, the business understanding phase defined the objective of analyzing the contracting of educational infrastructure projects in Colombia, collecting and exploring open datasets to assess their relevance and quality. In the data preparation phase, the information was cleaned and organized for analysis. The modeling phase included the development of bivariate and multivariate statistical models to identify patterns and relationships between time and cost deviations and the different factors identified. Finally, the evaluation and deployment phases focused on interpreting the results and drawing conclusions to guide future decision-making and propose strategies to mitigate time and cost deviations in educational infrastructure projects.
2. Materials and Methods
The data for this study were sourced from SECOP II, Colombia’s public procurement platform, which provides open access to contractual information. The research focused on 175 finalized and settled contracts related to educational infrastructure projects executed between 2017 and 2022. The dependent variables (time deviation and cost deviation) were not directly available in the SECOP II database and were therefore calculated using the following Equations (1) and (2) [
45]:
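As an illustration only, a percentage deviation is conventionally the change from the planned value divided by the planned value. The sketch below assumes that convention for both time and cost; the function name and the example contract are hypothetical, and Python is used for illustration although the study itself was carried out in R.

```python
def percent_deviation(initial: float, final: float) -> float:
    """Percentage change of `final` relative to `initial` (assumed convention)."""
    if initial <= 0:
        raise ValueError("initial value must be positive")
    return (final - initial) / initial * 100.0


# Hypothetical contract: planned for 120 days and finished in 180 days,
# awarded at 1000 units of value and settled at 1250 units.
time_dev = percent_deviation(initial=120, final=180)
cost_dev = percent_deviation(initial=1000.0, final=1250.0)
print(time_dev, cost_dev)
```

Under this convention a value of zero means the contract closed exactly on its baseline, and positive values are overruns.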
Regarding the independent variables, ten were identified, corresponding to different project characteristics. Each variable was classified as either numerical or categorical: six numerical variables and four categorical variables.
Table 1 lists the variables obtained from the Colombian Open Data platform. The numerical variables are presented first, followed by the categorical ones. Additionally, the variables project intensity and growth rate were calculated based on the original data.
This study employed a structured quantitative research methodology to investigate the relationships between cost and time deviations and various project performance indicators, following the CRISP-DM approach [44]. The work was grouped into three main analytical stages: Exploratory Data Analysis (EDA), bivariate analysis, and multivariate analysis combined with machine learning. All analyses were conducted using R software (R Core Team, 2025) [47]. Colombia’s Open Data platform allows users to filter for construction projects that have been completed. Then, using keywords, it is possible to identify educational infrastructure construction projects. For the analysis, only those projects that showed at least one deviation, in time, in cost, or in both, were selected. The research method included the following stages: Exploratory Data Analysis (Section 2.1), bivariate analysis (Section 2.2), multivariate analysis (Section 2.3), and results analysis (Section 2.4). A summary of the analytical methods is included in Table 2.
2.1. Exploratory Data Analysis
The initial stage involved an Exploratory Data Analysis to gain a comprehensive understanding of the dataset and ensure its suitability for further statistical modeling [48, 49]. This process included: (i) descriptive statistics to summarize central tendency and dispersion for key numerical variables such as contract value, cost deviation, and time deviation; (ii) visualizations such as boxplots to explore the distribution of both numerical and categorical variables; (iii) identification of extreme outliers from the boxplots based on the interquartile range (IQR), with observations outside the bounds Q1 − 3 × IQR and Q3 + 3 × IQR excluded to improve model robustness [50]; (iv) Spearman correlation matrices, with statistically significant correlations (p ≤ 0.05) visualized in a heatmap to identify relevant relationships among variables; and (v) classification of variables as numerical or categorical, guiding the selection of appropriate statistical tests in subsequent analyses.
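The extreme-outlier rule in step (iii) can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function names and sample values are hypothetical, and the "inclusive" quantile method is an assumption chosen because it interpolates linearly between order statistics.

```python
from statistics import quantiles


def extreme_outlier_bounds(values, k=3.0):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_extreme_outliers(values, k=3.0):
    """Keep only the observations that fall inside the fences."""
    low, high = extreme_outlier_bounds(values, k)
    return [v for v in values if low <= v <= high]


# Hypothetical contract values with one extreme observation.
sample = [10, 12, 13, 15, 16, 18, 20, 22, 500]
bounds = extreme_outlier_bounds(sample)
kept = drop_extreme_outliers(sample)
print(bounds, kept)
```

Using k = 3 instead of the more common 1.5 removes only the most extreme observations, which matters when the sample is small.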
2.2. Bivariate Analysis
In the second stage, a bivariate analysis was conducted to examine the individual relationships between the dependent variables (cost deviation and time deviation) and each independent variable. The Lilliefors test [51] was applied to assess the normality of numerical variables. Given the lack of normality (p < 0.05), non-parametric tests were used.
Spearman correlation analysis was performed to identify monotonic relationships among numerical variables [52]. The coefficients were computed in R, and the results were visualized through correlation plots highlighting statistically significant associations. Cost and time deviations were compared against all numerical variables and also with each other to explore potential interdependencies. Associations were considered statistically significant when the p-value was less than 0.05.
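To make the rank-based nature of the coefficient concrete, the following sketch computes Spearman's rho as the Pearson correlation of average ranks. It is an illustration in Python with hypothetical function names and data, not the R routine used in the study, and it does not compute the p-value.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# A monotonic but non-linear relationship still yields rho close to 1.
rho = spearman_rho([1, 2, 3, 4, 5], [2, 4, 8, 16, 32])
print(rho)
```

Because only ranks enter the calculation, the coefficient is insensitive to the extreme values that remain in skewed variables such as time deviation.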
Kruskal–Wallis tests [53] were applied to evaluate differences in cost and time deviations across the categories of each categorical variable. Significant results (p < 0.05) indicated meaningful group differences. Pairwise comparisons were then conducted using the Wilcoxon Mann–Whitney test [54] with Bonferroni adjustment to identify which specific groups differed. Minimum, maximum, mean, median, and standard deviation values were calculated for each category to support interpretation.
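The two ingredients of this step, the Kruskal–Wallis statistic and the Bonferroni adjustment, can be sketched as below. This is an illustrative Python version with hypothetical function names and toy data; it returns the tie-corrected H statistic only (the p-value would come from a chi-squared distribution with one fewer degrees of freedom than the number of groups) and is not the R code used in the study.

```python
from collections import defaultdict


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Average rank of every distinct value in the pooled sample.
    positions = defaultdict(list)
    for pos, v in enumerate(pooled, start=1):
        positions[v].append(pos)
    rank = {v: sum(p) / len(p) for v, p in positions.items()}
    rank_term = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    ties = sum(len(p) ** 3 - len(p) for p in positions.values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
adjusted = bonferroni([0.01, 0.04, 0.5])
print(h, adjusted)
```

With three categories there are three pairwise comparisons, so each raw p-value is tripled before being compared with 0.05.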
2.3. Machine Learning
To explore non-linear relationships, a Random Forest regression model was implemented using the randomForest package in R [55]. The process began with data preparation, which involved refining the dataset by excluding extreme outliers [50] and selecting relevant predictors.
While Random Forest is commonly employed as a predictive modeling technique, in this study it was used primarily as a tool for ranking variable importance, rather than for building a high-performing predictive model. Given the relatively small sample size (150 finalized contracts), the objective was not to achieve predictive accuracy but to uncover non-linear relationships and assess the relative influence of each variable on cost and time deviations. The variable importance measures derived from Random Forest—such as Mean Decrease in Accuracy—are widely recognized for their robustness in exploratory analyses and feature selection, even when predictive performance is not the primary goal [
56,
57]. This approach allowed us to complement traditional statistical methods and identify key drivers of project performance in a data-driven and interpretable manner.
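The idea behind permutation-based importance can be illustrated without a forest: shuffle one predictor, re-score the model, and record how much the error grows. The sketch below is a simplified Python illustration under stated assumptions. It uses a fixed stand-in predictor instead of a Random Forest, scores on the full sample rather than on out-of-bag observations, and all names and data are hypothetical.

```python
import random


def mse(predict, rows, targets):
    """Mean squared error of `predict` over the given rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, feature, repeats=20, seed=0):
    """Average increase in MSE after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, targets)
    increases = []
    for _ in range(repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        shuffled = [{**r, feature: v} for r, v in zip(rows, column)]
        increases.append(mse(predict, shuffled, targets) - baseline)
    return sum(increases) / repeats


# Hypothetical data: the target depends on suspended_time only.
rows = [{"suspended_time": s, "n_bidders": b}
        for s, b in [(0, 3), (5, 1), (10, 8), (20, 2), (40, 6), (80, 4)]]
targets = [2.0 * r["suspended_time"] for r in rows]


def stand_in_model(row):
    # Stand-in for a fitted model; it ignores n_bidders entirely.
    return 2.0 * row["suspended_time"]


imp_susp = permutation_importance(stand_in_model, rows, targets, "suspended_time")
imp_bid = permutation_importance(stand_in_model, rows, targets, "n_bidders")
print(imp_susp, imp_bid)
```

A predictor the model never uses scores zero, while a predictor that drives the target scores high, which is why the ranking remains informative even when overall predictive accuracy is modest.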
Following data preparation, a custom function was developed to perform hyperparameter tuning. This function evaluated the out-of-bag mean squared error (OOB-MSE) across different values of the mtry parameter, which determines the number of predictors considered at each split in the Random Forest. The configuration that yielded the lowest OOB-MSE was selected for model training.
The Random Forest model was then trained using 199 trees and the optimal mtry value. The dependent variables were cost deviation and time deviation, while all other selected variables served as predictors. To assess the model’s performance, OOB error plots were generated, and variable importance metrics were calculated [58]. The varImpPlot() function was used to visualize the most influential predictors contributing to time deviation.
This machine learning approach provided a robust and flexible framework for understanding the multifactorial nature of project deviations, complementing the insights obtained from classical statistical methods.
2.4. Results Analysis
The final stage of the methodology involved synthesizing the findings from the exploratory, bivariate, and machine learning analyses to derive meaningful insights into the factors influencing cost and time deviations in educational infrastructure projects and identifying the most significant variables. This integrative approach allowed for a comprehensive understanding of relationships within the dataset. Additionally, the results were compared with findings from previous studies to validate the consistency of observed patterns and highlight contextual differences specific to Colombia’s public procurement environment.
3. Results
This section presents the results obtained from each stage of the research methodology used to address the research questions. First, an Exploratory Data Analysis (EDA) was conducted (Section 3.1). This was followed by a bivariate analysis (Section 3.2), in which cost deviation and time deviation were treated as dependent variables and related to each independent variable in turn. Finally, a multivariate analysis was performed (Section 3.3), using the same dependent variables and incorporating all independent variables with their interactions.
3.1. Exploratory Data Analysis
An initial Exploratory Data Analysis (EDA) was conducted to gain a comprehensive understanding of the dataset, including the structure and characteristics of each variable. This stage aimed to detect potential biases or anomalies that could influence the validity of the results, and to classify the variables according to their statistical nature and role within the analytical framework [59]. Key steps included summarizing descriptive statistics, visualizing distributions through histograms and boxplots, checking for missing values (none were found) and outliers (which were present), and analyzing correlations between variables using scatter plots and correlation matrices. Additionally, variable types were classified to inform subsequent modeling decisions. These procedures were implemented using R software [47].
For the numerical variables, statistical measures such as minimum, maximum, average, and median were calculated. Additionally, histograms and boxplots were created to analyze their distribution. This analysis enabled the identification of outliers that could potentially skew the results; these outliers were examined individually for each variable. Regarding the presence of outliers, the research team decided to eliminate the extreme outliers in the contract value variable, specifically those observed above the red dotted line. This decision was made considering the importance of including similar projects and the significant differences in project sizes. See
Figure 1. To avoid bias, 25 projects with extreme contract values that could affect the results were excluded.
The univariate analysis for the dependent variables is included in Table 3. In Colombia, legislation does not permit cost deviations greater than 50% of the project’s original value. As a result, the values for this variable are considerably lower than those for time deviations: the average cost deviation is 22.17%, while the average time deviation is 77.73%. Time deviations exhibit significant outliers, with some projects reaching values as high as 800%. Some projects have extremely short estimated durations, such as nine days, which affects their execution; such short timelines can create the need for additional resources or extensions.
Histograms and boxplots were used to analyze the behavior of these variables and to identify outliers. A clear difference between the two variables can be observed: outliers appear only for time deviation. The plots for cost deviation illustrate the distribution of cost variations across educational infrastructure projects. Two prominent peaks are observed at 0% and 50% deviation, indicating that a large number of projects either strictly adhered to their original budgets or reached the maximum legally permitted deviation in Colombia. Intermediate values (10%, 20%, 30%, and 40%) show significantly lower frequencies, which may reflect the influence of regulatory constraints and budgetary control practices. The plots for time deviation show the distribution of percentage deviations across the dataset, ranging from 0% to 800%. The data reveal a strong concentration of observations within the 0–100% deviation range, suggesting that most time-related measurements remained relatively close to their expected values. This pattern indicates generally consistent performance, with only a minority of cases exhibiting significant deviations. See Figure 2.
Considering that only projects reporting either cost or time deviation—or both—were included, and that suspension time was one of the variables analyzed, the project’s performance in relation to these variables was assessed. The results are presented in
Figure 3.
The COST_SCHEDULE category (66 projects) is the most frequent, indicating that the largest share of projects (44% of the 150 analyzed) experienced both cost and schedule deviations. The SCHEDULE category (32 projects) shows that a significant number of projects had schedule deviations only, even when costs remained within expected limits. The COST category (28 projects) represents projects with cost deviations alone, which were also prevalent, though slightly less than schedule-only deviations. The COST_SCHEDULE_SUSP category (14 projects) includes those that faced all three issues (cost overruns, schedule delays, and suspension periods), highlighting severe performance problems. Finally, the SCHEDULE_SUSP category (10 projects) is the smallest group, consisting of projects that experienced schedule delays along with suspension time, but without cost deviations. The high number of projects with both cost and schedule deviations suggests that these issues often occur together, possibly due to poor planning, underestimation, or external factors. Suspension-related categories (SCHEDULE_SUSP and COST_SCHEDULE_SUSP) are less frequent but still notable, indicating that interruptions in execution contribute to performance problems. Projects with only one type of deviation (COST or SCHEDULE) are common, but less so than those with combined issues.
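The category counts reported above can be converted into shares with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable names are illustrative, and Python is used for convenience rather than the R environment of the study.

```python
# Project counts per deviation category, as reported in the text.
counts = {
    "COST_SCHEDULE": 66,
    "SCHEDULE": 32,
    "COST": 28,
    "COST_SCHEDULE_SUSP": 14,
    "SCHEDULE_SUSP": 10,
}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Projects with any schedule deviation, and with any cost deviation.
with_schedule = sum(n for name, n in counts.items() if "SCHEDULE" in name)
with_cost = sum(n for name, n in counts.items() if "COST" in name)

print(total, shares, with_schedule, with_cost)
```

Read this way, schedule deviations affect 122 of the 150 projects and cost deviations affect 108, so delays are the more widespread of the two problems.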
The univariate analysis for the independent variables was then carried out. To standardize contract values and account for inflation, all monetary figures were converted into Colombian legal monthly minimum wages (SMMLV, denoted MS*) based on the year of contract signing, which allowed for consistent comparison across contracts. Variables were classified according to the project life cycle and categorized as either numerical or categorical. The univariate analysis for all the numerical variables is presented in Table 4. Considerable differences can be observed in the initial contract values and durations. Regarding award growth, this variable only presents negative values, as it analyzes the difference between the estimated contract value during the public procurement stage and the awarded value of the contract to the winning bidder. Project intensity measures the amount of money estimated to be invested per unit of time (in this case, days). Since these are similar projects, such large differences in this variable should not occur. Likewise, for suspension time, both the minimum value and the median are zero. This variable was estimated by considering the percentage of time the project was suspended relative to its initial duration. Finally, the variable number of bidders reflects the level of competition in the procurement process. In the dataset, this number ranges from 1 to 165. Notably, fifteen procurement processes had only one bidder.
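The derived variables described above can be sketched as simple ratios. The code below is an illustration under stated assumptions: the wage table holds placeholder numbers rather than official SMMLV figures, the sign convention for award growth (awarded minus estimated, so that lower awards give negative values) is inferred from the text, and all function names are hypothetical. Python is used for illustration; the study itself used R.

```python
# Placeholder monthly minimum wages by signing year (NOT official figures).
PLACEHOLDER_SMMLV = {2020: 900_000.0, 2021: 950_000.0}


def to_smmlv(value_cop, year, wage_table):
    """Express a peso amount in minimum wages of the signing year."""
    return value_cop / wage_table[year]


def project_intensity(contract_value, initial_days):
    """Money planned to be invested per day of the initial duration."""
    return contract_value / initial_days


def suspended_time_pct(suspended_days, initial_days):
    """Suspension time as a percentage of the initial duration."""
    return suspended_days / initial_days * 100.0


def award_growth_pct(estimated_value, awarded_value):
    """Awarded value relative to the estimate; negative when awarded lower."""
    return (awarded_value - estimated_value) / estimated_value * 100.0


# Hypothetical contract signed in 2020.
value_ms = to_smmlv(450_000_000, 2020, PLACEHOLDER_SMMLV)
intensity = project_intensity(value_ms, initial_days=100)
suspension = suspended_time_pct(suspended_days=30, initial_days=120)
growth = award_growth_pct(estimated_value=1000.0, awarded_value=900.0)
print(value_ms, intensity, suspension, growth)
```

Dividing by the wage of the signing year keeps contracts from different years comparable without needing a separate inflation index.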
Next, the univariate analysis for categorical variables was performed through bar plot graphs.
Figure 4 illustrates the variability in frequencies across these variables, revealing significant differences. The categorical variables encompass elements of diverse nature, including year of execution, region, type of procurement, and type of project. Regarding the year of execution, there is a peak in 2020 with 63 projects, followed by 2021 with 42, which may reflect an increase in investment in educational infrastructure during that period. As for the type of process, most projects were awarded through abbreviated selection (89), followed by public bidding (56), while direct contracting was infrequent (6), suggesting a preference for more competitive mechanisms. In Colombia, contracting processes include competitive bidding, in which a contractor is selected under equal opportunity; abbreviated selection, a simplified process carried out after a public tender has been declared void; and minimum contract, the quickest and most straightforward procedure for low contract values [60]. In terms of region, the Andina Region concentrates the majority of projects (140), which could be related to population density and educational demand in that area. Other regions were grouped. Finally, the project type chart indicates that most projects involve improvements to existing infrastructure (123), while new construction projects are less common (27), suggesting a focus on rehabilitation and optimization of existing facilities.
3.2. Bivariate Analysis
This section includes the results of the bivariate analysis comparing time and cost deviations with each of the independent variables. Considering that the dependent variables are numerical, Spearman’s Rho was first calculated for the numerical independent variables. Then, the Kruskal–Wallis test was applied to the categorical variables, complemented by the Wilcoxon Mann–Whitney test for those that showed statistical significance.
The correlation matrix is presented in Figure 5, with positive correlations in blue and negative correlations in red. Only significant correlations are included (p-value < 0.05). A positive correlation is observed between time and cost overruns, suggesting that project delays are often accompanied by cost overruns, highlighting the interconnected nature of time and budget management. Overall, the matrix highlights complex interdependencies between cost and time deviations and aspects of project planning and execution, offering valuable insights for improving planning and oversight in public infrastructure development.
Evaluating the correlations of COST_DEVIATION with other variables helps identify factors potentially associated with cost overruns. Positive correlations included INITIAL_DURATION (0.39), the highest one, suggesting that projects with longer initial durations tend to experience greater cost deviations; CONTRACT_VALUE (0.27), suggesting that projects with higher contract values tend to have greater cost deviations; and NUMBER_BIDDERS (0.24), which could imply that greater competition in the bidding process is associated with higher cost deviations. AWARD_GROWTH shows a negative correlation of −0.36; however, it is important to note that this variable only contains negative or zero values. Therefore, this result suggests that contracts awarded far from the estimated value during the procurement process are associated with greater cost deviations. When analyzing AWARD_GROWTH in relation to the number of bidders, a similar negative correlation of −0.36 is observed. This means that the higher the number of bidders (that is, the greater the competition in the contracting process), the lower the AWARD_GROWTH value tends to be. Since this variable only takes negative or zero values, this indicates that as the number of bidders increases, the contract is awarded at a price further from the estimated value established during the bidding stage.
Regarding TIME_DEVIATION, a positive correlation is observed with SUSPENDED_TIME_P, showing the highest value (0.43). This suggests that longer suspension periods are associated with greater delays in project completion, indicating that suspensions significantly disrupt project timelines and lead to extended durations beyond the original schedule. A positive correlation is also found with PROJECT_INTENSITY (0.37), suggesting that projects with higher financial intensity per unit of time tend to experience greater delays. This may imply that high-pressure projects—those requiring significant execution within tight timeframes—are more prone to time overruns.
Finally, CONTRACT_VALUE shows a positive correlation with time deviation, indicating that higher-value contracts tend to experience greater delays.
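The correlation screening described above can be illustrated with a minimal sketch. It assumes SciPy is available and uses synthetic data; the arrays `initial_duration`, `cost_deviation`, and `unrelated`, the helper `significant_spearman`, and the 0.05 threshold argument are illustrative stand-ins, not the study's dataset or code.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the study's variables (NOT the real data).
initial_duration = rng.uniform(30, 720, n)                      # days
cost_deviation = 0.03 * initial_duration + rng.normal(0, 8, n)  # percent
unrelated = rng.normal(0, 1, n)

def significant_spearman(x, y, alpha=0.05):
    """Return Spearman's rho, its p-value, and whether p < alpha."""
    rho, p_value = spearmanr(x, y)
    return rho, p_value, p_value < alpha

# Keep only the pairs that pass the significance filter, as in a
# correlation matrix that displays significant entries only.
candidates = {"INITIAL_DURATION": initial_duration, "UNRELATED": unrelated}
kept = {}
for name, values in candidates.items():
    rho, p_value, keep = significant_spearman(values, cost_deviation)
    if keep:
        kept[name] = round(float(rho), 2)
```

Spearman's Rho works on ranks, so it captures any monotonic association rather than only linear ones, which is one reason it suits skewed deviation data.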
Then, a bivariate analysis was conducted to examine the relationship between cost and time deviation and the categorical variables. For COST_DEVIATION, the Kruskal–Wallis test identified the significant variables listed in
Table 5, namely process type and project type (
p-value < 0.05). To complement this analysis, the Wilcoxon–Mann–Whitney test was applied to compare pairs of categories, providing a more detailed understanding of where the differences lie.
In the case of process type, three categories were considered: simplified selection, public bidding, and direct hiring. The Wilcoxon–Mann–Whitney test revealed that public bidding (highlighted in red) behaved differently from the other two categories, which showed similar patterns and could be grouped together. Specifically, public bidding reported higher values for both the median (40.05) and mean (31.84) of cost deviation. Regarding project type, construction improvement projects reported higher values (highlighted in red) for both the median (26.59) and mean (25.05).
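The two-step procedure (a global Kruskal–Wallis test followed by pairwise Wilcoxon–Mann–Whitney comparisons) can be sketched as follows. This is an illustration under stated assumptions: the group values are synthetic, the function name `kruskal_then_pairwise` is hypothetical, and no multiple-comparison correction is applied because the text does not state one.

```python
from itertools import combinations

import numpy as np
from scipy.stats import kruskal, mannwhitneyu

def kruskal_then_pairwise(groups, alpha=0.05):
    """groups: dict mapping category name -> 1-D array of deviations.

    Runs a global Kruskal-Wallis test; only if it is significant,
    compares every pair of categories with a two-sided Mann-Whitney test.
    """
    h_stat, p_global = kruskal(*groups.values())
    pairwise = {}
    if p_global < alpha:
        for a, b in combinations(groups, 2):
            _, p_pair = mannwhitneyu(groups[a], groups[b],
                                     alternative="two-sided")
            pairwise[(a, b)] = p_pair
    return h_stat, p_global, pairwise

rng = np.random.default_rng(1)
# Synthetic cost deviations (percent) per process type, NOT the study data.
groups = {
    "public_bidding": rng.normal(35, 10, 60),
    "simplified_selection": rng.normal(15, 10, 60),
    "direct_hiring": rng.normal(15, 10, 60),
}
h_stat, p_global, pairwise = kruskal_then_pairwise(groups)
```

Pairs with a large p-value are candidates for being grouped together, which mirrors how simplified selection and direct hiring are treated as similar in the text.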
The differences in the medians of COST_DEVIATION across the significant variables are shown graphically in
Figure 6.
For TIME_DEVIATION, the significant categorical variables are presented in
Table 6, which confirms the same variables as previously identified. Regarding process type, the data show that public bidding contracts (highlighted in red) have a median time deviation of 49.45 and a mean of 83.89. Regarding project type, new construction projects reported substantially higher values (highlighted in red), with a median time deviation of 51.80 and a mean of 133.41, indicating a greater tendency for delays. This result contrasts with the cost deviation pattern, where construction improvement projects showed higher overruns.
The differences in the medians of TIME_DEVIATION across the significant variables are shown graphically in
Figure 7.
The results confirm the central hypothesis, demonstrating that the significant variables are associated with different stages of the project life cycle. These include key factors related to cost and time deviations, as summarized in
Table 7. Inherent project characteristics—such as project size, type, and planning-related variables like project intensity—are included. Additionally, procurement-stage variables such as award growth, number of bidders, and the procurement model used are also relevant. This classification reinforces the idea that both planning and procurement decisions play a critical role in determining project performance, thereby strengthening the connection between the statistical findings and the study’s hypothesis. Variables such as contract value, process type, and project type influence both cost and time. Cost-specific drivers—such as award growth and initial duration—underscore the importance of accurate cost estimation and scope control. Time-specific drivers—such as project intensity and time suspended—highlight the need for effective planning, scheduling, and risk mitigation.
3.3. Machine Learning
The Random Forest models made it possible to assess the effect of variables that interact simultaneously. First, the reduction in error was compared against the number of trees to determine an optimal forest size; starting from 100 trees, the optimal number was 61 for cost deviation and 75 for time deviation. After fitting the models with these numbers of trees, the five most important predictors were determined according to the reduction in the out-of-bag error (see
Figure 8a for COST_DEVIATION and
Figure 8b for TIME_DEVIATION).
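The tree-count selection step can be illustrated with a short sketch that tracks the out-of-bag (OOB) mean squared error as trees are added. It uses scikit-learn as a stand-in, since the text does not state which software was used; the data are synthetic, and the helper `oob_error_curve` and the grid of tree counts are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def oob_error_curve(X, y, tree_counts, seed=0):
    """Grow one forest incrementally and record the OOB MSE at each size."""
    model = RandomForestRegressor(
        warm_start=True,   # keep existing trees and only add new ones
        oob_score=True,    # compute out-of-bag predictions on each fit
        bootstrap=True,
        random_state=seed,
    )
    errors = {}
    for n_trees in tree_counts:
        model.set_params(n_estimators=n_trees)
        model.fit(X, y)
        errors[n_trees] = float(np.mean((y - model.oob_prediction_) ** 2))
    return errors

rng = np.random.default_rng(2)
# Synthetic predictors and target (NOT the study data).
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 300)

errors = oob_error_curve(X, y, tree_counts=range(25, 101, 25))
best_n_trees = min(errors, key=errors.get)  # size with the lowest OOB MSE
```

A finer grid (for example, every tree from 1 to 100) would be needed to pinpoint values such as 61 or 75. The coarse grid here keeps the run short and avoids very small forests, where some samples may never be out-of-bag.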
The Random Forest model for cost deviation reveals that INITIAL_DURATION is the most influential variable, showing the highest percentage increase in mean squared error (%IncMSE). This suggests that the initial duration of the project significantly impacts the accuracy of cost deviation predictions. It is followed in importance by NUMBER_BIDDERS and AWARD_GROWTH, indicating that both competition during the bidding process and growth in the awarded value play relevant roles. PROJECT_TYPE and PROJECT_INTENSITY show a relatively lower contribution, though they still provide valuable information to the model.
The Random Forest model used to predict time deviation highlights SUSPENDED_TIME as the most influential variable, with the highest percentage increase in mean squared error (%IncMSE). This indicates that the proportion of time a project was suspended significantly affects the accuracy of time deviation predictions. INITIAL_DURATION also plays a key role, followed by PROJECT_INTENSITY, suggesting that both the planned duration and the complexity or workload of the project contribute meaningfully to deviations in schedule. CONTRACT_VALUE and PROCESS_TYPE show lower importance, though they still offer useful insights for the model.
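The idea behind %IncMSE (the name used by R's randomForest package for permutation importance) is to shuffle one predictor at a time and measure how much the prediction error grows. The sketch below shows an analogous calculation with scikit-learn's `permutation_importance` on a held-out split. It is not the study's code: the data are synthetic, the feature names `strong`, `medium`, `weak`, and `noise` are hypothetical, and the scores are absolute MSE increases rather than percentages.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
names = ["strong", "medium", "weak", "noise"]  # hypothetical predictors

# Synthetic data in which the true influence of each column is known.
X = rng.normal(size=(400, 4))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.5, 400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Mean increase in MSE on held-out data when each column is shuffled.
result = permutation_importance(
    model, X_test, y_test,
    scoring="neg_mean_squared_error",
    n_repeats=10,
    random_state=0,
)
ranking = sorted(
    zip(names, result.importances_mean),
    key=lambda item: item[1],
    reverse=True,
)
```

A predictor whose shuffling barely changes the error (such as `noise` here) contributes little to the model, which is how the lower-ranked variables in the text should be read.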
4. Discussion
This study presents a comprehensive analysis of time and cost deviations in educational infrastructure projects in Colombia, using a data-driven approach based on open government datasets. It addresses a research gap by applying data science techniques to empirical data, aiming to generate actionable insights for decision-makers—particularly in the context of education-related infrastructure. It is important to note that the analysis was limited to finalized projects due to data availability. Colombia’s public procurement platform only publishes complete contractual information once a project is closed and settled. As such, execution-phase data for ongoing projects is not accessible, preventing early-stage deviation analysis.
The findings identify key variables that significantly contribute to project delays and cost overruns. They also help address the limited research on this topic in Europe and the Americas, a regional gap noted in the literature [
16].
Unlike traditional studies that rely heavily on expert opinions and stakeholder perceptions—often subject to bias or conflicts of interest—this research adopts a data-driven approach grounded in publicly available procurement records. By leveraging open government data from Colombia’s platform, the study ensures transparency, reproducibility, and objectivity in analyzing cost and time deviations. This methodological shift enables the examination of actual project outcomes across a broad sample, providing a more robust and evidence-based understanding of the factors influencing project performance.
The findings reveal a substantial average time deviation of 77.73% and a cost deviation of 22.17%, highlighting critical inefficiencies in project execution. It is important to note that, by law, cost deviations in Colombia must not exceed 50% of the project’s original value, which may explain the relatively lower average observed. These results underscore the need for improved planning and monitoring mechanisms in public infrastructure development. The literature shows considerable variability in cost overruns: in Ethiopia, they range from 2% to 248% with an average of 35% [
4]; in Portugal, the average is 19% in infrastructure projects [
6]; and in the Philippines, road projects report a mean overrun of 5.4% [
8]. In Colombia, rural road projects show lower deviations, with a mean cost overrun of 8% and a time deviation of 19% [
10].
The analysis reveals a significant correlation between cost and time overruns, suggesting that delays are often accompanied by increased expenditures. This interdependence highlights the importance of integrated project management strategies that address both dimensions simultaneously. Earlier studies reported similar findings [
61,
62], and more recent research confirms this relationship in highway construction projects [
63].
Variables related to the bidding process, including award growth, number of bidders, and process type, have been previously identified as influential during procurement. Errors in bidding and awarding have been linked to project underperformance [
64,
65,
66]. Studies using machine learning have shown that large-scale projects under the Design-Bid-Build/Low-Bid (DBB/LB) method are particularly prone to frequent change orders [
67]. A notable finding in this study is the relationship between the number of bidders and award growth. Greater competition during the bidding process, while generally beneficial, may lead to underestimation and aggressive pricing strategies, which can ultimately increase the risk of cost and schedule deviations [
68,
69]. In this study, initial duration was not a significant predictor of time deviation, aligning with findings that longer projects tend to have smaller schedule overruns [
70].
Optimism bias during planning [
71] and “optimism blindness,” a state of mind in which optimism blinds people to realities [
72], can also lead to unrealistic expectations and underestimated risks—factors likely reflected in the project intensity variable [
73]. Other studies emphasize that planning and estimation issues are major contributors to cost overruns globally [
38]. In Peru, deficiencies in technical documentation and inaccurate cost estimates are critical causes of delays [
74]. The variable time suspended has also been reported as relevant in rural road projects in Colombia [
10].
Another group of variables included project-specific factors such as type, region, and year. In this study, only project type was significant for both cost and time deviations. New construction projects showed higher time overruns, while their cost performance was better than that of improvement projects. Previous studies have also examined project type as a determinant. Flyvbjerg et al. [
75] found that road projects are less prone to cost escalation, while Love [
76] reported that rework costs do not significantly vary by project type. In Colombia, maintenance projects on rural roads tend to report higher deviations [
73]. Although year and region were not significant in this study, prior research has shown their influence. For example, election-related projects have been linked to performance issues [
10,
73], and regional factors have been associated with cost overruns in both infrastructure [
77] and educational building projects [
5].
Despite the robustness of the methodology, this study has certain limitations. The analysis was constrained by the scope of available data, which excluded qualitative factors such as stakeholder behavior, site-specific challenges, and contextual variables—such as social or political influences—that may affect project outcomes. It would be beneficial if the public data platform allowed for the classification of project locations as either urban or rural, as this distinction could provide additional insights into deviation patterns and regional disparities.
Although the quantitative approach used in this study provides strong insights into the factors influencing cost and time deviations, incorporating such qualitative factors through mixed-method approaches—such as interviews, case studies, or field observations—could significantly enhance the explanatory power of future models and provide a more comprehensive understanding of project performance in public infrastructure. Beyond the supervised learning approach used in this study, future research could apply unsupervised techniques such as clustering to identify groups of projects with similar risk profiles. This would enable a deeper understanding of shared patterns in delays and cost overruns, supporting more targeted planning and mitigation strategies.
While the methodology is sound, the sample is limited to educational infrastructure projects, and the findings should not be generalized to other types of public infrastructure without further validation. Future studies should expand their scope to include other countries or sectors, such as transportation or healthcare, and assess whether similar patterns hold across different types of projects. Future research could also benefit from approaches such as time-series analysis or survival models, which may better capture the dynamics of project delays over time. This is particularly relevant for understanding when delays are most likely to occur, how early indicators evolve, and how interventions might influence outcomes. Future studies should also incorporate broader and more diverse datasets, including qualitative insights and tracking, and adopt mixed-method approaches to deepen the understanding of the mechanisms behind time and cost deviations in public infrastructure projects. Additionally, further analysis of the underlying causes of these deviations could strengthen the evidence base for decision-making and policy development.
To enhance its practical application, future research could adapt the proposed methodology for real-time monitoring dashboards that support decision-making during project execution. While this study focused on post-completion analysis using finalized contract data, the use of open data and machine learning models makes the approach scalable for integration into live systems. With access to real-time procurement and execution data, predictive models could be deployed to continuously assess project risks, enabling early detection of deviations and timely interventions by policymakers.
5. Conclusions
This study proposes a methodological framework for identifying the most influential factors contributing to time and cost deviations in educational infrastructure projects in Colombia, employing a data-driven approach that leverages open government datasets and machine learning techniques. Key variables—such as contract value, initial project duration, award growth, suspension time, number of bidders, project intensity, process type, and project type—were found to affect project performance significantly. These findings have important implications for educational infrastructure planning and public procurement within the education sector, as they enable more accurate risk prediction and support the development of targeted mitigation strategies. Moreover, the results confirm the central hypothesis, demonstrating that these significant variables are not only impactful but also distinctly associated with different stages of the project life cycle, ranging from pre-contractual planning and bidding to execution and closure, highlighting the need for stage-specific interventions to enhance project outcomes.
Another key result is the strong correlation between time and cost deviations, indicating that budget overruns frequently accompany delays. This interdependence suggests that time and cost management should be approached as integrated processes within project planning and control. Enhancing scheduling practices and minimizing execution delays could directly contribute to cost containment. It is therefore recommended that project managers and public agencies implement comprehensive monitoring systems that track both schedule and budget indicators in real time, enabling the early detection of deviations and coordinated corrective actions to improve overall project performance.
The results are beneficial for policymakers, public sector managers, and infrastructure professionals involved in planning and overseeing educational construction projects. They also offer valuable insights for researchers and data analysts seeking to understand systemic inefficiencies in public procurement. By identifying key factors influencing delays and cost overruns, the findings support evidence-based decision-making aimed at improving infrastructure delivery and resource allocation.