Next Article in Journal
Assessment of Airport Pavement Condition Index (PCI) Using Machine Learning
Previous Article in Journal
Integrating Large Language Model and Logic Programming for Tracing Renewable Energy Use Across Supply Chain Networks
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Implementation and Rollout of a Trusted AI-Based Approach to Identify Financial Risks in Transportation Infrastructure Construction Projects

by
Michael Grims
1,
Daniel Karas
1,
Marina Ivanova
1,
Gerhard Höfinger
1,
Sebastian Bruchhaus
2,
Marco X. Bornschlegl
2,* and
Matthias L. Hemmje
2
1
STRABAG Innovation & Digitalisation, 1220 Vienna, Austria
2
Faculty of Mathematics and Computer Science, University of Hagen, 58097 Hagen, Germany
*
Author to whom correspondence should be addressed.
Appl. Syst. Innov. 2025, 8(6), 161; https://doi.org/10.3390/asi8060161
Submission received: 9 September 2025 / Revised: 10 October 2025 / Accepted: 21 October 2025 / Published: 24 October 2025
(This article belongs to the Special Issue AI-Driven Decision Support for Systemic Innovation)

Abstract

Using big data for risk analysis of construction projects is a largely unexplored area. In this traditional industry, risk identification is often based either on so-called domain expert knowledge, in other words on experience, or on different statistical and quantitative analysis of individual past projects. The motivation of this research is based on the implemented and evaluated data-driven and AI-based DARIA approach to identify financial risks in the execution phase of transportation infrastructure construction projects that shows exceptional results at an early stage of the project execution phase and has already been deployed into enterprise-wide production within the STRABAG group. Due to DARIA’s productive use, concern and doubts about the trustworthiness of its ML algorithm are certainly possible, especially when DARIA identifies risky projects while all conventional metrics within the STRABAG controlling system do not identify any problems. “If AI systems do not prove to be worthy of trust, their widespread acceptance and adoption will be hindered, and the potentially vast societal and economic benefits will not be fully realized”. Thus, and based on the results of a user study during DARIA’s successful deployment into enterprise-wide production, this paper focuses on the identification of suitable indicators to measure the trustworthiness of the DARIA ML algorithm in the interaction between individuals and systems as well as on the modeling of the reproducibility of the internal state of DARIA’s ML model.

1. Introduction and Motivation

The motivation of this research is based on [1], where the AI-based DARIA (Data-driven Risk Analysis in the Construction Industry) approach to identify financial risks in the execution phase of transportation infrastructure construction projects has been introduced [2]. To be more precise, this research represents the second part of [2] and had visually been presented at the Innovate 2025—World AI, Automation, and Technology Forum conference [3]. “DARIA shows exceptional project risk identification results at an early stage of the execution of transportation infrastructure construction projects and has already been deployed into enterprise-wide production in all transportation infrastructure construction units of the STRABAG group” [2].
However, and to justify its practical application, it must offer trustworthy, i.e., reproducible, valid, and capable features. Calibrated trust [4] aims to ensure that the level of trust placed in a system accurately reflects its capabilities. Users need to have calibrated trust in the results because of, e.g., business rules of due diligence. In this way, regulatory compliance is achieved. However, trust is a subtle psychological concept that cannot only be based on regulatory compliance alone. If users doubt, e.g., the AI capability, they are unlikely to utilize it. Therefore, DARIA has to offer suitable indicators to convey its trustworthiness.
Due to DARIA’s deployment into enterprise-wide production [2], “concern and doubts about the trustworthiness of the DARIA machine learning (ML) algorithm are certainly possible, especially if and when DARIA identifies risky projects at a time when all conventional metrics within the STRABAG controlling system do not identify any problems” [2] (primary motivation). “If artificial intelligence (AI) systems do not prove to be worthy of trust, their widespread acceptance and adoption will be hindered, and the potentially vast societal and economic benefits will not be fully realized” [5]. Moreover, “DARIA aims not to replace human workers” [2]. “Instead, and because intelligent people and intelligent computers have complementary abilities” [6], “people and computers can realize opportunities together that neither can realize alone” [7]. In this way, and based on the results of [2], “the observation of the human–computer interaction to identify potential concern and open questions about the trustworthiness of the DARIA AI application during its successful deployment into enterprise-wide production” [2] to define an approach for DARIA and future deployments of trustworthy DARIA-like approaches [6] can be identified as secondary motivation [6].
Whereas “AI is a foundational catalyst for digital business” [8], in reality most organizations struggle to scale their AI pilots into enterprise-wide production, which limits the ability to realize AI’s potential business value [9]. To be more precise, organizations “are discovering ’AI pilot paradox’, where launching pilots is deceptively easy but deploying them into production is notoriously challenging” [9]. Apart from these challenges inherent to DARIA, trust is a human emotion, which depends on many factors that are hard to address from a technical point of view. The way of instilling trust in DARIA is by demonstrating its trustworthiness. There are hardly any agreed upon metrics of trust in AI systems that are applicable to DARIA (first problem statement). So, how can the trustworthiness of DARIA be qualified?
In order to be reproducible, an AI system’s complete internal state must be known. This state or behavior is determined by the inputs, training data, and hyper-parameters. Initially, the training data and hyper-parameters as well as the selected ML model of DARIA were not perpetually accessible. Reproducibility serves as an indirect proxy measure of reliability by demonstrating consistency across repeated experiments or analyses. It also supports validity by confirming that findings are robust and not artifacts of methodological flaws or random chance. Therefore, reproducibility is a key factor for establishing trust in both the validity and reliability of a system like DARIA. To provide reproducible outputs of DARIA during its risk prediction, the necessary information to reproduce the system’s internal state needs to be collected, managed, archived, and preserved. Unfortunately, DARIA lacks such functionalities (second problem statement). So how can reproducibility be guaranteed?
This research on DARIA follows Nunamaker’s approach to systems development [10] for the phases observation, theory building, implementation, and evaluation. Based on the derived problem statements, this paper focuses on the identification of suitable criteria to establish trustworthiness of the DARIA ML algorithm in the interaction between individuals and systems (RQ1) as well as on the modeling of the reproducibility of the internal state of DARIA’s ML model (RQ2). In this way, after a review of relevant state of the art in science and technology, the results of the implemented DARIA ML model (c.f. [2]) will be outlined and validated before DARIA’s deployment process, including a user study to evaluate trustworthiness in the interaction between individuals and systems, will be described and discussed.

2. Selected Relevant State of the Art in Science and Technology

Related to the first goal of this paper to derive suitable criteria that the trustworthiness of the DARIA ML algorithm in the interaction between individuals and systems can be qualified (RQ1), the initial review of relevant selected state of the art is focusing on the definition of trust and trustworthiness first. In order to model the reproducibility of DARIA’s internal state (RQ2), relevant related work will be outlined and discussed.
Although it has been shown that trust is a crucial factor in the adoption of innovative technologies, “surprisingly, trust literature offers very little guidance for systematically integrating the vast amount of behavioral trust results into the development of computing systems” [11]. Nevertheless, trust is much more ambivalent [12]. “This becomes clear when the term is approached philosophically” [6]. David Easton highlights “that trust is based on experience with the outputs of decision-makers” [13]. “In contrast to specific support, however, it does not refer to the daily outputs of concrete decision-makers, but to ongoing positive experiences with outputs of various decision-makers over a longer period of time” [13]. In this way, Easton defines trust based on the reliability and competence as well as on the output of an individual or system instead of an individual or system itself [6]. “This becomes even clearer when combined with Annette Baier’s statements” [6]. Baier emphasizes that “trust is much easier to maintain than it is to get started and is never hard to destroy” [14].
“Trust is often discussed when it becomes fragile or when it ceases to be perceived as a given background and starts being questioned” [12]. Therefore, and as discussed in [6], trust can be defined based on the honesty of an individual and plays a significant role where there is no security or certainty [12]. “Especially in the context of AI, where the analytics process is turned into what is commonly referred to as a ’black box’ that is impossible to interpret, and therefore, humans are challenged to comprehend and retrace how the algorithm produces a result” [6].
Summarizing the discussion about trust and according to [6], “between individuals, trust is, on the one hand, based on their competence, and, on the other, based on their honesty [12]. Therefore, trust and trustworthiness have both at least a knowledge component and a moral component [15]. “With regard to the initial rational definition of trust (reliability, truth, and ability), truth and ability can be replaced by honesty and competence, whereas reliability remains as an important third attribute, according to the interpretation of Easton’s argumentation” [6].
“Even if these dimensions of trust are suitable for individuals, they are inappropriate for describing a system” [6]. Therefore, Ref. [6] introduced and discussed the corresponding dimensions that are able “to define trustworthiness based on the output of an individual or system instead of an individual or system itself” [6,13]. Figure 1 illustrates the derived “dimensions of trust and trustworthiness in the interaction between individuals as well as between individuals and systems” [6].
As outlined within [6], “the first trust dimension reliability, which describes the quality of individuals to perform consistently, can be represented by reproducibility in the context of a system” by describing the ability “of obtaining consistent computational results using the same input data, computational steps, methods, code, and conditions of analysis” [16]. “The next dimension honesty, which can be defined as adherence to the facts between individuals, can be represented by second dimension of trust in the context of a system validity, which describes the degree to which data or results are correct” [6]. Finally, “the dimension competence, which characterizes the ability of a person to do something well, can be represented by the term capability, which describes the general potential of a system to accomplish something” [6].
“Following the definition of trustworthiness in the interaction between individuals and systems that addresses the modeling of trustworthiness in AI-based big data analysis” [6], the Trustworthy AI-based Big Data Management (TAI-BDM) Reference Model—a combination of Kaufmann’s refined BDM Reference Model [17] with the AIGO’s (AI Group of Experts at the OECD) high-level structure of a generic AI system [18]—has been introduced in 2024 [6]. Figure 2 illustrates the Trustworthy AI-based Big Data Management Reference Model that integrates trustworthy AI into the whole big data analysis process [6].
As introduced and outlined in [6] and with focus on the right side of the conceptual TAI-BDM Reference Model, “technical aspects of AI within the whole big data analysis process are only located within the first three layers data integration, data analytics, and data interaction [6]. The fourth data effectuation layer focuses ”on the utilization of the data analysis results to create value in products, services, and operations of the organization” [6,17]. “Therefore, AI does not play a role in this layer. The second data analytics layer represents the most significant layer in the context of AI at first glance. Nevertheless, AI also plays an important role within both the data integration and the data interaction layers” [6].
Regarding the left side, data intelligence refers “to the ability of the organization to acquire knowledge and skills in the context of big data management” [6]. “Revisiting Easton’s definition that trust is based on the output of an individual or system instead of an individual or system itself [13] in combination with the integration of AI within the right-sided layers of the conceptual reference model for trustworthiness in AI-based big data analysis” [6], each layer with AI-support needs an individual trust component based on its output within the data intelligence layer [6]. “Even then, the overall trustworthiness of a system depends on the trustworthiness of the outcome of the entire process” [6]. Thus, “the individual trustworthiness of the AI-supported layers needs to be integrated into an overall trust building process [6]. In this process, “the dimensions of trustworthiness in the interaction between individuals and AI components within certain layers as well as information about past interactions between individuals and the overall AI system” [6] in “the emergent knowledge process of higher order” [17], symbolized by the outer loop of the conceptual reference model for trustworthiness in AI-based big data analysis, need to be stored to gauge trustworthiness of the overall AI-based big data analysis to enable a trustworthy AI [6,19].
Focusing on TAI-BDM’s trust bus component, [20] introduced the concept of Abbass’ trust bus as a reference architecture designed to enhance trustability analytics [6]. As illustrated in Figure 3, “the trust bus is based on a Belief-Desire-Intention-Trust-Motivation (BDI-TM) framework, which involves several components such as actors and entities memory, trust production, identity management, intent management, emotion management, risk management, and complexity management” [20].
“This architecture aims to facilitate interactions within AI systems and between AI systems and users by synthesizing qualified trust through a rule-based expert system” [6]. “The trust bus processes metadata to recommend analytics, preprocessing, and visualization methods that maximize user trust in different scenarios” [20]. Actual domain experts can initially substitute an expert system [6].
After a review of relevant state of the art to derive suitable criteria for evaluating the trustworthiness of the DARIA ML algorithm in the interaction between individuals and systems, relevant related work has been outlined and discussed. In the following section, the process of establishment of trustworthiness in DARIA will be outlined based on the TAI-BDM Reference Model.

3. Establishment of Trustworthiness in DARIA

As described in detail in [2], “using big data for risk analysis of construction projects is a largely unexplored area”. “In this traditional industry, risk identification is often based either on so-called domain expert knowledge, in other words on experience, or on different statistical and quantitative analysis of individual of projects that have already been completed” [2,21]. “Tools, commonly used among construction managers and quantity surveyors, include but are not limited to checklists, cause analysis, SWOT-analysis [22], decision trees [23] and fault tree analysis [23], singular surveys, or multi-stage survey procedures (e.g., Delphi method)” [2]. “At best, such manual methods can only reveal the reasons for the success or failure of individual projects after they have been completed, whereas the main goal must be to reliably identify high-risk projects during their execution in order to mitigate risks and minimize financial losses” [2].
“When examining numerous construction projects, statistical patterns emerge. Using, e.g., STRABAG’s internal construction business data from its various infrastructure construction business units, it has already been proven that empirical project margins always follow an asymmetrical Laplace distribution [24], which is only determined by three parameters” [2]. According to [24], “these three parameters allow analysis, comparison, benchmarking and even trend analysis of the distribution” [2].
Probabilistic applications like the Monte Carlo simulation are interesting approaches in risk analysis because large amounts of data can be processed [25], “but they are more useful for analyzing input parameters that are not clearly defined” [6]. “While Monte Carlo simulations are well suited for dealing with uncertainty in individual input parameters, ML methods are the method of choice for classifying projects through pattern recognition” [1]. “Nowadays the utilization of ML is a promising approach to assist in decision making for risk assessment” [6,26]. “However, there are only a few approaches that are relevant for the construction industry” [6]. As outlined within [2], Gondia et al. [27] “developed two ML models: Decision trees and naive Bayes classifiers, which predict schedule risks in a construction project with an accuracy of 74.5% and 78.4%, respectively” [2]. Another approach to ML models is proposed by Yaseen et al. [23] “where a hybrid model is presented, which consists of a combination of a random forest and a genetic algorithm” [2]. “This model predicts the delay in a construction project in Iraq with an accuracy of 91.67%” [6]. Egwim et al. [28] use Ensemble Machine Learning Algorithms (EMLA) for the first time in literature to predict time delays in construction projects to “prove that the use of ensemble algorithms for predicting delays in construction projects has greater predictive power compared to a single algorithm” [2].
“Currently, ML models are mainly being developed for predicting time delays, i.e. for schedule risks” [6]. “These continue to be a major influencing factor in construction projects, having a significant negative impact on both the industry and the global economy” [1].
As a part of transforming STRABAG into a data-driven organization, DARIA was brought to life with the idea to utilize ML algorithms to automatically predict the financial outcome based on commercial controlling data without using domain expert knowledge in order to potentially assist business domain experts in the mitigation of financial losses with an objective and data-driven approach. At that time, company-wide AI applications had not yet been applied. Accordingly, potential business domain experts of DARIA were not familiar with the usage of AI applications. Thus, the initial DARIA experiments were conducted without domain expert knowledge. Until this point, the first stable DARIA ML approach utilized financial project data of around 10,000 completed transportation infrastructure construction projects to train an XGBoost [29] ML model. With a particular focus on the impact, the defined and implemented data-driven and AI2VIS4BigData-based [30] DARIA approach (c.f. Figure 4) has initially been evaluated by means of a quantitative evaluation [6].
DARIA has been integrated into an existing productive industrial organization system architecture [2]. Nevertheless, “there is no direct interchange between the business domain experts and DARIA because the resulting risk predictions have been integrated into the existing standardized group-wide controlling applications as additional risk indicators” [2]. In this way, “the DARIA results could be easily evaluated by the aid of state-of-the-art statistical (AI-)algorithm performance metrics such as precision and recall [2].
From a global perspective, DARIA’s ML model allows a project outcome prediction based on the four classes (flop, negative, positive, top) [2]. “These classes are in alignment with STRABAG internally defined thresholds for project performance, which are based on financial results in relation to contract value” [2]. Regarding the aforementioned evaluation performance metrics and as outlined within [2], recall measures how well the model identifies true positives, i.e. out of all flop projects, how many were correctly identified, whilst precision answers the question “out of all projects predicted as flop, how many are actually flop, thus, providing insights about the quality of the positive predictions” [2]. Precision and recall are defined as follows within [2]:
Precision ( class )   =   TP ( class ) TP ( class )   +   FP ( class ) Recall ( class )   =   TP ( class ) TP ( class )   +   FN ( class )
“The earlier project risks can be identified, the greater the potential for successful mitigation” [2]. “Considering this as well as the relatively short duration of transportation infrastructure construction projects, DARIA focuses on the first 12 months of project execution” [2]. Therefore and as outlined in Table 1, a 3-, 6-, 9-, and 12-month ML model was implemented, trained, and discussed in [2].
During model development, the most crucial part of generating the ML model was intensive feature engineering. After performing a hyperparameter tuning, a recall of 64.29% could be achieved for the flop class within the 3 month ML model. The recall increased up to 89.29% in the 12 month ML model.
After this initial quantitative evaluation of the algorithm itself (c.f. [6]), the trust building process of DARIA will be described in detail. This includes the three dimensions of trustworthiness reproducibility, validity, and capability (c.f. Figure 1) which had to be consistently considered during DARIA’s development and especially for its deployment into enterprise-wide production.
The first step in the development of DARIA ML approach is to conduct various statistical experiments, e.g., in depth time-series analysis, to identify the most promising variables for distinguishing construction projects with a high financial risk from non-risky projects. This leads to a selection of variables, which could be used for training a state-of-the-art neural network. The direct choice of algorithm is based on the fact that an abundance of different algorithms in the first phase of DARIA would probably have overwhelmed the users and hence damaged trust. The selection and disclosure of a single reliable state-of-the-art algorithm fosters trust. Communicating the DARIA results to end users is one of the main project challenges as typical for the development and deployment of any AI solution. As humans respond much better to visualizations of complex facts in comparison to just describing them with a metric, visualizations were created and used to enable an initial trust building process.
The first visualization is based on the fact that most of the end users of the solution are project controllers who work regularly with controlling data. This allows them to interpret typical time series of said data and find insights based on their experience. Therefore, a user-centered system design approach is determined by the domain experts’ visual habits. A visualization (Figure 5), depicting trends of relevant key figures, has been created as a result.
To generate the overall data basis for this plot, the data needs to be grouped by following variables variable_name, month_ongoing, and label, where variable_name represents the name of the respective KPI of the controlling data, month_ongoing represents the ongoing month of the construction project, and label represents a categorical variable with the possible value top, positive, negative, or flop in alignment with the STRABAG classes for project performance. Subsequently, the mean and the confidence interval for each group is calculated. All values are normalized as percentage values to make them comparable for different project sizes. With this illustration that enables the comparison of typical project controlling data (relevant key figures) of a specific project over time in the context of a typical progression of the predicted class (flop in this illustration), business domain experts are empowered to gain insights based on their domain experience without understanding DARIA’s ML algorithm in detail. In this way, the validity (c.f. Figure 1) of DARIA can be attested.
The second visualization (Figure 6) illustrates the development of probabilities of the four classes top, positive, negative, or flop over the ongoing months of a construction project. This enables a concise interpretation of the results.
In this exemplary visualization, it can be clearly observed that the ML model predicts the financial outcome as flop already in the first month of project execution. Over time, DARIA enables more precise predictions, which results from the increasing amount of project data. This example illustrates typical patterns of flop classes.
These results also had to be presented to domain experts. In order to show the capability (c.f. Figure 1) of DARIA’s ML model, both visualizations are subsequently included in a web-based dashboard, which allows a dynamic use and evaluation. These visualizations enable an intensive dialog with domain experts and contributed to a more comprehensive selection of feature variables based on specific domain knowledge. Considering domain experts’ suggestions not only improves the ML model, but also enhanced the trust of the business domain experts.
After the successful verification of validity and capability and in alignment with the identified dimensions of trust and trustworthiness in the interaction between individuals and systems (c.f. Figure 1), relevant business domain users highlighted DARIA’s reproducibility as a crucial factor for creating sustainable trustworthiness (RQ2). End users expect to receive identical results when using identical input data. As a consequence, a switch from neural networks to XGBoost has been performed due to the reproducibility of training results [6]. Neural networks are often referred to as a “black box” that imitates the human brain and include many random processes during training, such as random weights initialization or drop out layers. XGBoost, on the other hand, is a series of decision trees, where each tree corrects the errors of its predecessor and also includes the possibility to work with randomization. However, it has the option of eliminating random elements, hence making training results completely reproducible. Moreover, “results show that tree-based models remain state-of-the-art on medium-sized data (approx. 10k samples) even without accounting for their superior speed” [31]. In this way, the reproducibility (c.f. Figure 1) of DARIA can be attested.
After improving DARIA’s ML model with the feedback of business domain experts (risk managers), a sufficient degree of trust has been obtained to provide this solution to selected end users (project controllers) within an initial and supervised test phase. In this test phase, the overall trust building process between end users and DARIA highly depends on how accurately DARIA predicted the financial outcome of transportation infrastructure construction projects.
For project controllers that tested DARIA they were of significant importance how many false positives DARIA’s ML algorithm generated. The introduced precision metric is generally a good indicator for this question. However, during the ML model development phase with participation of domain experts it became clear that there was a significant difference between a false positive value, when a flop project was predicted as a negative project or as a positive or even top project. Predicting a negative as a flop project was deemed acceptable by the domain experts. Therefore, the adjusted precision metric has been defined as follows [2]:
Adjusted Precision ( flop )   =   TP adjusted ( flop ) TP ( flop )   +   FP ( flop ) with TP adjusted :   =   TP ( flop )   +   # { negative projects predicted as flop }
Recall is generally suitable to answer the question, how many of all flop projects could be detected at all. This was always a very important metric in DARIA, but it must be considered in relation to precision, as these two metrics are in trade-off. Training an ML model with extraordinary recall usually leads to poor precision, and vice versa. As already outlined within [2], Table 2 shows the detailed flop classifier results for the adjusted precision for the 3-, 6-, 9-, and 12-month ML model and Figure 7 shows the distribution of the adjusted precision (flop probability) for projects correctly predicted as flops for the months 3, 6, 9, and 12 [2].
“It can be observed that in each violin plot the mean (represented by a black dot) is clearly above 0.5” [2]. “Moreover, shape of all four violin plots indicates that the data distribution is negatively skewed, meaning that the majority of values are above the mean” [2]. In conclusion, “it can be stated that correct flop predictions are made with a high degree of certainty, which increases for each later model in the time series” [2].
Due to the fact that the defined and implemented DARIA approach (c.f. Figure 4) “has been integrated into an existing productive industrial organization system architecture” [2], “there is no direct interchange between the business domain experts and DARIA” [2]. “The resulting risk predictions have been integrated into the existing standardized group-wide controlling applications as additional indicators” [2]. This decision supports the overall trust building process as the STRABAG Controlling Portal is an existing and trusted system [2]. Moreover, “this strategy kickstarted the trust qualification process from the very beginning” [2].
Even though standardized and enterprise-wide software components within the STRABAG Group (Data Science Hub, Controlling Portal) that are utilized to implement the data management & curation, insight & effectuation, and interaction & perception process stages, the respective decision for the (most important) AI-modeling (analytics) process stage remained open [2]. In addition to the technical algorithm-related capabilities, the capability of easily deploying an endpoint, which could be used by the Controlling Portal to provide live predictions, is crucial for deploying DARIA into enterprise-wide production [2]. In this way, Databricks [32] on Microsoft Azure serves as basis of the DARIA MLOps technology stack, whereas the endpoint deployment is realized via a deep integration of MLflow [33] into Databricks [2]. “MLflow covers all major parts of the ML life cycle, from automatic logging of the training of ML models to the deployment of an application programming interface (API) endpoint” [33]. MLflow also considers security and scalability, “which are crucial factors when deploying AI pilots into enterprise-wide production and hence contributes to the overall trustworthiness” [2].
In conclusion, suitable criteria to evaluate the trustworthiness of DARIA’s ML algorithm have been derived in this section as per the first goal of this paper. With a focus on the reproducibility of the internal state of DARIA’s ML model (RQ2), even though DARIA is based on the TAI-BDM Reference Model and therefore, implements the concept of a trust bus, its software architecture is a step towards a full implementation of the trust bus, which can be identified as a remaining challenge for further research.

4. Evaluation

After an initial quantitative evaluation of the algorithm itself and an intensive qualitative evaluation based on the feedback of the business domain experts to prove DARIA’s trustworthiness according to the derived dimensions validity, capability, and reproducibility, DARIA was exposed to selected test users (project controllers) within a supervised test phase. The overall trust building process between end users and DARIA was evaluated for a first time in this phase.
However, as there were no company-wide AI applications at that point in time, concern and doubts about the trustworthiness of DARIA’s ML algorithm are certainly reasonable from an end user perspective, especially in cases where DARIA identifies risky projects while all conventional metrics within the STRABAG controlling system do not identify any problems. Therefore, in order to evaluate and boost further DARIA’s trustworthiness prior to its enterprise-wide deployment, an extensive pilot phase was conducted with input of 30 business domain users (commercial division managers, project controllers, and construction managers) across 10 countries (Relevant countries with transportation infrastructure construction projects: DE, AT, PL, CZ, SK, HU, RS, HR, RO, and BG) of the STRABAG Group.
All participants evaluated DARIA’s prediction performance based on more than one hundred both completed and ongoing transportation infrastructure construction projects (10 per country) by means of cognitive walkthrough sessions. Thus, a combination of structured questionnaires and several feedback rounds engaged DARIA’s business domain users not only to gather insights on the ML model’s accuracy but also to foster a sense of collaborative development to establish an overall trustworthiness of DARIA, despite the fact that DARIA has been integrated into the existing productive industrial organization system architecture.
The questionnaire’s results show a strong alignment between DARIA predictions and pilot users’ assessments. A total of 75% of the business domain users confirmed that DARIA’s predictions match their assessment of the tested projects. Moreover, they highlighted that the additional visualizations (c.f. Figure 5 and Figure 6) supported the trust building process. As outlined in Section 3, these visualizations facilitated a dynamic use and evaluation. For example, model input variables, such as financial result and working capital, were frequently cited as accurate markers that reinforced pilot users’ trust in DARIA’s predictions. The business domain users consistently reported that this alignment boosted their confidence as they were provided with a visualization of how DARIA is reflecting the quantitative criteria well-known to their roles. Most importantly, the pilot users unanimously stated that they find the overall trends in the predictions correct and valuable. This underlines their belief in the validity, capability, and reproducibility of DARIA.
During this pilot phase the business domain users also highlighted potential areas for improvement. Country-specific details, like project contract and payment structure, could lead to an incorrect prediction due to a misinterpretation of the working capital indicator. Moreover, construction managers discovered that winter pauses in project activity are not weighted sufficiently in the ML model, which leads to overly optimistic projections during inactive months. Thus, the optimization of the ML model (consideration of contract type, winter pause, etc.) can be identified as a remaining challenge for further research.
Regardless of the aforementioned areas for improvement, the business domain users unanimously stated that no data-driven ML model could make a completely accurate prediction because of poor or manipulated data. This highlights that the pilot users developed calibrated trust in the model based on prior knowledge from their domain of expertise. In this way, and summarizing the evaluation phase, both the feedback loop and the early business domain user involvement during the piloting process itself contributed greatly to the trust building process of DARIA.

5. Conclusions and Outlook

As outlined in Section 1, the motivation of this research is based on [1], where the AI-based DARIA approach to identify financial risks in the execution phase of transportation infrastructure construction projects (c.f. [2]) “has been introduced, presented, and discussed” [2]. To be more precise, this research represents the second part of [2] and had visually been presented at the Innovate 2025—World AI, Automation, and Technology Forum conference [3].
“Although DARIA shows exceptional project risk identification results at an early project execution stage and has already been deployed into enterprise-wide production in all transportation infrastructure construction units of the STRABAG group” [2], “concern and doubts about the trustworthiness of DARIA’s ML algorithm are certainly possible, especially if DARIA identifies risky projects at a time when all other metrics within the STRABAG controlling system are still unremarkable” [2].
“If AI systems do not prove to be worthy of trust, their widespread acceptance and adoption will be hindered, and the potentially [...] economic benefits will not be fully realized” [5]. Concluding the establishment of trustworthiness in DARIA and related to the first goal of this paper, suitable criteria to qualify the trustworthiness (validity, capability, and reproducibility) have been derived and evaluated (RQ1). With a particular focus on the reproducibility of the internal state of DARIA’s ML model (RQ2), “even though DARIA is based on TAI-BDM and therefore, implements the concept of a trust bus, its software architecture is a step towards a full implementation of the trust bus” [6], which can be identified as a first remaining challenge for further research.
“Even though DARIA has been integrated into the existing productive industrial organization system architecture” [2], the integration of DARIA’s ML model into the company’s existing processes took significantly longer than initially anticipated [2]. Due to the fact that DARIA is the first company-wide and self-developed AI framework, a robust MLOps framework, a new data governance standard, and security protocols had to be established before deploying DARIA into enterprise-wide production.
While business domain experts during the model development and the test phase as well as further users during the pilot phase developed calibrated trust in DARIA’s ML model based on prior knowledge from their domain of expertise, potential areas for improvement were recognized. Country-specific details like, e.g., project contract and payment structure can lead to an incorrect prediction due to a misinterpretation of the working capital indicator. Moreover, construction managers within the business domain users discovered that winter pauses in project activity are not considered well in the ML model, which leads to overly optimistic projections during inactive months. Thus, the optimization of the ML model (e.g., consideration of contract type, winter pause, etc.) could be identified as a second remaining challenge for further research.
Moreover, “identifying risks for the financial outcome of a construction project as early as possible is crucial for successful mitigation and reduction in potential financial losses” [2]. “In the current version, DARIA is able to identify risks after three months of project execution on a trustworthy base” [2]. “Unfortunately, even if risky projects are identified at an early execution stage, this detection only enables a reduction in the potential loss (damage limitation)” [2]. Therefore, DARIA needs to be adjusted accordingly to be able “to identify risks prior to project start or even during the selection phase of potential construction projects” [2]. However, and as already discussed within [2], “during both selection and tender phases of construction projects, no extensive financial reporting exists” [2]. In this way, the extension of DARIA’s ML model to both selection and tender phases can be hence identified as a third remaining challenge for further research.

Author Contributions

Conceptualization, M.G., D.K., M.I. and G.H.; methodology, M.L.H.; software, M.G. and D.K.; validation, M.G., D.K., M.I., G.H. and S.B.; investigation, S.B. and M.X.B.; writing—original draft preparation, M.G., D.K., M.I. and M.X.B.; writing—review and editing, M.G., D.K., M.I., G.H., S.B., M.X.B. and M.L.H.; visualization, M.G. and D.K.; supervision, M.X.B. and M.L.H.; project administration, M.G. and M.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Austrian Research Promotion Agency (Österreichische Forschungsförderungsgesellschaft mbH (FFG)) General Programmes 2022, FFG Project No.: FO999897631, Proposal Acronym: DARIA [1].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Michael Grims, Daniel Karas, Marina Ivanova and Gerhard Höfinger was employed by the company STRABAG Innovation & Digitalisation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. FFG—The Austrian Research Promotion Agency. Data-driven Risk Analysis in the Construction Industry. In General Programmes 2022, FFG Project Number: FO999897631, Proposal Acronym: DARIA; FFG—The Austrian Research Promotion Agency: Vienna, Austria, 2022. [Google Scholar]
  2. Ivanova, M.; Grims, M.; Karas, D.; Höfinger, G.; Bornschlegl, M.X.; Hemmje, M.L. An AI-Based Approach to Identify Financial Risks in Transportation Infrastructure Construction Projects. In Artificial Intelligence Applications and Innovations; Maglogiannis, I., Iliadis, L., Macintyre, J., Avlonitis, M., Papaleonidas, A., Eds.; Springer: Cham, Switzerland, 2024; pp. 158–173. [Google Scholar] [CrossRef]
  3. Innovate 2025 World AI, Automation, and Technology Forum Conference 2025; Davos, Switzerland; SMART Society, USA. Available online: https://innovate2025.smartsociety.org/ (accessed on 7 October 2025).
  4. Wischnewski, M.; Krämer, N.; Müller, E. Measuring and Understanding Trust Calibrations for Automated Systems: A Survey of the State-Of-The-Art and Future Directions. In Proceedings of the CHI ’23: 2023 CHI Conference on Human Factors in Computing Systems, Hamburg, Germany, 23–28 April 2023; pp. 1–16. [Google Scholar] [CrossRef]
  5. KPMG AG Wirtschaftsprüfungsgesellschaft. Trust in Artificial Intelligence: A Five Country Study; The University of Queensland: Brisbane, QLD, Australia, 2021. [Google Scholar]
  6. Bornschlegl, M.X. Towards Trustworthiness in AI-Based Big Data Analysis; Research Report; FernUniversität: Hagen, Germany, 2024. [Google Scholar] [CrossRef]
  7. Winston, P.H. Artificial Intelligence; Addison-Wesley: Boston, MA, USA, 1984. [Google Scholar] [CrossRef]
  8. Chandrasekaran, A.; Burke, B.; Brethenoux, E. Building a Digital Future: Emergent AI Trends; Technical Report; Gartner, Inc.: Stamford, CT, USA, 2022. [Google Scholar]
  9. Gartner, Inc. Gartner Predicts the Future of AI Technologies; Gartner, Inc.: Stamford, CT, USA, 2019; Available online: https://www.gartner.com/smarterwithgartner/gartner-predicts-the-future-of-ai-technologies (accessed on 7 October 2022).
  10. Nunamaker, J.F.; Chen, M.; Purdin, T.D. Systems Development in Information Systems Research. J. Manag. Inf. Syst. 1990, 7, 89–106. [Google Scholar] [CrossRef]
  11. Söllner, M.; Hoffmann, A.; Hoffmann, H.; Leimeister, J.M. Vertrauensunterstützung für sozio-technische ubiquitäre Systeme. Z. Betriebswirtschaft 2012, 82, 109–140. [Google Scholar] [CrossRef]
  12. Simon, J. Vertrauen in die Wissenschaft–philosophische Erwägungen. In Infektionen und Gesellschaft: Was Haben Wir von COVID-19 Gelernt? Springer: Berlin/Heidelberg, Germany, 2022; pp. 30–35. [Google Scholar]
  13. Easton, D. A Re-Assessment of the Concept of Political Support. Br. J. Political Sci. 1975, 5, 435–457. [Google Scholar] [CrossRef]
  14. Baier, A. Trust and Antitrust. In Feminist Social Thought; Routledge: Oxford, UK, 2014; pp. 604–629. [Google Scholar]
  15. Hardwig, J. The Role of Trust in Knowledge. J. Philos. 1991, 88, 693–708. [Google Scholar] [CrossRef]
  16. National Academies of Sciences, Engineering, and Medicine; Division of Behavioral and Social Sciences and Education; Division on Earth and Life Studies; Division on Engineering and Physical Sciences; Policy and Global Affairs; Board on Behavioral, Cognitive, and Sensory Sciences; Committee on National Statistics; Nuclear and Radiation Studies Board; Board on Mathematical Sciences and Analytics; Committee on Applied and Theoretical Statistics; et al. Reproducibility and Replicability in Science; National Academies Press: Washington, DC, USA, 2019. [Google Scholar]
  17. Kaufmann, M. Towards a Reference Model for Big Data Management; FernUniversität: Hagen, Germany, 2016. [Google Scholar]
  18. OECD. Artificial Intelligence in Society; OECD: Paris, France, 2019; p. 152. [Google Scholar] [CrossRef]
  19. Abbass, H.A.; Leu, G.; Merrick, K. A Review of Theoretical and Practical Challenges of Trusted Autonomy in Big Data. IEEE Access 2016, 4, 2808–2830. [Google Scholar] [CrossRef]
  20. Bruchhaus, S.; Reis, T.; Bornschlegl, M.X.; Störl, U.; Hemmje, M. Towards a User-Empowering Architecture for Trustability Analytics. In BTW 2023; Gesellschaft für Informatik e.V.: Bonn, Germany, 2023; pp. 901–914. [Google Scholar] [CrossRef]
  21. De Marco, A.; Thaheem, J. Risk Analysis in Construction Projects: A Practical Selection Methodology. Am. J. Appl. Sci. 2014, 11, 74–84. [Google Scholar] [CrossRef]
  22. Cui, Q.; Erfani, A. Automatic Detection of Construction Risks. In ECPPM 2021-eWork and eBusiness in Architecture, Engineering, and Construction; CRC Press: London, UK, 2021; pp. 184–189. [Google Scholar] [CrossRef]
  23. Yaseen, Z.M.; Ali, Z.H.; Salih, S.Q.; Al-Ansari, N. Prediction of Risk Delay in Construction Projects Using a Hybrid Artificial Intelligence Model. Sustainability 2020, 12, 1514. [Google Scholar] [CrossRef]
  24. Lulei, F. Verteilung von Projektrenditen—Risiko neu interpretiert. In Bau Aktuell—Heft 1/2024; Linde Verlag GmbH: Vienna, Austria, 2021; pp. 28–31. [Google Scholar]
  25. Hofstadler, C.; Kummer, M. Grundlagen der Monte-Carlo-Simulation. In Chancen- und Risikomanagement in der Bauwirtschaft: Für Auftraggeber und Auftragnehmer in Projektmanagement, Baubetrieb und Bauwirtschaft; Springer: Berlin/Heidelberg, Germany, 2017; pp. 191–248. [Google Scholar] [CrossRef]
  26. Paltrinieri, N.; Comfort, L.; Reniers, G. Learning About Risk: Machine Learning for Risk Assessment. Saf. Sci. 2019, 118, 475–486. [Google Scholar] [CrossRef]
  27. Gondia, A.; Siam, A.; El-Dakhakhni, W.; Nassar, A.H. Machine Learning Algorithms for Construction Projects Delay Risk Prediction. J. Constr. Eng. Manag. 2020, 146, 04019085. [Google Scholar] [CrossRef]
  28. Egwim, C.N.; Alaka, H.; Toriola-Coker, L.O.; Balogun, H.; Sunmola, F. Applied Artificial Intelligence for Predicting Construction Projects Delay. Mach. Learn. Appl. 2021, 6, 100166. [Google Scholar] [CrossRef]
  29. The XGBoost Contributors. XGBoost (Version 2.03); XGBoost: Seattle, WA, USA, 2023; Available online: https://xgboost.ai/ (accessed on 12 December 2023).
  30. Reis, T.; Kreibich, A.; Bruchhaus, S.; Krause, T.; Freund, F.; Bornschlegl, M.X.; Hemmje, M.L. An Information System Supporting Insurance Use Cases by Automated Anomaly Detection. Big Data Cogn. Comput. 2023, 7, 4. [Google Scholar] [CrossRef]
  31. Grinsztajn, L.; Oyallon, E.; Varoquaux, G. Why do tree-based models still outperform deep learning on tabular data? arXiv 2022, arXiv:2207.08815. [Google Scholar] [CrossRef]
  32. Databricks, Inc. . Databricks AI and Machine Learning; Databricks, Inc.: San Francisco, CA, USA, 2023; Available online: https://www.databricks.com/product/machine-learning (accessed on 15 December 2023).
  33. LF Projects, LLC. MLflow; LF Projects, LLC: Wilmington, DE, USA, 2023; Available online: https://mlflow.org/ (accessed on 15 December 2023).
Figure 1. Dimensions of trust and trustworthiness in the interaction between individuals as well as between individuals and systems [6].
Figure 1. Dimensions of trust and trustworthiness in the interaction between individuals as well as between individuals and systems [6].
Asi 08 00161 g001
Figure 2. Trustworthy AI-based Big Data Management (TAI-BDM) Reference Model [6].
Figure 2. Trustworthy AI-based Big Data Management (TAI-BDM) Reference Model [6].
Asi 08 00161 g002
Figure 3. Trust Bus Architecture for AI2VIS4BigData [20].
Figure 3. Trust Bus Architecture for AI2VIS4BigData [20].
Asi 08 00161 g003
Figure 4. Conceptual DARIA Workflow based on AI2VIS4BigData [6].
Figure 4. Conceptual DARIA Workflow based on AI2VIS4BigData [6].
Asi 08 00161 g004
Figure 5. Comparison of typical project controlling data of a specific project in the context of a typical progression of the predicted class.
Figure 5. Comparison of typical project controlling data of a specific project in the context of a typical progression of the predicted class.
Asi 08 00161 g005
Figure 6. Project outcome (top, positive, negative, or flop) probabilities.
Figure 6. Project outcome (top, positive, negative, or flop) probabilities.
Asi 08 00161 g006
Figure 7. Distribution of the adjusted precision (correct flop probabilities) [2].
Figure 7. Distribution of the adjusted precision (correct flop probabilities) [2].
Asi 08 00161 g007
Table 1. Recall and Precision for the 3-, 6-, 9-, and 12-Months ML Model [2].
Table 1. Recall and Precision for the 3-, 6-, 9-, and 12-Months ML Model [2].
Months36912
Metrics
recall64.29%75.00%82.14%89.29%
precision33.75%51.22%60.00%67.57%
Table 2. Adjusted precision for the 3-, 6-, 9-, and 12-Months ML model [2].
Table 2. Adjusted precision for the 3-, 6-, 9-, and 12-Months ML model [2].
Months36912
Metrics
adjusted precision73.75%82.93%89.57%91.89%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Grims, M.; Karas, D.; Ivanova, M.; Höfinger, G.; Bruchhaus, S.; Bornschlegl, M.X.; Hemmje, M.L. Implementation and Rollout of a Trusted AI-Based Approach to Identify Financial Risks in Transportation Infrastructure Construction Projects. Appl. Syst. Innov. 2025, 8, 161. https://doi.org/10.3390/asi8060161

AMA Style

Grims M, Karas D, Ivanova M, Höfinger G, Bruchhaus S, Bornschlegl MX, Hemmje ML. Implementation and Rollout of a Trusted AI-Based Approach to Identify Financial Risks in Transportation Infrastructure Construction Projects. Applied System Innovation. 2025; 8(6):161. https://doi.org/10.3390/asi8060161

Chicago/Turabian Style

Grims, Michael, Daniel Karas, Marina Ivanova, Gerhard Höfinger, Sebastian Bruchhaus, Marco X. Bornschlegl, and Matthias L. Hemmje. 2025. "Implementation and Rollout of a Trusted AI-Based Approach to Identify Financial Risks in Transportation Infrastructure Construction Projects" Applied System Innovation 8, no. 6: 161. https://doi.org/10.3390/asi8060161

APA Style

Grims, M., Karas, D., Ivanova, M., Höfinger, G., Bruchhaus, S., Bornschlegl, M. X., & Hemmje, M. L. (2025). Implementation and Rollout of a Trusted AI-Based Approach to Identify Financial Risks in Transportation Infrastructure Construction Projects. Applied System Innovation, 8(6), 161. https://doi.org/10.3390/asi8060161

Article Metrics

Back to TopTop