Article

AI-Assisted Arbitrator Selection in Construction Disputes: An Expert-Calibrated Large Language Model Framework

by
Mohammad Mobadersani
1,
Ali Bedii Candas
2,
Murat Kuruoğlu
1 and
Onur Behzat Tokdemir
1,3,*
1
Department of Civil Engineering, Istanbul Technical University, 34469 Maslak, Istanbul, Turkey
2
Department of Civil Engineering, Middle East Technical University, 06800 Çankaya, Ankara, Turkey
3
AI-Center, Istanbul Technical University, 34469 Maslak, Istanbul, Turkey
*
Author to whom correspondence should be addressed.
Buildings 2026, 16(1), 120; https://doi.org/10.3390/buildings16010120
Submission received: 21 November 2025 / Revised: 23 December 2025 / Accepted: 24 December 2025 / Published: 26 December 2025
(This article belongs to the Section Construction Management, and Computers & Digitization)

Abstract

Arbitration efficiency is widely recognized as a factor influencing outcomes in construction disputes. To increase the chance of finding and designating the best-fit arbitrator, a large number of candidate profiles must be investigated, which is an overwhelming, time-consuming process. This study develops and evaluates a large language model (LLM)-enabled framework for arbitrator selection based on dispute details and predefined expert criteria. To reach this goal, 500 standardized, anonymized arbitrator resumes were evaluated using a unified scoring structure. These resumes were scored and classified using two GPT-5 models with different levels of detail in their prompts. The results of these models were then compared with expert evaluations to assess their ability to replicate human decision-making patterns in resume evaluation and classification. According to the results, the second model, with a high level of detail in its prompt structure, achieved an accuracy of 84%, while the first model, with a concise prompt that provides only a brief description of the experts’ expectations, achieved an overall accuracy of 53%. These results indicate that the accuracy of the LLM-assisted resume analysis framework improves when it is guided by a detailed, expert-aligned prompt structure. From a research perspective, this study’s results highlight the importance of prompt engineering in AI-assisted decision-support systems for professional evaluation tasks. Since this framework is limited to resumes in English, future research should examine the effectiveness of LLMs in evaluating and classifying resumes in languages other than English. Moreover, future studies might consider replicating this study using other large language models to compare precision and accuracy across different LLMs.

1. Introduction

The global construction industry, which forms the backbone of modern civilization, is currently plagued by escalating high-value conflicts; the average value of construction disputes in North America reached an all-time high of USD 60.1 million in 2024, a significant increase over 2023’s value (USD 43.0 million) [1]. This vulnerability stems from inherent characteristics of the construction sector, including persistent budget pressures, tight schedule constraints, and frequent contractual misinterpretations, which are further intensified in complex, international projects involving intricate documentation, diverse governing laws, and cultural differences [2,3,4]. The consequences are severe, often resulting in time-consuming resolutions, significant cost overruns, and fractured professional relationships [5,6]. Consequently, such disputes must be resolved as efficiently as possible to mitigate these problems. Alternative dispute resolution (ADR) methods, which include negotiation, mediation, arbitration, mini-trials, and the like, are commonly employed to settle disputes successfully while preserving professional relationships. A large body of research has been published on determining the most efficient ADR method over litigation by comparing ADR methods based on the main cause and the type of dispute [7].
According to the literature, arbitration is a widely used dispute resolution method in which parties agree to resolve conflicts through a neutral third party, known as an arbitrator, instead of pursuing court litigation. This process involves a neutral arbitrator or panel that reviews evidence, listens to both sides, and makes a binding or non-binding decision based on the facts and relevant contractual terms. Particularly common in construction contracts, arbitration offers a more formal and efficient alternative to litigation, frequently proving faster and more flexible for resolving complex construction disputes. Moreover, unlike litigation, arbitration does not follow the rigid rules of evidence, implying that the arbitrators are free to accept any evidence and give it whatever weight they perceive appropriate [8]. On the other hand, arbitration can be ineffective due to the limited expertise of tribunals (arbitrators), which often does not align with the complexity of the cases they handle [9]. Tribunal competency plays a major role in establishing discipline in the dispute settlement process [10,11]. The analysis of cases reveals that most moderately complex cases have been addressed by tribunals of moderate competence, whereas the least complex cases have been assigned to highly competent tribunals. These findings suggest shortcomings in the appointment process and a potential mismatch between the complexity of the cases and the jurisdiction of the tribunals [12]. Considering that arbitrator appointments and changes are the main reasons for delays in the arbitration process [13], there is a need for a new approach that assists disputant parties in designating the most suitable arbitrator and creating an effective panel to get the most out of the process.
Beyond issues related to individual arbitrator competence and appointment practices, arbitration effectiveness is also shaped by broader considerations regarding how tribunals are constituted. Research indicates that race, ethnicity, and culture significantly impact jury decision-making. Jurors from diverse backgrounds bring unique attitudes, beliefs, and values that can shape their trial judgments and approaches to service. These differences influence how jurors assess evidence and make decisions, often resulting in a broader range of perspectives. Jury members represent their communities better and contribute to more comprehensive and inclusive deliberations when they come from various backgrounds [14,15]. Because the chance of finding the best arbitrator for a dispute file increases with the number of possible candidates, this approach must incorporate a vast number of arbitrator profiles, which raises another problem. The manual screening of arbitrator profiles, which may number in the hundreds to ensure diversity and competence against the dispute documentation, is time-consuming and prone to human bias, often leading to delays and non-optimal panel composition [13,16]. Therefore, a scalable, unbiased, and objective approach is urgently needed to maximize the benefits of construction arbitration. Such an approach should be capable of systematically screening and analyzing a large number of arbitrator resumes based on dispute-specific characteristics and recommending the most suitable arbitrator for the conflict at hand.
Large language models (LLMs), particularly generative pre-trained transformer (GPT) architectures, have achieved substantial advancements in natural language processing and artificial intelligence, making them particularly well-suited for large-scale text-intensive tasks such as resume analysis and decision support. This has made them a significant area of research due to their potential for application across a variety of fields [17]. Numerous studies have proposed definitions of AI that differ only in their details [18,19,20,21,22,23,24,25,26]. In general, AI refers to developing systems or machines that can interpret data, learn from it, and adapt their actions to achieve specific goals. By simulating human intelligence, AI enables machines to perform tasks traditionally performed by humans, ranging from decision-making to problem-solving, with minimal intervention. AI combines theories and techniques to create adaptable, goal-oriented technologies for various applications in business and society.
The importance of arbitration and the significant capabilities of AI have persuaded researchers to conduct studies to understand the role of AI during arbitration processes. Moreover, arbitration may be the most appropriate method for introducing AI innovations in dispute resolution since it takes place in a regulated and defined setting where disputing parties can voluntarily agree (or not) to use such systems. In addition, professional associations and arbitral institutions can control the scope and applications of AI [27]. Incorporating AI into arbitration can accelerate proceedings and enhance data analysis accuracy. However, challenges such as potential errors and AI’s inability to understand the legal nuances and human factors persist. This highlights the necessity of clear regulations and standards to address these issues and uphold fairness and legal certainty.
Implementing AI capabilities in arbitrator resume analysis provides multiple advantages. The first benefit is reduced subjectivity. LLMs have the capacity to mitigate the potential for human bias by evaluating and comparing profiles against objective, predefined criteria rather than relying on subjective preferences [28]. This capability is crucial in promoting fairness and diversity in the selection process, providing a more objective assessment for all candidates, including less-experienced arbitrators who may possess specialized technical skills. Second, AI has demonstrated the ability to substantially accelerate recruitment processes, which is driving its growing adoption in business [29,30,31,32].
Much literature has been published on the use of AI in resume creation and analysis [33,34,35,36]. For instance, Gan et al. [37] introduced a novel LLM-based agent framework for resume screening to enhance efficiency and time management in recruitment processes and reported that the automated resume screening model is 11 times faster than manual methods. This efficiency is vital in arbitrator selection, which is often a source of critical delays [13]. In addition, these models can analyze a wide range of data related to candidates based on the parties’ predefined criteria. This prevents disregarding even a small qualification in a candidate’s resume that may occur via traditional screening. Zinjad et al. [38] explored the use of advanced language models, such as GEMINI and GPT-4, to create tailored resumes based on a given job description and an applicant’s original resume. To evaluate the accuracy and quality of the created resumes, this study focused on how well the resume is in line with the job description and how efficiently it can maintain the original content from the applicant’s resume. Moreover, Heakl et al. [39] compared the accuracy of a variety of large language models with that of traditional machine-learning techniques in terms of resume classification. Their results revealed that LLMs are extremely trustworthy and accurate (an accuracy rate of >90%). Likewise, using LLMs, Rithani and Venkatakrishnan [40] examined how well resumes can be classified into seven groups. First, the resumes were labeled, and then LLMs were tasked with identifying the function of each candidate. Next, metrics, such as F1 score, accuracy, and other pertinent measurements, were employed to assess the models’ performance.
Despite these demonstrated advantages, the application of large language models in resume analysis and arbitrator selection is not without challenges and limitations. One such risk is AI hallucination: hallucinations happen when models generate responses or conclusions that seem credible but are actually invented [27]. Magesh et al. describe hallucinations as ‘the tendency of AI systems to produce results that are clearly false or misleading’ [41]. In this study, hallucinations could affect the results of the proposed approach, as the models may generate incorrect or misleading information, potentially affecting arbitrator recommendations. Moreover, misclassification or biased recommendations raise accountability concerns, necessitating clear responsibility frameworks and auditability.
To address these limitations, human oversight remains essential. A hybrid AI–human approach to resume evaluation increases efficiency and objectivity by combining AI’s analytical capabilities with expert judgment and authority. This approach supports reasoned and equitable decision-making while requiring ongoing human monitoring and accountability [28,42,43]. However, how this hybrid AI–human approach can be practically applied and empirically examined in arbitrator resume evaluation remains largely unexplored.
In practice, arbitrator selection in construction disputes involves evaluating complex professional profiles against case-dependent criteria, such as dispute type, project scale, contractual framework, and legal–technical expertise. To date, there is limited empirical research investigating whether large language models can replicate expert evaluation patterns in this context by systematically screening and scoring human arbitrator resumes using expert-defined criteria and validation against expert judgments.
Addressing this gap is essential, as shortcomings in arbitrator appointment have been shown to directly affect arbitration efficiency, duration, and outcome quality. Therefore, this study aims to bridge this gap by proposing and empirically validating an AI-driven, expert-calibrated framework for arbitrator selection in construction disputes. The proposed approach evaluates whether LLM-empowered GPT models can support arbitrator appointment by replicating expert-based resume assessment while preserving human oversight. By doing so, the study seeks to reduce the time and effort required for arbitrator designation and to enhance objectivity and consistency in tribunal formation. From a research perspective, this study contributes to the emerging literature on AI-assisted professional decision support by examining how large language models can replicate expert judgment when constrained by structured, domain-specific evaluation frameworks.

2. Methods

2.1. Data Acquisition and Preprocessing

The data acquisition process involved collecting arbitrator resumes from publicly available arbitration institution websites, including the American Arbitration Association and the Singapore International Arbitration Center [44,45]. A total of 589 resumes were analyzed, belonging to 498 (approximately 84.5%) male and 91 (15.5%) female arbitrators. This notable gender imbalance reflects the current demographics of the arbitration profession, particularly within the construction industry, where male arbitrators significantly outnumber their female counterparts. Such disparities are consistent with broader industry trends and highlight ongoing challenges related to gender representation in this field. More precisely, the dataset encompassed profiles from arbitrators originating from 41 countries, with the majority based in Singapore (134), followed by the United States of America (83) and the United Kingdom (74). This geographic diversity underscores the global nature of arbitration in the construction sector, with significant representation from Asia, North America, and Europe (Figure 1).

2.2. Resume Extraction and Structuring Workflow

All resumes were obtained in PDF format and stored in a centralized repository for preprocessing. Given the heterogeneity of document layouts and formatting styles across institutions, a structured preprocessing pipeline was designed to ensure consistency and reproducibility prior to analysis.
The collected resumes varied substantially in terms of length, structure, level of detail, and formatting quality. Some resumes provided comprehensive descriptions of arbitration experience and project involvement, while others contained limited or highly condensed information. To mitigate bias arising from such structural variability, inclusion in the final dataset was based solely on the availability of substantive content relevant to the evaluation criteria rather than resume length, formatting style, or presentation quality.
During preprocessing, resumes were screened for duplication, incomplete content, and extraction errors. Duplicate profiles appearing across multiple institutional listings were identified using name matching and professional background comparison, and retained only once. Resumes with insufficient or ambiguous information that prevented reliable evaluation were excluded from further analysis. Optical character recognition (OCR) errors and irregular formatting issues, such as misaligned text, fragmented headings, or missing fields, were addressed through automated cleaning routines and manual verification where necessary to ensure data integrity.
Resume extraction was performed using Python (version 3.11.2)-based PDF processing libraries, which enabled text extraction based on layout features such as font size, bold formatting, and positional hierarchy. Bullet points, tables, and embedded elements were converted into linear text representations prior to classification. Post-processing steps included heading normalization, removal of non-informative symbols, and alignment of semantically equivalent fields across resumes to produce a standardized structured representation suitable for expert review and model input.
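As a rough sketch of this layout-based extraction logic (the function name, span format, and size threshold below are illustrative assumptions, not the study's actual pipeline), text spans returned by a PDF processing library can be classified into headings and body text and then linearized under each heading:

```python
# Hypothetical span records of the form (text, font_size, is_bold),
# as a PDF processing library might return them after parsing a page.
def linearize(spans, body_size=10.0):
    """Group body text under the most recent layout-detected heading."""
    sections, current = {}, None
    for text, size, bold in spans:
        text = text.strip().rstrip(":")
        if not text:
            continue
        # Simple layout rule: bold or noticeably larger text is a heading
        if bold or size > body_size + 1:
            current = text
            sections.setdefault(current, [])
        elif current is not None:
            # Bullet points and table cells become linear text fragments
            sections[current].append(text)
    return {heading: " ".join(parts) for heading, parts in sections.items()}

spans = [
    ("Arbitration Experience", 14.0, True),
    ("Sole arbitrator, ICC case, 2019", 10.0, False),
    ("Panel member, SIAC case, 2021", 10.0, False),
    ("Education:", 14.0, True),
    ("LLB, University of X", 10.0, False),
]
result = linearize(spans)
```

With real resumes, the span tuples would come from the PDF library's layout output, and the heading normalization and symbol-removal steps described above would follow.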
Afterward, data were extracted under a broad set of headings to capture the full range of information contained in the resumes. The workflow was designed to identify and extract all potential headings, ensuring that no essential information was omitted. As illustrated in Table 1, the extracted data included both professional and personal details.

2.3. Dataset Standardization and Anonymization

Resume content forms the basis for weighting and scoring by decision-makers. However, the collected resumes varied in structure, particularly in how sections were segmented and named, even when the underlying content was comparable. This structural variation posed challenges for consistent evaluation. To address this issue, the resumes were reorganized under a set of standardized main headings to create a homogeneous dataset. This standardization simplified the weighting and scoring process by allowing decision-makers to focus on content rather than differences in terminology or formatting. Accordingly, all equivalent resume sections were grouped into five main categories (Table 2), enabling comparable evaluation across different resume formats and improving the efficiency, accuracy, and reliability of the decision-making process. Table 2 provides a detailed description of these categories.
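A minimal sketch of this grouping step, assuming simple keyword rules (the keyword lists and fallback bucket below are hypothetical illustrations; the actual category definitions are given in Table 2):

```python
# Illustrative keyword rules for mapping heterogeneous resume headings
# onto the five standardized categories (keywords are assumptions).
CATEGORY_KEYWORDS = {
    "Personal Information": ["contact", "personal", "address"],
    "Professional Experience": ["employment", "work history", "career",
                                "professional"],
    "Arbitration Experience": ["arbitration", "adr", "dispute resolution"],
    "Education": ["education", "degree", "academic", "university"],
    "Additional Qualifications and Contributions": [
        "language", "publication", "membership", "award", "qualification"],
}

def standardize_heading(heading):
    """Map a raw resume heading to one of the five standard categories."""
    h = heading.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(k in h for k in keywords):
            return category
    # Unmatched headings fall into the supplementary category
    return "Additional Qualifications and Contributions"
```

For example, `standardize_heading("Work History")` returns "Professional Experience", so semantically equivalent sections end up under the same heading regardless of the source resume's terminology.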
Figure 2 presents the overall workflow of data acquisition, extraction, standardization, and preparation for expert and model-based evaluation.
The following subsections describe the scope and role of each category as used in the structured dataset and subsequent expert and model-based evaluations.
Personal Information: This section contains crucial details that determine and provide context about each individual. It encompasses the individuals’ full names and surnames, functioning as basic identifiers, in addition to their phone numbers, email addresses, and, if possible, fax numbers, warranting successful communication. The individuals’ addresses and current residence places are enclosed for geographical context. By integrating these basic elements, this section guarantees that key determining and contact data are consistently documented and easily obtainable for evaluation and referencing.
Professional Experience: The Professional Experience section thoroughly explains the individuals’ careers, highlighting the breadth and depth of their expertise across various roles and industries. Moreover, it delineates the extent of their work, emphasizing engagement in sectors such as commercial litigation, construction law, and engineering. It normally encompasses descriptions of essential projects or cases, often noting their level of complexity and scale in order to represent individuals’ capacities for dealing with critical issues. Furthermore, it emphasizes their expertise in dispute resolution, contract management, and technical knowledge, representing a multidisciplinary approach. This section also focuses on thought leadership and geographical reach, demonstrating contributions to cross-border projects and international forums. The section is an all-inclusive repository of individuals’ capabilities and professional achievements, providing a powerful premise for assessing their usefulness and qualifications for particular tasks or roles.
Arbitration Experience: This section thoroughly explains individuals’ engagement in arbitration and alternative dispute resolution mechanisms. It captures the individuals’ extensive experience in a variety of capacities, encompassing their contributions as sole arbitrators, chairpersons, and panel members, on cases managed by leading institutions (e.g., American Arbitration Association, Singapore International Arbitration Center, the International Chamber of Commerce, Hong Kong International Arbitration Center, and London Court of International Arbitration). The section represents individuals’ levels of familiarity with various global legal systems and arbitration seats, highlighting their adaptability to cross-boundary disputes. In addition, the cases underlined in this section span a variety of industries, such as energy, construction, and commercial sectors, representing individuals’ expertise in coping with intricate, critical disputes. Furthermore, membership in highly regarded professional associations and arbitration panels indicates the recognition of their impartiality and qualifications. This section also emphasizes individuals’ technical acumen in managing a wider range of issues, including cost overruns, delays, and contract interpretation, along with their roles in thought leadership through conferences and seminars. Overall, the Arbitration Experience section characterizes individuals as experienced experts in the sector, capable of navigating complicated arbitration proceedings across jurisdictions.
Education: The education section explains individuals’ professional certifications, academic qualifications, and affiliations. It encompasses fields of study, formal degrees, awarding institutions, and extra qualifications (e.g., certifications and professional licensure). It further underlines memberships in professional organizations and, where applicable, committee participation or leadership contributions. Moreover, it comprehensively summarizes the individuals’ professional development and educational backgrounds, pinpointing their expertise and commitment to their field.
Additional Qualifications and Contributions: This section emphasizes the supplementary dimensions of individuals’ profiles that improve their personal and professional capabilities. It contains details related to individuals’ levels of language proficiency, representing the number of languages they can speak and their level of fluency, which is notably valuable in international arbitration-related contexts. It further focuses on speaking engagements and publications, indicating individuals’ expertise, thought leadership, and roles in professional discourse via authored works and contributions to industry gatherings. In addition, recommendations presented by clients or peers are included, providing external validation for individuals’ skills, accomplishments, and professionalism. Finally, this section presents a comprehensive view of individuals’ qualifications beyond their formal education and professional experience, highlighting their capabilities to contribute to their fields noticeably.
During the classification phase, potential biases were reduced by establishing a distinct dataset that omitted sections related to personal information (e.g., gender, nationality, name, family name, and current place of residence) from the resumes. This was intended to guarantee an unbiased evaluation process, performed by the GPT model and the expert group, that concentrated solely on experience, arbitration qualifications, and individual achievements.
Resumes were included in the study if they contained sufficient information relevant to the evaluation criteria, particularly arbitration experience, professional experience, education, or additional qualifications. Resumes with insufficient detail or unclear content were excluded from further analysis. As a result, 500 out of 589 resumes were ultimately selected for evaluation. The remaining resumes were excluded due to inadequate or ambiguous information that prevented reliable assessment, and no filtering was performed based on nationality, gender, or institutional affiliation at the resume selection stage. The selected resumes were anonymized by omitting personal information, and a unique four-digit ID was assigned to each of them to facilitate document tracking. This approach enabled an impartial evaluation process that emphasized expertise and qualifications without the influence of personal identifiers.
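A minimal sketch of this anonymization and ID-assignment step, assuming resumes are stored as dictionaries (the field names and seed are hypothetical):

```python
import random

# Identifying fields to drop before evaluation (field names are assumptions)
PERSONAL_FIELDS = {"name", "surname", "gender", "nationality",
                   "residence", "phone", "email", "address"}

def anonymize(resumes, seed=42):
    """Strip personal fields and assign each resume a unique 4-digit ID."""
    rng = random.Random(seed)
    ids = rng.sample(range(1000, 10000), len(resumes))  # unique, 4 digits
    anonymized = []
    for rid, resume in zip(ids, resumes):
        cleaned = {k: v for k, v in resume.items() if k not in PERSONAL_FIELDS}
        cleaned["resume_id"] = f"{rid:04d}"
        anonymized.append(cleaned)
    return anonymized

res = anonymize([
    {"name": "A. Arbitrator", "arbitration_experience": "ICC, SIAC cases"},
    {"name": "B. Arbitrator", "gender": "F", "education": "LLM"},
])
```

The substantive fields survive unchanged, while the unique IDs allow each scored document to be traced back through the workflow without exposing personal identifiers.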

2.4. Expert-Based Weighting Using AHP

The category descriptions above were presented to a panel of five experts. A purposive expert sampling strategy was used to select the expert panel [47,48]. This strategy was chosen because the Analytic Hierarchy Process (AHP) relies on informed expert judgment rather than random or statistically representative samples [49]. Accordingly, experts were deliberately selected based on predefined criteria, including senior-level professional roles, extensive experience in construction projects and construction contracts, and direct involvement in dispute-related or contractual decision-making. Selecting experts with relevant domain knowledge ensured that the pairwise comparisons and resulting AHP weights were based on informed and credible judgments, thereby enhancing the reliability of the weighting process. Table 3 summarizes the professional backgrounds of the selected experts, highlighting their senior-level experience and long-standing involvement in the construction industry.
The Analytic Hierarchy Process (AHP) was selected for criteria weighting because it is specifically designed to generate meaningful and reliable results from relatively small groups of domain experts, rather than large survey-based samples [50,51]. Unlike statistical methods that require large respondent pools, AHP relies on structured pairwise comparisons performed by knowledgeable decision-makers. Consequently, there is no strict minimum sample size prescribed for AHP applications. Previous studies report that expert panel sizes typically range from 4 to 9 participants, while only a limited number of studies employ more than 30 experts [52,53]. Furthermore, methodological reviews indicate that involving more than approximately 7–10 experts may lead to diminishing returns, as additional judgments tend to be repetitive and do not substantially improve decision quality [54]. Based on these considerations, five experts were deemed sufficient for this study, particularly given their extensive senior-level experience and the verification of judgment consistency through acceptable consistency ratios (CR < 0.1) for all pairwise comparison matrices.
The experts were asked to assess the relative significance of each criterion and assign a weight to represent its importance in assessing arbitrators’ qualifications. This process sought to reach a consensus on the most critical criteria for arbitration roles in construction disputes, providing a standard for an informed and unbiased scoring process.
As depicted in Figure 3, the experts compared each criterion against the others and then used the results to create a pairwise comparison matrix. To confirm that their judgments were reasonably valid, consistency ratios (CR) were calculated. All CR values satisfied the accepted threshold (CR < 0.1), confirming acceptable agreement among expert judgments prior to their use in model training and evaluation [55].
The individual weights assigned to each criterion by the five decision-makers are displayed in Figure 4. The results revealed that arbitration experience and professional experience received higher weights among most of the decision-makers, highlighting their perceived significance in assessing arbitrators for construction-associated disputes. For each criterion, the individual priority weights derived from each decision-maker’s normalized pairwise comparison matrix were summed and divided by the number of decision-makers, ensuring that the final weights reflected a balanced synthesis of collective expert judgments.
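The weighting and consistency checks described above follow the standard AHP procedure, which can be sketched as follows. The 4×4 comparison matrix is hypothetical; the priority vector is taken from the principal eigenvector, and the consistency ratio is CR = CI/RI with CI = (λmax − n)/(n − 1):

```python
import numpy as np

# Saaty's random index (RI) for matrix orders 1..10
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
      6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45, 10: 1.49}

def ahp_weights(matrix):
    """Priority weights and consistency ratio for a pairwise matrix."""
    A = np.asarray(matrix, dtype=float)
    n = A.shape[0]
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)
    w = np.abs(eigvecs[:, k].real)
    w /= w.sum()                        # normalized priority vector
    lambda_max = eigvals[k].real
    ci = (lambda_max - n) / (n - 1)     # consistency index
    cr = ci / RI[n] if RI[n] else 0.0   # consistency ratio
    return w, cr

# Hypothetical pairwise comparisons of the four weighted criteria:
# arbitration experience, professional experience, education, additional
M = [[1,   2,   4,   5],
     [1/2, 1,   3,   4],
     [1/4, 1/3, 1,   2],
     [1/5, 1/4, 1/2, 1]]
w, cr = ahp_weights(M)
# Group weights: average each criterion's priority across the experts
# (one priority vector per expert in practice; duplicated here for brevity)
group_w = np.mean([w, w, w, w, w], axis=0)
```

For this example matrix the weights come out near 0.49, 0.31, 0.12, and 0.08, with CR ≈ 0.02, comfortably below the 0.1 threshold used in the study.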

2.5. Real-Case Application

Evaluating the applicability and efficiency of the AI models on actual cases requires an in-depth investigation designed around a real construction-related dispute file. As a result, the study requires a comprehensive claim folder with dispute details that assist the model in shaping the optimum prompt while forming the tribunal panel. In this paper, a comprehensive dispute file was selected from among 102 Illinois circuit court construction cases examined in the studies by Arditi et al. [56,57,58]. This file has more than 40 claim features that constitute the characteristics of the selected circuit court case. As shown in Table 4, these features were grouped into categories based on their nature. Classifying the features into these groups, as explained in Table 5, provides a better understanding of the conflict by breaking down the complexity of legal construction disputes into manageable components.

2.6. Expert Resume Scoring Procedure

Before the beginning of the resume scoring phase, features extracted from the studied dispute case were presented to the expert group. They were then asked to develop a checklist of criteria under each of the four main headings that they had already weighted. Based on the overlap in their responses, a unified checklist was created, as shown in Table 6. Employing a standardized checklist among multiple experts guarantees convergence in results by aligning their evaluations around shared criteria.
The remainder of the Methods section is organized into two phases (Figure 5). During the first phase, the resumes are scored by an expert panel, followed by evaluation using the developed AI models. Two LLM-based models were created to analyze their efficiency in replicating human reasoning (the results obtained from the first phase) during resume screening and candidate classification. The only difference between these two models is the level of detail included in their instructions. The documents, along with the claim folder, extracted features, prepared checklist, and assigned weights, were distributed among the experts to guarantee an efficient and thorough evaluation process. Because evaluating all 500 resumes at once would have been overwhelming and time-consuming, the resumes were given to each expert in five batches of 100 documents at 10-day intervals, helping to preserve the quality of evaluation by decreasing the risk of fatigue influencing their performance.

2.7. Expert and GPT-Based Resume Scoring

In the second phase of the study, GPT models were created to score the resumes according to the given instructions. The models needed a well-designed prompt to replicate the patterns and preferences of human scoring effectively. Notably, the prompt's accuracy and quality directly affect GPT's ability to reproduce human preferences during the scoring and classification stages. As a result, the weights assigned to each criterion (computed via the AHP) and the prepared checklist were incorporated into the structure of the prompts. Figure 6 depicts the instructions presented to the first GPT model used in this study.
The instructions were organized into three main components (role designation, task assignment, and output formation), each serving a particular purpose in directing the model's function. The role designation section clarifies the model's primary responsibility: to act as a virtual expert in evaluating construction arbitration resumes. This section frames the model's behavior and establishes clear expectations for its decision-making. The task assignment section covers the operational details, conveying the weights of the main evaluation criteria and providing a concise yet precise description of the tasks to be performed, ensuring that the model complies with the scoring framework by ordering the criteria by priority. Finally, the output formation section defines the format and structure of the expected results, including score scaling (from 0 to 100) and explicit reasoning tied to the scores. This reasoning component supports transparency and aligns the model's outputs with the expert-based technique by explaining how the evaluation criteria were applied to each resume. Together, these components help ensure that the model's performance is systematic, consistent, and reflective of expert-level judgment.
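The three-part structure can be sketched as a simple prompt builder. The category weights and all wording below are hypothetical placeholders, not the study's actual AHP values or prompt text; the sketch only shows how role designation, task assignment, and output formation compose into one instruction.

```python
# Hypothetical AHP-derived weights for the four categories; the real
# values come from the study's expert AHP elicitation.
WEIGHTS = {
    "arbitration experience": 0.40,
    "professional experience": 0.30,
    "education": 0.20,
    "additional qualifications": 0.10,
}

def build_prompt(resume_text: str) -> str:
    # Role designation: frame the model as a virtual expert evaluator.
    role = ("You are a virtual expert evaluating resumes of candidate "
            "arbitrators for construction disputes.")
    # Task assignment: weighted criteria, ordered by priority.
    task = ("Score the resume on these weighted criteria (highest first):\n"
            + "\n".join(f"- {name} (weight {w:.2f})"
                        for name, w in sorted(WEIGHTS.items(),
                                              key=lambda kv: -kv[1])))
    # Output formation: score scale plus an explicit justification.
    output = ("Return a total score from 0 to 100 and a short justification "
              "explaining how each criterion was applied.")
    return "\n\n".join([role, task, output, f"Resume:\n{resume_text}"])
```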
Another model was generated in parallel with a different level of detail, particularly in the task assignment section. The resulting improved prompt structure is depicted in Figure 7. Both sets of instructions aim to evaluate arbitrator resumes, especially their relevance to construction arbitration cases, but they differ in the level of detail and emphasis shaping the scoring guidelines. The first model provides a general framework, identifying four major weighted categories (arbitration experience, professional experience, education, and additional qualifications) and encouraging the model to focus on the available information without penalizing missing data. It emphasizes construction relevance but leaves significant room for subjective interpretation. In contrast, the second model is much more detailed and prescriptive, reflecting the criteria from the expert checklist as precise prompts. It maintains the same four weighted categories but also offers explicit scoring guidelines, clarifying exactly what constitutes high, moderate, and low scores within each category. It requires the model to assess candidates based on factors such as project value, complexity of disputes, arbitration roles (e.g., Chair), involvement with recognized institutions, and specialized education in construction-related fields. Additionally, the second model enforces stricter, evidence-based scoring by tying justifications explicitly to predefined criteria. Overall, the second model provides a more rigorous, structured, and industry-aligned evaluation process, reducing subjectivity and enhancing consistency across resume assessments compared with the broader, more flexible approach of the first model. After generating the second model, all profiles were scored again to investigate whether the updated instructions improved performance.
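The refined prompt's criterion-specific guidance can be sketched as a scoring guide that spells out high, moderate, and low bands per category and demands cited evidence. The band descriptions below are illustrative placeholders, not the study's actual checklist wording.

```python
# Hypothetical high/moderate/low band definitions for two of the four
# categories; the study's real bands come from the expert checklist.
SCORING_GUIDE = {
    "arbitration experience": {
        "high": "chaired tribunals in high-value, complex construction disputes",
        "moderate": "served as co-arbitrator in mid-sized construction cases",
        "low": "limited or non-construction arbitration exposure",
    },
    "education": {
        "high": "advanced degree in a construction-related field",
        "moderate": "relevant degree without construction specialization",
        "low": "no formal qualification related to the domain",
    },
}

def guideline_block() -> str:
    """Render the bands as explicit, evidence-tied prompt instructions."""
    lines = []
    for category, bands in SCORING_GUIDE.items():
        lines.append(f"{category.title()}:")
        for band in ("high", "moderate", "low"):
            lines.append(f"  {band.capitalize()} score: {bands[band]}.")
        lines.append("Cite resume evidence for the chosen band.")
    return "\n".join(lines)
```

Embedding such bands narrows the model's decision space: instead of judging "relevance" freely, it must match resume evidence against predefined descriptions, which is the mechanism the second model uses to reduce subjectivity.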

2.8. Model Evaluation and Validation

After the resumes were scored by the models, a classification scheme was applied to facilitate validation. The scores were divided into four categories: Class A (85–100), Class B (70–85), Class C (55–70), and Class D (0–55). The scored documents were then grouped according to this classification. To validate the model-generated scores and assess model performance, the classifications produced by the models were compared with the expert-determined classifications. Standard validation metrics were used for this purpose: precision, recall, F1 score, and accuracy (conventional metrics for evaluating the correctness and effectiveness of classification models). Accuracy measures the model's overall correctness by computing the proportion of correctly categorized instances (positives and negatives) out of all instances. It offers a straightforward view of how well the model performs across all groups, but it may not fully capture performance on unbalanced datasets, where precision and recall are more informative. Precision quantifies the proportion of correctly predicted positive instances out of all instances predicted as positive; higher precision indicates that the model makes fewer false positive predictions. Recall (also known as sensitivity or the true positive rate) measures the proportion of correctly predicted positive instances out of all actual positive instances; higher recall indicates that the model captures a larger share of the actual positive cases. The F1 score is the harmonic mean of precision and recall, providing a balanced measure that considers both metrics; a higher F1 score implies that the model balances precision and recall, ensuring robust performance. The formulae of all four metrics are expressed as Equations (1)–(4). The results were compared with the experts' classifications to evaluate the efficiency of the prompts.
Accuracy = Correct classifications / Total classifications = (TP + TN) / (TP + TN + FP + FN)  (1)
Precision = Correctly classified actual positives / Everything classified as positive = TP / (TP + FP)  (2)
Recall = Correctly classified actual positives / All actual positives = TP / (TP + FN)  (3)
F1 = 2 × (Precision × Recall) / (Precision + Recall)  (4)
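The class mapping and Equations (1)–(4) can be sketched together in a few lines. One detail is assumed here: the reported class intervals share endpoints (e.g., 85 appears in both A and B), so the sketch places a boundary score in the higher class; the study's actual boundary convention is not stated.

```python
from collections import Counter

def to_class(score: float) -> str:
    # Assumption: a score equal to a shared boundary (85, 70, 55)
    # is placed in the higher class.
    if score >= 85: return "A"
    if score >= 70: return "B"
    if score >= 55: return "C"
    return "D"

def per_class_metrics(expert, model, labels=("A", "B", "C", "D")):
    """Per-class precision, recall, and F1 plus overall accuracy,
    following Equations (1)-(4)."""
    report = {}
    for c in labels:
        tp = sum(e == c and m == c for e, m in zip(expert, model))
        fp = sum(e != c and m == c for e, m in zip(expert, model))
        fn = sum(e == c and m != c for e, m in zip(expert, model))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": Counter(expert)[c]}
    accuracy = sum(e == m for e, m in zip(expert, model)) / len(expert)
    return report, accuracy
```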

2.9. Methodological Limitations and Risk Control

Potential methodological limitations were explicitly considered during model design and evaluation. These include the risk of large language model hallucination, sensitivity to prompt formulation, and the propagation of bias present in the underlying data. To mitigate these risks, model outputs were constrained through structured prompts, expert-derived weighting schemes, and post hoc validation against expert classifications. The inclusion of expert benchmarking and quantitative performance metrics further reduced the likelihood of unsubstantiated or biased model outputs.

3. Results

3.1. Inter-Rater Reliability of Expert Scoring

As discussed in the methods section, the resumes were divided among five experts for scoring, and each candidate’s final score was calculated as the average of the five expert ratings. To reduce inconsistencies stemming from different evaluative perspectives, the relative importance each expert assigned to the scoring criteria was assessed using the Analytic Hierarchy Process (AHP), and a uniform checklist was provided to guide the evaluation process.
In addition to these standardization efforts, the Intraclass Correlation Coefficient (ICC) [59] was applied after the scores were collected to assess the level of consistency among the experts' evaluations and to ensure inter-rater reliability; the ICC measures the level of agreement between multiple judges scoring the same quantity [60]. The average-measures ICC for the experts' scores in this study is 0.946. Different guidelines exist for interpreting ICC inter-rater agreement. According to Cicchetti [61], an ICC value below 0.40 is considered poor, a value between 0.40 and 0.59 fair, values between 0.60 and 0.74 good, and values between 0.75 and 1.00 excellent. Koo and Li [62] suggest that ICC values below 0.50 indicate poor reliability, values between 0.50 and 0.75 moderate reliability, values between 0.75 and 0.90 good reliability, and values above 0.90 excellent reliability. Under both guidelines, the experts' ratings show excellent reliability. Having established a satisfactory level of inter-rater reliability, the expert-assigned scores can be confidently compared with those generated by the first and second models.
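An average-measures ICC can be computed from a subjects × raters score matrix via the classical two-way ANOVA decomposition. The sketch below implements the two-way random-effects, average-measures form, ICC(2,k); it is assumed, not stated in the text, that the study's "average measure" value corresponds to this variant.

```python
def icc2k(scores):
    """Two-way random-effects, average-measures ICC(2,k) from a
    subjects x raters matrix. Illustrative sketch; the study's exact
    ICC variant is assumed to be this average-measures form."""
    n = len(scores)          # subjects (resumes)
    k = len(scores[0])       # raters (experts)
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # raters
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                 # mean square, rows
    msc = ss_cols / (k - 1)                 # mean square, columns
    mse = ss_err / ((n - 1) * (k - 1))      # mean square, error
    return (msr - mse) / (msr + (msc - mse) / n)
```

Perfectly agreeing raters yield an ICC of 1.0; values near the study's 0.946 arise when experts rank and score resumes almost identically apart from small offsets.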

3.2. Performance of the First (Baseline) Model

The violin plot (Figure 8) illustrates score density distributions for manual scoring and the first model’s scoring. Manual scoring ranges from 40 to 95, with a broader and more uniform spread across this range. The mean score for manual scoring is approximately 76, while the median score is 80, indicating a central tendency slightly above the mid-point of the range. Most manual scores are densely distributed between 70 and 90, reflecting considerable variability. The first model scoring ranges from 45 to 98, with a narrower distribution compared to manual scoring. The mean score for the first model is approximately 86, and the median score is 88, representing a generally higher central tendency. The first model’s scores are more tightly clustered between 70 and 85, showing reduced variability compared to the manual scoring process. The concentrated distribution suggests that the model favors a narrower scoring range, with fewer extreme values and outliers.
Table 7 summarizes the performance metrics for the first model's classification results across the four classes (A, B, C, and D). The model's overall accuracy is 0.53, indicating that 53% of the total classifications align with the expert-defined classifications. Class A exhibits a high precision of 0.95, implying that most profiles predicted as Class A were correct. However, the recall for Class A is 0.48, meaning that only 48% of actual Class A profiles were identified by the model, resulting in an F1 score of 0.64, a moderate balance between precision and recall. In contrast, Class B demonstrates lower precision at 0.31, with a recall of 0.59, reflecting that while the model captures more of the actual Class B profiles, it also produces a substantial number of false positives, leading to an F1 score of 0.41. Class C shows the weakest performance among all classes, with a precision of 0.09 and a recall of 0.42, producing an F1 score of 0.15. This indicates considerable difficulty for the model in accurately predicting this class, likely due to its low representation in the dataset (support = 12). On the other hand, Class D achieves a precision of 0.61 and perfect recall at 1.0, as all actual Class D profiles were correctly classified, resulting in the highest F1 score of 0.76 and a strong balance between precision and recall for this class. The support column highlights the distribution of instances across the classes, with Classes A and C having the highest (346) and lowest (12) representation, respectively.

3.3. Performance of the Second (Refined Prompt) Model

The violin plot (Figure 9) highlights the score distribution for the second model. The scores range from 40 to 98, with a mean of 80 and a median of 82. Most scores are concentrated between 75 and 85, with a narrower spread compared to the manual scoring. The distribution represents a centralized pattern, with fewer outliers and a higher density around the median.
The second model's classification performance across the four classes (A, B, C, and D) is presented in Table 8. With an overall accuracy of 0.84, the model substantially outperforms the first model in alignment with the expert-defined classifications. For Class A, the model detects and classifies resumes precisely (precision of 0.91 and recall of 0.77, yielding an F1 score of 0.83). Class B shows balanced classification capability and strong recall (precision of 0.78, recall of 0.92, and F1 score of 0.84). Class C displays reliable categorization despite its modest support of 54 resumes (precision of 0.86 and recall of 0.89, resulting in an F1 score of 0.87). Although Class D has the smallest support at 28 resumes, the model performs well on it (the highest precision at 0.96, recall of 0.79, and F1 score of 0.86). The second model's robustness is illustrated by its consistent performance across all classes, regardless of support.

3.4. Agreement Between Expert and Model Scores

Although F1-scores and precision/recall are standard metrics for evaluating model performance, they do not fully capture inter-rater agreement or account for class-based confusion. To address this, the Intraclass Correlation Coefficient (ICC) was calculated to measure the degree of reliability between the model-generated scores and the manual scores provided by experts. ICC was computed for the First Model vs. expert scores and the Second Model vs. expert scores. As shown in Table 9, both models demonstrate good (First Model) and excellent (Second Model) agreement with the expert evaluations. However, despite this high correlation, the use of class intervals (A, B, C, and D) in the scoring system reduces classification accuracy, particularly for the First Model. This is because predicted scores that are numerically close to expert scores may still fall into adjacent class intervals, resulting in misclassification and negatively affecting metrics such as F1-score and precision/recall.

3.5. Score Distribution and Pattern Analysis

The line graph and the violin plot in Figure 10 depict differences in the distributions and frequencies of scores across both models and manual scoring. The violin plot emphasizes the variability in scoring patterns, demonstrating that manual scoring spans a wider range with more variability than either model. Manual scoring thus represents a more dynamic evaluation process, reflecting the nuanced distinctions applied by the experts.
By contrast, the first model shows a more concentrated, narrower score distribution, implying decreased variability and an inclination toward higher scores. The second model shows improved consistency with the expert-based scoring patterns and a more balanced distribution, observable in its better approximation and broader range of variability compared with the first model.
The line graph also confirms these observations, displaying that the first model tends to cluster scores at the higher end, leading to a distinct peak near the upper range of scores. Manual scoring preserves a more equally distributed curve, with scores spread across the total range. The frequency curve of the second model is more consistent with manual scoring, indicating improved distribution and balance at different score levels.
The scatter plots (Figure 11) compare the two models with manual scoring and represent the correlation between their scores. For the first model, the data points are less closely aligned with the red line; as a result, the trendline deviates more significantly from the optimum line (which represents perfect alignment with manual scoring). This indicates a weaker correlation between the first model's scores and the manual evaluations.
The second model demonstrates improved alignment with manual scoring. The data points are more closely clustered around the optimum line, and the trendline shows a reduced deviation, reflecting a stronger correlation. This implies that the second model captures variability more effectively and aligns its scoring more closely with expert evaluations.
The scatter plots also reveal distinct patterns in score concentration. The first model shows a tendency to over-predict scores in the upper range. In contrast, the second model achieves a more even distribution of points across the range, with fewer instances of over-prediction, highlighting its improved consistency and accuracy.

4. Discussion

4.1. Interpretation of Model Performance

Score distribution analysis revealed that the first model diverged more from manual scoring, especially in the middle-range scores. The first model's higher frequency of high-score predictions led it to classify more resumes as Classes A and B. Although this tendency enabled the first model to detect many Class A and Class B resumes, it also inflated the number of false positives in the lower-precision classes. This was evident in the classification metrics, which show that the first model tended to give high scores to most profiles (support equals 346 for Class A). The precision of 0.95 for Class A indicates that profiles the model labeled Class A were almost always correct; however, the low recall for Class A (0.48) shows that more than half of the actual Class A resumes were assigned elsewhere (false negatives). The situation was even worse for the first model's other three classes, leading to an overall accuracy of 0.53.
Comparing the frequency of the second model's scores with the manual scores revealed a similar trend, with only slight differences in the middle-range scores. Moreover, the second model's score distribution aligns better with manual scoring than the first model's, as its trendline lies closer to the manual-scoring trendline (Figure 11). The second model's classification report also shows higher performance than the first model's. For instance, the second model reduced the combined number of profiles classified as Classes A and B to 209, compared with 346 and 128, respectively, in the first model, indicating that, in contrast to the first model, the second model discriminates between classes more accurately and does not assign most of the resumes to one particular class.
Overall, the graphs in the results section demonstrate the iterative improvements achieved in the second model compared to the first. While the first model displayed limited variability and over-concentration of higher scores, the second captured a more balanced distribution and improved alignment with manual scoring patterns. The line graph and violin plot showed that the second model’s scores were distributed more realistically, and the scatter plots confirmed stronger correlations with expert evaluations. The achieved accuracy of 84% by the second model in classification underscores the importance of refining the model’s instructions for better expert judgment replication while maintaining consistency and reliability.

4.2. Influence of Prompt Design on Model Behavior

The superior performance of the second model can be attributed primarily to differences in prompt design and instruction granularity rather than model architecture. The first model relied on high-level evaluation guidance, which encouraged broadly positive scoring and resulted in score inflation and reduced class discrimination. In contrast, the second model incorporated detailed, criterion-specific instructions derived directly from the expert checklist, effectively constraining the model’s decision space. This reduced ambiguity during evaluation and encouraged the model to attend to discriminative features such as dispute type, project scale, arbitration role, and institutional exposure, thereby improving class separation and reducing false positive classifications.

4.3. Key Practical Findings

The results of this study are summarized as follows:
  • The high classification accuracy achieved by the second model reveals the model's capability and adaptability when leveraging detailed instructions and well-crafted prompting. The difference in classification accuracy between the two models indicates that detailed instructions that accurately reflect human evaluation preferences enable LLM-based models to support the screening and classification process.
  • Stakeholders need to learn how to use these AI models and state the arbitrator criteria they are looking for in detail. These details could be the arbitrator’s previous dispute type, the value of the dispute, educational background, and the like.
  • Because these models rely on detail to discriminate between candidates, arbitrators should include more detail about their previous records and avoid generic descriptions of their background that could reduce their chance of being selected.

4.4. Research Objective and Relation to Prior Studies

As previously mentioned, to fill the existing gap, the present study determined the role and capability of the LLM-empowered GPT-5 model in the arbitrator selection stage and measured how accurately it can follow human instructions and evaluation criteria during profile analysis.
These findings are consistent with prior studies demonstrating that LLM performance in decision-support tasks is highly sensitive to prompt specificity and domain grounding [63,64]. Previous research in construction dispute prediction and AI-assisted legal decision-making has similarly shown that models guided by structured expert knowledge outperform generic or loosely instructed approaches [65]. The present study extends this literature by empirically demonstrating, within the arbitrator selection context, that aligning prompt design with expert-derived evaluation frameworks significantly enhances both accuracy and interpretability, thereby supplementing existing findings on AI-assisted legal analytics.

4.5. Practical Implications and Human Oversight

Final decision-making authority remains with human experts to ensure appropriate oversight of AI-assisted arbitrator selection. While the proposed models demonstrated strong empirical performance reflected in high agreement with expert evaluations as quantified by classification accuracy and statistically significant Intraclass Correlation Coefficient (ICC) values reported in Section 3.4, these results indicate alignment rather than autonomous decision capability. Accordingly, AI outputs are intended to function as a structured decision-support mechanism rather than a substitute for expert judgment. Consistent with prior research emphasizing the necessity of human oversight in AI-supported professional decision-making [28,42,66], disputing parties may use the model to identify a short list of high-ranked arbitrators. From this list, a final selection can be made based on case-specific considerations that require contextual and experiential assessment.

4.6. Ethical Considerations and Adoption Challenges

From an ethical and practical perspective, the use of LLM-based systems in arbitrator selection raises important considerations. While anonymization and standardized scoring reduce certain forms of bias, LLMs may still amplify latent biases present in training data or professional norms embedded in resume writing practices. Additionally, practical adoption barriers remain, including concerns regarding transparency, explainability, and trust among legal practitioners. For this reason, the proposed system is intended as a decision-support tool rather than a replacement for human judgment, with final selection authority retained by experienced professionals.

4.7. Limitations and Directions for Future Research

Despite its promising results, this study has several methodological limitations. The dataset relies on English-language resumes obtained from major international arbitration institutions, which may limit applicability to jurisdictions where arbitrator profiles are written in other languages or follow different professional conventions. Variability in resume structure and level of detail may also influence model performance, potentially disadvantaging arbitrators with less structured or less detailed profiles despite equivalent expertise. Furthermore, model outputs remain sensitive to prompt design, indicating a dependency on carefully engineered instructions for reliable performance. Finally, while this study employed the Intraclass Correlation Coefficient (ICC) to assess continuous-score agreement between expert evaluations and model outputs, and standard classification metrics (accuracy, precision, recall, and F1-score) to analyze class-level consistency, additional agreement statistics were not explicitly reported. In particular, alternative correlation-based or agreement-oriented measures could further clarify the nature of agreement and disagreement, especially near class boundaries. Incorporating such complementary statistical tests represents a valuable direction for future research and would allow a more granular examination of agreement patterns between expert judgments and AI-based scoring models.
Moreover, well-incorporated features extracted from real construction dispute cases can significantly influence the performance of AI models used for arbitrator selection. However, due to the complex nature of construction claims and the typically large volume of associated data, accurately extracting key dispute features remains a major challenge. To develop a comprehensive AI-assisted arbitrator recruitment framework, future research should focus on automatically incorporating case-specific dispute metadata into the resume scoring process, resulting in more precise and relevant arbitrator matching. Furthermore, integrating legal knowledge graphs or ontologies should be explored to support deeper semantic alignment between dispute characteristics and arbitrator profiles, thereby enhancing the overall quality and reliability of AI-driven decision-making. The findings of this research enable parties to consider a broader range of arbitrator profiles, ensuring a better fit for their disputes. Furthermore, the system's unbiased structure provides arbitrators of all nationalities an equal opportunity to be evaluated solely on their qualifications, fostering fairness and inclusivity in arbitrator selection.

5. Conclusions

This study set out to investigate whether large language models can reliably support arbitrator selection in construction disputes when guided by expert-defined evaluation frameworks. The findings demonstrate that LLM-based models can meaningfully approximate expert judgment when detailed, domain-specific instructions are used. In particular, the second prompting model achieved substantially higher agreement with expert scoring and classification outcomes than the first model (the baseline model), reaching an overall classification accuracy of 84% compared to 53% for the first model, thereby confirming that instruction granularity plays a decisive role in model reliability and discrimination capability.
By integrating expert-derived weighting (AHP), structured resume standardization, and prompt-engineered LLM evaluation, this research addresses a key gap identified in the literature: the lack of transparent, expert-aligned AI frameworks for arbitrator selection in construction disputes. The proposed approach demonstrates how expert reasoning can be explicitly embedded into AI-driven decision-support systems, thereby enhancing interpretability, reproducibility, and trust.
Despite these contributions, several limitations must be acknowledged. The dataset is based on English-language resumes obtained from major international arbitration institutions, which may restrict applicability to other linguistic or institutional contexts. Variability in resume structure and level of detail may also influence model outputs, potentially disadvantaging candidates with less structured profiles. In addition, the performance of the models remains sensitive to prompt design, highlighting a dependency on carefully specified evaluation instructions.
Future research should directly address these limitations by integrating automated extraction of case-specific dispute metadata, enabling the model to adapt scoring logic to dispute type, value, and complexity. Expanding the framework to multilingual and multimodal resume formats would further improve generalizability. From a practical perspective, arbitration institutions could deploy such systems as decision-support tools by validating model outputs against expert panels, implementing regular performance audits, and maintaining human oversight throughout the selection process. This staged validation and monitoring approach would allow institutions to benefit from AI-assisted efficiency while preserving accountability, fairness, and professional judgment.

Author Contributions

Conceptualization, M.M., A.B.C., M.K. and O.B.T.; Methodology, M.M., A.B.C., M.K. and O.B.T.; Validation, M.M. and O.B.T.; Formal analysis, M.M. and O.B.T.; Investigation, M.M.; Resources, M.K.; Data curation, M.M. and O.B.T.; Writing—original draft, M.M.; Writing—review and editing, M.M., A.B.C., M.K. and O.B.T.; Visualization, M.M.; Supervision, A.B.C. and O.B.T.; Project administration, O.B.T.; Funding acquisition, M.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Istanbul Technical University (University Administrative Board Decision of 15 October 2025 (Meeting No. 1471), conducted under the provisions of Article 58(k) of Law No. 2547).

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Arcadis. Construction Disputes in Motion: Speed, Agility and Adapting to Change; Arcadis: Amsterdam, The Netherlands, 2025; Available online: https://media.arcadis.com/-/media/project/arcadiscom/com/expertise/global/contract-solutions/2025/2025-15th-annual-construction-disputes-report-final-19jun25.pdf (accessed on 10 December 2025).
  2. Naji, K.K.; Mansour, M.M.; Gunduz, M. Methods for Modeling and Evaluating Construction Disputes: A Critical Review. IEEE Access 2020, 8, 45641–45652. [Google Scholar] [CrossRef]
  3. Candaş, A.B.; Tokdemir, O.B. Automated Identification of Vagueness in the FIDIC Silver Book Conditions of Contract. J. Constr. Eng. Manag. 2022, 148, 04022007. [Google Scholar] [CrossRef]
  4. Cheung, S.O.; Yiu, T.W. Are Construction Disputes Inevitable? IEEE Trans. Eng. Manag. 2006, 53, 456–470. [Google Scholar] [CrossRef]
  5. Cheung, S.O.; Ng, S.T.; Lam, K.C.; Sin, W.S. A fuzzy sets model for construction dispute evaluation. Constr. Innov. 2001, 1, 117–127. [Google Scholar] [CrossRef]
  6. Cheung, S.O.; Pang, K.H.Y. Anatomy of Construction Disputes. J. Constr. Eng. Manag. 2013, 139, 15–23. [Google Scholar] [CrossRef]
  7. Alrasheed, K.; Khalafallah, A.; AlShaheen, A.; Albader, H. Litigation versus Judicial Arbitration as Binding Dispute Settlement Techniques in Public Construction Projects. J. Leg. Aff. Disput. Resolut. Eng. Constr. 2023, 15, 04522034. [Google Scholar] [CrossRef]
  8. Currie, O.; Robey, R. Winning Strategies in Construction Negotiations, Arbitration, and Litigation; Wiley: New York, NY, USA, 1988; pp. 911–960. [Google Scholar]
  9. Abwunza, A.A.; Peter, T.K.; Muigua, K. Explaining the Ineffectiveness of Construction Arbitration. J. Leg. Aff. Disput. Resolut. Eng. Constr. 2022, 14, 04522009. [Google Scholar] [CrossRef]
  10. Abwunza, A.A.; Peter, T.K.; Muigua, K. Explaining Delays in Construction Arbitration: A Process-Control Model Approach. J. Leg. Aff. Disput. Resolut. Eng. Constr. 2020, 12, 04520003. [Google Scholar] [CrossRef]
  11. Abwunza, A.A.; Peter, T.K.; Muigua, K. Explaining the Effectiveness of Construction Arbitration: An Organizational Justice Perspective. J. Leg. Aff. Disput. Resolut. Eng. Constr. 2021, 13, 04521017. [Google Scholar] [CrossRef]
  12. Abwunza, A.A.; Peter, T.K.; Muigua, K. Explaining time performance of construction arbitration. J. Leg. Aff. Disput. Resolut. Eng. Constr. 2021, 13, 04521001. [Google Scholar] [CrossRef]
  13. Moza, A.; Paul, V.K. Review of the Effectiveness of Arbitration. J. Leg. Aff. Disput. Resolut. Eng. Constr. 2017, 9, 03716002. [Google Scholar] [CrossRef]
  14. Sommers, S.R. On racial diversity and group decision making: Identifying multiple effects of racial composition on jury deliberations. J. Personal. Soc. Psychol. 2006, 90, 597. [Google Scholar] [CrossRef]
  15. Haridi, S.A.F. Towards Greater Gender and Ethnic Diversity in International Arbitration. BCDR Int. Arbitr. Rev. 2015, 2, 305–316. Available online: http://www.kluwerlawonline.com/api/Product/CitationPDFURL?file=Journals\BCDR\BCDR2015016.pdf (accessed on 28 September 2024). [CrossRef]
  16. Candaş, A.B.; Tokdemir, O.B. Automating Coordination Efforts for Reviewing Construction Contracts with Multilabel Text Classification. J. Constr. Eng. Manag. 2022, 148, 04022027. [Google Scholar] [CrossRef]
  17. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  18. Kaplan, A.; Haenlein, M. Rulers of the world, unite! The challenges and opportunities of artificial intelligence. Bus. Horiz. 2020, 63, 37–50. [Google Scholar] [CrossRef]
  19. van Esch, P.; Black, J.S.; Ferolie, J. Marketing AI recruitment: The next phase in job application and selection. Comput. Hum. Behav. 2019, 90, 215–222. [Google Scholar] [CrossRef]
  20. Malik, A.; Budhwar, P.; Srikanth, N. Gig economy, 4IR and artificial intelligence: Rethinking strategic HRM. In Human & Technological Resource Management (HTRM): New Insights into Revolution 4.0; Emerald Publishing Limited: Bingley, UK, 2020; pp. 75–88. [Google Scholar]
  21. Makarius, E.E.; Mukherjee, D.; Fox, J.D.; Fox, A.K. Rising with the machines: A sociotechnical framework for bringing artificial intelligence into the organization. J. Bus. Res. 2020, 120, 262–273. [Google Scholar] [CrossRef]
  22. Schmidt, P.; Biessmann, F.; Teubner, T. Transparency and trust in artificial intelligence systems. J. Decis. Syst. 2020, 29, 260–278. [Google Scholar] [CrossRef]
  23. Wamba-Taguimdje, S.-L.; Wamba, S.F.; Kamdjoug, J.R.K.; Wanko, C.E.T. Influence of artificial intelligence (AI) on firm performance: The business value of AI-based transformation projects. Bus. Process Manag. J. 2020, 26, 1893–1924. [Google Scholar] [CrossRef]
  24. Mikalef, P.; Gupta, M. Artificial intelligence capability: Conceptualization, measurement calibration, and empirical study on its impact on organizational creativity and firm performance. Inf. Manag. 2021, 58, 103434. [Google Scholar] [CrossRef]
  25. Dwivedi, Y.K.; Hughes, L.; Ismagilova, E.; Aarts, G.; Coombs, C.; Crick, T.; Duan, Y.; Dwivedi, R.; Edwards, J.; Eirug, A. Artificial Intelligence (AI): Multidisciplinary perspectives on emerging challenges, opportunities, and agenda for research, practice and policy. Int. J. Inf. Manag. 2021, 57, 101994. [Google Scholar] [CrossRef]
  26. Russell, S.; Norvig, P. Artificial Intelligence: A Modern Approach; Pearson Education: Upper Saddle River, NJ, USA, 2003. [Google Scholar]
  27. Cardoso, A.G.; Chan, E.; Quintão, L.; Pereira, C. Generative Artificial Intelligence and Legal Decisionmaking. Glob. Trade Cust. J. 2024, 19, 710–730. [Google Scholar] [CrossRef]
  28. Rathnayake, C.; Gunawardana, A. The role of generative AI in enhancing human resource management: Recruitment, training, and performance evaluation perspectives. Int. J. Soc. Anal. 2023, 8, 13–22. [Google Scholar]
  29. Suryanarayana, R.M.; Babu, M.N.; Yamuna, G.; Madhavi, T.; Bizon, C.C.; Bizon, N.; Thounthong, P. Hybridizing Technology Management and Knowledge Management to Spur Innovation: A System Dynamics Approach. Organ. Mark. Emerg. Econ. 2023, 14, 696–720. [Google Scholar] [CrossRef]
  30. Madhavi, T.; Mehrotra, R. An Effective Mapping of Competencies for Sustainable Growth. J. Adv. Res. Dyn. Control Syst. 2019, 11, 49–56. [Google Scholar]
  31. Madhavi, T.; Mehrotra, R. Competency mapping of sales employees in pharmaceutical industry-a blue print for future. Int. J. Pharm. Res. 2019, 11, 207. [Google Scholar]
  32. Madhavi, T.; Sharma, S.; Joshi, A. Exploring the Factors that Contribute to Psychological Wellbeing in Successful Entrepreneurs. J. Reatt. Ther. Dev. Divers. 2023, 6, 195–203. [Google Scholar]
  33. Abisha, D.; Keerthana, S.; Navedha Evanjalin, R.; Kavitha, K.; Jothi Mary, S.; Ramya, R. Resspar: AI-Driven Resume Parsing and Recruitment System using NLP and Generative AI. In Proceedings of the 2024 Second International Conference on Intelligent Cyber Physical Systems and Internet of Things (ICoICI), Coimbatore, India, 28–30 August 2024; pp. 1–6. [Google Scholar] [CrossRef]
  34. Bhalke, V.D.; Satanure, D.S.; Ghabade, P.S.; Vedalankar, A. AI-Driven Transformation of HR: Enhancing Recruitment, Training, and Employee Retention. In Proceedings of the 2024 4th Asian Conference on Innovation in Technology (ASIANCON), Pimpri Chinchwad, India, 23–25 August 2024; pp. 1–5. [Google Scholar]
  35. Rajput, A.; Dubey, A.; Thakur, R.; Singh, D.; Singh, U.P. Career Craft AI: A Personalized Resume Analysis and Job Recommendations System. In Proceedings of the 2024 1st International Conference on Innovative Sustainable Technologies for Energy, Mechatronics, and Smart Systems (ISTEMS), Dehradun, India, 26–27 April 2024; pp. 1–6. [Google Scholar] [CrossRef]
  36. Chou, Y.C.; Yu, H.Y. Based on the application of AI technology in resume analysis and job recommendation. In Proceedings of the 2020 IEEE International Conference on Computational Electromagnetics (ICCEM), Singapore, 24–26 August 2020; pp. 291–296. [Google Scholar] [CrossRef]
  37. Gan, C.; Zhang, Q.; Mori, T. Application of LLM agents in recruitment: A novel framework for resume screening. arXiv 2024, arXiv:2401.08315. [Google Scholar] [CrossRef]
  38. Zinjad, S.B.; Bhattacharjee, A.; Bhilegaonkar, A.; Liu, H. ResumeFlow: An LLM-facilitated Pipeline for Personalized Resume Generation and Refinement. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 2781–2785. [Google Scholar] [CrossRef]
  39. Heakl, A.; Mohamed, Y.; Mohamed, N.; Sharkaway, A.; Zaky, A. ResumeAtlas: Revisiting Resume Classification with Large-Scale Datasets and Large Language Models. arXiv 2024, arXiv:2406.18125. [Google Scholar] [CrossRef]
  40. Rithani, M.; Venkatakrishnan, R. Empirical Evaluation of Large Language Models in Resume Classification. In Proceedings of the 2024 Fourth International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), Bhilai, India, 11–12 January 2024; pp. 1–4. [Google Scholar]
  41. Magesh, V.; Surani, F.; Dahl, M.; Suzgun, M.; Manning, C.D.; Ho, D.E. Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. J. Empir. Leg. Stud. 2025, 22, 216–242. [Google Scholar] [CrossRef]
  42. Stone, D.L.; Deadrick, D.L. Challenges and opportunities affecting the future of human resource management. Hum. Resour. Manag. Rev. 2015, 25, 139–145. [Google Scholar] [CrossRef]
  43. Alhasan, T.K. Integrating AI Into Arbitration: Balancing Efficiency With Fairness and Legal Compliance. Confl. Resolut. Q. 2025, 42, 523–534. [Google Scholar] [CrossRef]
  44. Singapore International Arbitration Centre. Available online: https://siac.org.sg/siac-panel-directory (accessed on 2 November 2024).
  45. American Arbitration Association. Available online: https://constructionmegapanel.adrapps.org/ (accessed on 30 October 2024).
  46. Google. Nano Banana: Gemini AI Image Generation and Editing. Available online: https://gemini.google/overview/image-generation/ (accessed on 13 December 2025).
  47. Etikan, I.; Musa, S.A.; Alkassim, R.S. Comparison of convenience sampling and purposive sampling. Am. J. Theor. Appl. Stat. 2016, 5, 1–4. [Google Scholar] [CrossRef]
  48. Sajadian, M.; Khoshbakht, K.; Liaghati, H.; Veisi, H.; Mahdavi Damghani, A. Developing and quantifying indicators of organic farming using analytic hierarchy process. Ecol. Indic. 2017, 83, 103–111. [Google Scholar] [CrossRef]
  49. Saaty, T.L. Decision making with the analytic hierarchy process. Int. J. Serv. Sci. 2008, 1, 83–98. [Google Scholar] [CrossRef]
  50. Abduh, M.; Omar, M.A. Islamic-bank selection criteria in Malaysia: An AHP approach. Bus. Intell. J. 2012, 5, 271–281. [Google Scholar]
  51. Melillo, P.; Pecchia, L. What is the appropriate sample size to run analytic hierarchy process in a survey-based research. In Proceedings of the International Symposium on the Analytic Hierarchy Process, London, UK, 4–7 August 2016; pp. 4–8. [Google Scholar]
  52. Şahin, M.; Yurdugül, H. A content analysis study on the use of analytic hierarchy process in educational studies. Eğitimde Psikolojide Ölçme Değerlendirme Derg. 2018, 9, 376–392. [Google Scholar] [CrossRef]
  53. Darko, A.; Chan, A.P.C.; Ameyaw, E.E.; Owusu, E.K.; Pärn, E.; Edwards, D.J. Review of application of analytic hierarchy process (AHP) in construction. Int. J. Constr. Manag. 2019, 19, 436–452. [Google Scholar] [CrossRef]
  54. Hickman, L.; Longman, C.; Barker, B.; Williams, P. Case Method: Business Interviewing; Addison-Wesley: Boston, MA, USA, 1994. [Google Scholar]
  55. Saaty, T.L.; Ozdemir, M.S. Why the magic number seven plus or minus two. Math. Comput. Model. 2003, 38, 233–244. [Google Scholar] [CrossRef]
  56. Arditi, D.; Oksay, F.E.; Tokdemir, O.B. Predicting the outcome of construction litigation using neural networks. Comput.-Aided Civ. Infrastruct. Eng. 1998, 13, 75–81. [Google Scholar] [CrossRef]
  57. Arditi, D.; Tokdemir, O.B. Using case-based reasoning to predict the outcome of construction litigation. Comput.-Aided Civ. Infrastruct. Eng. 1999, 14, 385–393. [Google Scholar] [CrossRef]
  58. Arditi, D.; Tokdemir, O.B. Comparison of case-based reasoning and artificial neural networks. J. Comput. Civ. Eng. 1999, 13, 162–169. [Google Scholar] [CrossRef]
  59. Koch, G.G. Intraclass correlation coefficient. Encycl. Stat. Sci. 2004. [Google Scholar] [CrossRef]
  60. Shrout, P.E.; Fleiss, J.L. Intraclass correlations: Uses in assessing rater reliability. Psychol. Bull. 1979, 86, 420. [Google Scholar] [CrossRef]
  61. Cicchetti, D.V. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol. Assess. 1994, 6, 284. [Google Scholar] [CrossRef]
  62. Koo, T.K.; Li, M.Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 2016, 15, 155–163. [Google Scholar] [CrossRef]
  63. Gonen, H.; Iyer, S.; Blevins, T.; Smith, N.A.; Zettlemoyer, L. Demystifying prompts in language models via perplexity estimation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 10136–10148. [Google Scholar]
  64. Stephenson, M.; Sidji, M.; Ronval, B. Codenames as a Benchmark for Large Language Models. IEEE Trans. Games 2025, 1–12. [Google Scholar] [CrossRef]
  65. Wang, P.; Huang, Y.; Zhu, J.; Shan, M. Construction Dispute Potentials: Mechanism versus Empiricism in Artificial Neural Networks. Sustainability 2022, 14, 15239. [Google Scholar] [CrossRef]
  66. Sulaeman, M.M.; Nurcholidah, L.; Handayati, R.; Wibowo, S.N. Optimizing employee recruitment process in the education business through chatgpt implementation. Technopreneurship Educ. Dev. Rev. (TENDER) 2024, 1, 14–20. [Google Scholar] [CrossRef]
Figure 1. Global distribution of arbitrators included in the dataset.
Figure 2. Overview of the data collection and processing workflow (created using Gemini AI image generation [46]).
Figure 3. Expert-based criteria weighting process using the Analytic Hierarchy Process (AHP).
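The AHP weighting step shown in Figures 3 and 4 can be sketched with Saaty's principal-eigenvector method. The 3×3 pairwise comparison matrix below is purely illustrative and does not reproduce the expert panel's actual judgments; the random index value is Saaty's tabulated RI for n = 3.

```python
import numpy as np

# Illustrative pairwise comparison matrix for three criteria
# (hypothetical values, not the expert panel's judgments)
A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])

# Priority weights = normalized principal eigenvector (Saaty's method)
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()

# Consistency check: CR = CI / RI, with RI = 0.58 for n = 3;
# judgments are conventionally accepted when CR < 0.10
n = A.shape[0]
ci = (eigvals.real[k] - n) / (n - 1)
cr = ci / 0.58
```

For the matrix above the weights come out roughly (0.65, 0.23, 0.12) with a consistency ratio well under 0.10, so the illustrative judgments would be accepted as consistent.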
Figure 4. Distribution of criterion weights assigned by experts and their aggregated averages.
Figure 5. Overall methodological framework and sequential flow of the study.
Figure 6. Prompt structure and instruction design of the first (baseline) LLM model.
Figure 7. Detailed prompt structure of the second (refined) LLM model.
Figure 8. Comparison of score distributions between expert manual scoring and the first LLM model.
Figure 9. Comparison of score distributions between expert manual scoring and the second LLM model.
Figure 10. Comparative analysis of score distributions across expert evaluations and both LLM models.
Figure 11. Correlation between expert scores and model-generated scores for both LLM models.
Table 1. Extracted resume headings used to capture arbitrator profile information.

| Categories | | |
|---|---|---|
| Personal information | Educational background | Experience with legal systems |
| Nationality | Bar admission | Publications and speaking engagements |
| Current residence | Panels | Recommendations |
| Gender | Professional experience | Arbitral appointments |
| Career | Arbitration experience | Areas of expertise |
| Current position | Professional associations | Recent cases |
| Language | Experience with seats of arbitration | |
Table 2. Standardized resume structure adopted for uniform evaluation.

| Personal Information | Professional Experience | Arbitration Experience | Education | Additional Qualifications and Contributions |
|---|---|---|---|---|
| Name | Professional experience | Experience with legal systems (globally) | Educational background | Language proficiency |
| Surname | Areas of expertise | Professional licenses | | Publication |
| Email | Current position | Arbitration experience | | Speaking engagement |
| Address | Career | Professional associations | | Recommendations |
| Phone number | | Arbitral appointments | | |
| Fax | | Recent cases | | |
| Nationality | | Bar admissions | | |
| Current residence | | Panels | | |
| Gender | | | | |
Table 3. Professional background and experience profile of the expert panel.

| Expert No. | Construction Industry Experience (Years) | Construction Contract Experience (Years) |
|---|---|---|
| 1 | 40 | 22 |
| 2 | 31 | 15 |
| 3 | 25 | 18 |
| 4 | 20 | 12 |
| 5 | 23 | 7 |
Table 4. Features extracted from the dispute case.

Claim Features

| Parties Involved | Delay and Acceleration Claims | Contract Characteristics | Scope and Changes in Work |
|---|---|---|---|
| Status of parties involved | Compensable acceleration | Type of contract | Directed changes |
| Type of plaintiff | Noncompensable acceleration | Contract value | Constructive changes |
| Type of defendant | Compensable delay | Type of designer used | Radical changes in scope |
| Type of counterplaintiff | Excusable delay | The provision of the contract involved | Misrepresentation of the site |
| Type of counterdefendant | Nonexcusable delay | Legal interpretation of contract documents | Unknown site conditions |
| Third-party plaintiff | Concurrent delay | Legal interpretation of drawings and specifications | Conditions discovered in pre-bid site exploration |
| Third-party defendant | CPM involved (Critical Path Method) | Surety bonds | |
| Supplier has a contract with whom | Contractor coordination | | |
| Subcontract involved | | | |
| Surety assured | | | |

| Legal Process & Resolution | Material and Equipment Issues | Supervision and Performance | Damages and Payment Issues |
|---|---|---|---|
| Any post-trial filed | Claim for material and equipment involved | Misrepresentation of supervision | Liquidated damages involved |
| Resolution technique involved/used (e.g., litigation, arbitration, mediation) | Alternative material and equipment used | Technical testimony (quality of work performed) | Measure of damages |
| The estoppel doctrine involved | Installation requirements satisfied | | Non or late payment |
| The lien case involved | | | |
Table 5. Conceptual grouping and explanation of dispute feature categories.

| Dispute Case Element | Explanation |
|---|---|
| Parties Involved | Identifies the roles and relationships of the participants in the dispute. |
| Delay and Acceleration Claims | Addresses scheduling disputes, reasons for delays, and responsibility for acceleration or lost time. |
| Contract Characteristics | Examines the form, terms, and interpretation of the contract governing the project. |
| Scope and Changes in Work | Involves changes to the original contract work and disputes about whether such changes were valid or foreseeable. |
| Legal Process and Resolution | Focuses on procedural aspects and legal doctrines affecting how the dispute is resolved. |
| Material and Equipment Issues | Deals with the adequacy, substitution, or installation of materials and equipment used in the project. |
| Supervision and Performance | Covers claims about the quality of work and the sufficiency of supervision or oversight provided during construction. |
| Damages and Payment Issues | Relates to financial compensation, enforcement of payment obligations, and monetary losses suffered by either party. |
Table 6. Expert evaluation checklist used for resume scoring.

| Arbitration Experience | Yes/No | Professional Experience | Yes/No |
|---|---|---|---|
| Has handled construction arbitration cases | | Direct industry experience (e.g., engineer, contractor, project manager, construction lawyer) | |
| Experience with delays, acceleration, or schedule-related claims | | Experience with large-scale or high-value projects | |
| Experience with contract interpretation, change orders, site conditions, or payment | | Involvement in cross-border or international construction projects | |
| Familiarity with your contract type (e.g., FIDIC, EPC, NEC, etc.) | | Familiarity with subcontracting, coordination, and supply chain disputes | |
| Experience with relevant legal system or seat of arbitration (e.g., common law/civil law) | | Hands-on understanding of technical documents and specifications | |
| Experience in multiparty or complex infrastructure disputes | | Has held senior or decision-making roles in relevant organizations | |
| Recent arbitral appointments more than 5 years | | | |
| Panel memberships (e.g., ICC, LCIA, DRBF) | | | |
| Bar admission in a relevant jurisdiction | | | |

| Education | Yes/No | Additional Qualifications and Contributions | Yes/No |
|---|---|---|---|
| Degree in law or related to construction (e.g., civil engineering, architecture) | | Fluent in the language of arbitration | |
| Advanced/specialized education in arbitration or construction law | | Publications or presentations on construction arbitration topics | |
| Cross-disciplinary education (both legal and technical background) | | Recognized authority or contributor in construction law/arbitration forums | |
| Education from reputable or globally recognized institutions | | Positive peer/client recommendations or known reputation for impartiality and diligence | |
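A yes/no checklist of this kind can be turned into a single numeric score by combining the per-category yes-ratios with category weights (for example, the AHP-derived weights). The weights, category names, and answers in the sketch below are hypothetical placeholders, not the paper's actual values or scoring rule.

```python
# Hypothetical category weights (e.g., from an AHP step) and one candidate's
# checklist answers; neither reflects the study's actual values.
weights = {
    "arbitration_experience": 0.40,
    "professional_experience": 0.30,
    "education": 0.15,
    "additional_qualifications": 0.15,
}

# True = "Yes" for each criterion in the category, in checklist order
answers = {
    "arbitration_experience": [True, True, False, True, True, False, True, True, True],
    "professional_experience": [True, True, True, False, True, True],
    "education": [True, True, False, True],
    "additional_qualifications": [True, False, True, True],
}

def checklist_score(weights, answers):
    """Weighted score in [0, 100]: each category's yes-ratio times its weight."""
    return 100 * sum(
        weights[cat] * (sum(ans) / len(ans)) for cat, ans in answers.items()
    )

score = checklist_score(weights, answers)  # ~78.6 for the answers above
```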
Table 7. Classification performance metrics of the first (baseline) LLM model.

| Classification | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| Class A | 0.95 | 0.48 | 0.64 | 346 |
| Class B | 0.31 | 0.59 | 0.41 | 128 |
| Class C | 0.09 | 0.42 | 0.15 | 12 |
| Class D | 0.61 | 1.00 | 0.76 | 14 |
| Accuracy | | | 0.53 | 500 |
Table 8. Classification performance metrics of the second (refined prompt) LLM model.

| Classification | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| Class A | 0.91 | 0.77 | 0.83 | 209 |
| Class B | 0.78 | 0.92 | 0.84 | 209 |
| Class C | 0.86 | 0.89 | 0.87 | 54 |
| Class D | 0.96 | 0.79 | 0.86 | 28 |
| Accuracy | | | 0.84 | 500 |
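The per-class precision, recall, and F1 reported for the two models follow the standard one-vs-rest definitions. A minimal sketch, using toy expert/model labels rather than the study's 500-resume data:

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class from paired label lists."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: expert classifications vs. model classifications for five resumes
y_true = ["A", "A", "B", "B", "C"]
y_pred = ["A", "B", "B", "B", "C"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.8
p_a, r_a, f1_a = per_class_metrics(y_true, y_pred, "A")  # 1.0, 0.5, 0.667
```

Support is simply the number of resumes the experts assigned to each class, which is why the two tables' support columns each sum to 500.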
Table 9. Agreement analysis between expert scores and model-generated scores.

| Metric | First Model vs. Manual Scoring | Second Model vs. Manual Scoring |
|---|---|---|
| ICC Average Measures | 0.724 | 0.941 |
| Accuracy | 0.53 | 0.84 |
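The "ICC Average Measures" values in Table 9 correspond to an average-measures intraclass correlation in the Shrout and Fleiss framework [60]. The from-scratch sketch below implements the two-way random-effects, average-measures form ICC(2,k) as one plausible variant; the exact ICC model used in the study is not restated here, and the six expert/model score pairs are hypothetical.

```python
import numpy as np

def icc2k(ratings):
    """Two-way random-effects, average-measures ICC, i.e., ICC(2,k)
    in Shrout & Fleiss's notation. ratings: (n_targets, k_raters) array."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-target (resume) means
    col_means = ratings.mean(axis=0)   # per-rater means
    # Two-way ANOVA sums of squares
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)

# Hypothetical scores for six resumes, rated by an expert and a model
expert = [82, 74, 91, 65, 88, 70]
model = [80, 75, 90, 68, 85, 72]
icc = icc2k(np.column_stack([expert, model]))  # close agreement, ICC near 1
```

With identical rating columns the function returns exactly 1, and it decreases as the model's scores drift from the expert's, which is the behavior the Table 9 comparison relies on.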

Share and Cite

MDPI and ACS Style

Mobadersani, M.; Candas, A.B.; Kuruoğlu, M.; Tokdemir, O.B. AI-Assisted Arbitrator Selection in Construction Disputes: An Expert-Calibrated Large Language Model Framework. Buildings 2026, 16, 120. https://doi.org/10.3390/buildings16010120

