Article

A Multi-Criteria Decision Framework for Enterprise LLM Routing

Faculty of Engineering Management, Poznan University of Technology, Jacka Rychlewskiego 1, 61-131 Poznań, Poland
Information 2026, 17(6), 539; https://doi.org/10.3390/info17060539
Submission received: 7 May 2026 / Revised: 24 May 2026 / Accepted: 26 May 2026 / Published: 1 June 2026
(This article belongs to the Special Issue New Applications in Multiple Criteria Decision Analysis, 3rd Edition)

Abstract

The increasing use of large language models (LLMs) in enterprises creates a need for routing mechanisms that select models according to both technical performance and organizational preferences. This article proposes a multicriteria decision-support framework for enterprise LLM routing that combines AHP-based criterion weighting with SAW-based prompt-level model selection. The framework evaluates prompts according to criteria related to required accuracy, business risk, reasoning depth, cost sensitivity, response-time sensitivity, standardization, and creativity. The empirical evaluation was conducted on 500 heterogeneous business prompts, using GPT-5-nano as the prompt-scoring router, GPT-4o-mini as the cheaper response model, and GPT-5 as the stronger response model. Costs were calculated from actual input and output token counts, including routing overhead. Response sufficiency was assessed using a structured LLM-as-a-judge protocol with three evaluator profiles. The proposed SAW routing variant with confidence margin and risk veto achieved a sufficiency rate of 94.4%, compared with 94.6% for the always-strong strategy and 86.8% for the always-cheap strategy. Relative to always-strong routing, it reduced total cost by 37.4%, with only a 0.2 percentage-point decrease in sufficiency. The framework was also compared with keyword-risk, token-threshold, TF-IDF centroid, logistic-regression, multiplicative-SAW, and TOPSIS baselines. The results indicate that an interpretable multicriteria router can achieve near-strong-model response sufficiency at substantially lower cost while preserving auditability and alignment with enterprise decision criteria.

1. Introduction

The dynamic development of large language models (LLMs) is transforming the way organizations carry out knowledge-based work, communication, information analysis, and decision-support processes. In enterprise environments, these models are increasingly used not only for content generation but also for document summarization, the preparation of managerial reports, customer service support, the drafting of business communication, the analysis of descriptive data, and the performance of tasks requiring synthesis and reasoning. Consequently, the practical problem is no longer limited to the question of whether organizations should use LLMs, but rather concerns how they can use them effectively, responsibly, and economically across different types of business tasks. In particular, enterprises face the need to determine when a less expensive and faster model is sufficient, and when the use of a more advanced, more costly model is justified because it may offer higher-quality responses.
The importance of this problem is growing alongside the expansion of the range of available models and the increasing differentiation of their parameters in terms of cost, response time, reasoning capability, and the quality of generated outputs. In business practice, users rarely operate under conditions in which only a single model is available. Much more often, they have access to a portfolio of models among which choices must be made depending on the nature of the query. This creates a need for prompt-routing mechanisms, understood as rules for directing queries to different models in a manner consistent with organizational priorities. These priorities are not exclusively technical in nature. They encompass trade-offs among response quality, processing cost, speed of operation, the level of error risk, the need for standardization, and the business consequences of an inappropriate response.
Existing approaches to the problem of LLM model selection and routing, however, remain dominated by a technical perspective. The primary focus is most often placed on quality benchmarks, computational efficiency, latency reduction, inference cost optimization, or improvements in system parameters. Although these aspects are important, they do not fully reflect the realities of managing the use of artificial intelligence in organizations. Enterprises do not assess model outputs solely through the prism of average technical quality, but also from the standpoint of their operational usefulness, acceptable level of risk, compliance with organizational requirements, and the economic justification of deployment. From a managerial perspective, the key question, therefore, becomes not so much which model is objectively the best, but rather which model is more appropriate for a specific task under a given profile of organizational preferences.
At this point, a research gap becomes apparent. Existing LLM routing research has developed several technically advanced approaches, including cascading, confidence-based routing, embedding-based similarity matching, learned routing policies, and cost-aware model selection. These methods are valuable for improving the cost–quality trade-off, but they usually treat routing as a predominantly technical optimization problem. Less attention has been paid to routing as an organizational decision-support problem in which the choice of model must be explainable, auditable, and aligned with managerial preferences concerning risk tolerance, response sufficiency, cost constraints, standardization requirements, and operational speed. In enterprise settings, the relevant question is therefore not only whether a router can maximize benchmark performance, but also whether its decisions can be justified in terms that are understandable to organizational stakeholders. This article addresses this gap by proposing an interpretable multicriteria routing framework that translates managerial criteria into computable prompt-level scores and model-selection rules. Under organizational conditions, the fundamental issue is not whether a model generates the best possible response in absolute terms, but whether the response is sufficient for practical application without incurring unjustified costs.
The novelty of this study should therefore be understood in relation to two adjacent but distinct streams of work. The first stream develops technical LLM routers, cascades, learned routing policies, and benchmark-oriented cost–quality optimization methods. These approaches are important because they show that not all prompts require the most capable model and that adaptive routing can reduce inference cost. However, they typically define the routing problem primarily in terms of predictive performance, expected quality, model confidence, or cost. The second stream concerns enterprise AI governance and managerial decision support, where decisions must be explainable, auditable, and aligned with organizational risk tolerance. The present study contributes to the intersection of these streams. It does not claim to outperform learned semantic routers in all benchmark settings. Instead, it proposes an interpretable multicriteria governance layer in which model-selection decisions are explicitly linked to organizational criteria such as business risk, required accuracy, reasoning depth, cost sensitivity, time sensitivity, standardization, and creativity.
The aim of this article is to propose and empirically evaluate a multicriteria decision-support framework for enterprise LLM routing. The framework combines AHP-based elicitation of organizational criterion weights with SAW-based prompt-level model selection. Its purpose is not to replace machine-learning-based semantic routers, but to provide an interpretable and auditable routing layer for enterprise contexts in which model choice must reflect explicit organizational preferences. The empirical evaluation examines whether such a router can approximate the sufficiency level of a stronger model while reducing token-level costs relative to an always-strong strategy and improving response sufficiency relative to an always-cheap strategy.
The contribution of the article is threefold. First, it reframes enterprise LLM routing as a multicriteria organizational decision problem rather than only a technical optimization problem. In this framing, routing is driven not only by expected response quality and cost, but also by explicit managerial preferences concerning risk, response-time sensitivity, standardization, and creativity. Second, it operationalizes these preferences as prompt-level routing criteria and combines expert-derived AHP weights with SAW-based model selection, including a confidence margin and a risk-veto mechanism that reduces the compensatory weakness of simple additive aggregation. Third, it provides an empirical proof-of-concept evaluation on a stratified business-prompt dataset and compares the proposed routing rule with fixed strategies and several heuristic or alternative multicriteria baselines. The contribution is therefore not a universal claim that AHP/SAW is superior to learned routers, but evidence that an auditable multicriteria layer can support cost-aware and managerially interpretable model allocation in enterprise settings.
The empirical part of the study is guided by three research questions. RQ1: Can an interpretable multicriteria router achieve response sufficiency close to the always-strong strategy while reducing token-level costs? RQ2: Does the proposed routing strategy outperform simple heuristic and alternative MCDM baselines in terms of the cost–sufficiency trade-off? RQ3: Are the routing decisions stable under changes in criterion weights, confidence-margin settings, and alternative aggregation methods?
The remainder of the article is structured as follows. The next section presents a review of the literature on the use of LLMs in organizations, model routing, and multicriteria decision support. This is followed by a presentation of the research methodology and the construction of the proposed decision model. The subsequent section discusses the results of the empirical study, while the conclusion offers a discussion of the findings, managerial implications, study limitations, and directions for future research.

2. Literature Review

Large language models (LLMs) are currently conceptualized in the literature as the next phase in the development of AI systems capable of performing a wide range of linguistic, analytical, and creative tasks, with their significance extending beyond classical NLP applications and increasingly encompassing organizational and managerial contexts [1,2]. Studies on the business value of AI emphasize that the impact of these technologies on firm performance does not arise solely from their technical parameters, but from their capacity to reconfigure processes, support decision-making, and enhance organizational efficiency [3,4,5]. More recent literature reviews also stress that generative AI is becoming an important component of business model innovation, value creation, and organizational transformation, while at the same time giving rise to challenges related to implementation, governance, and the assessment of usefulness in business practice [6,7,8].
The significance of this issue is reinforced by empirical studies showing that generative AI can improve the productivity of knowledge workers, although these effects vary markedly across tasks and user groups. Brynjolfsson, Li, and Raymond [9] demonstrated that the use of generative AI in customer service increases productivity by approximately 15% on average, whereas Noy and Zhang [10] identified significant gains in both productivity and quality in writing tasks. In turn, Dell’Acqua et al. [11] describe the phenomenon of the “jagged technological frontier,” indicating that AI may improve performance in some tasks while worsening it in others, even when those tasks appear superficially similar. It is precisely this unevenness of effects that is particularly important from the enterprise perspective: it suggests that not every prompt and not every task should be handled by the same model or by the same AI usage strategy [9].
In parallel, a stream of research has been developing that emphasizes the importance of governance, accountability, and the organizational embeddedness of AI systems. Papagiannidis, Mikalef, and Conboy [12] argue that responsible AI governance requires not only a set of general principles, but also procedural, relational, and structural practices that make it possible to operationalize oversight over the technology. Schneider, Abraham, Meske, and vom Brocke [13], in turn, conceptualize AI governance for businesses as a problem of governing data, models, and AI systems through explicit decisions about who governs what and how. This perspective is directly relevant to enterprise LLM routing because model-selection decisions also require clear responsibility, auditability, and governance mechanisms.
Vidgof, Bachhofner, and Mendling [14] were among the first to systematically describe the opportunities and challenges of using LLMs in BPM, indicating that the potential of these models spans many stages of the process lifecycle, while simultaneously requiring new rules of application. Subsequent studies have presented more specific applications: Bernardi et al. [15] proposed the BPLLM framework for process-aware decision support; Kourani et al. [16] developed an approach for generating process models from textual descriptions; Apaydin and Zisgen [17] investigated the use of local language models for process modeling; and Kourani et al. [18] introduced a benchmark and self-improvement analysis of models in process-modeling tasks. Kampik et al. [19], in turn, formulated a broader vision of “large process models,” in which LLMs are intended to support contextual modeling and the improvement of business processes. The common denominator of these studies is clear: the effectiveness of LLMs in enterprise applications is task-dependent, and model selection should be linked to the type of task, the required level of reliability, and the organizational context of use.
The literature closest to the problem addressed in this article concerns the routing and cascading of language models. Yue et al. [20] developed an ICLR-published LLM cascade for cost-efficient reasoning, using answer consistency from weaker models as a signal for deciding whether escalation to a stronger model is necessary. Chen, Zaharia, and Zou [21], in their TMLR-published FrugalGPT study, showed that cascaded or adaptive compositions of LLM APIs can improve the cost–performance trade-off. Šakota, Peyrard, and West [22] proposed FORC, a meta-model-based approach to cost-effective language model choice across multiple tasks. Ong et al. [23] introduced RouteLLM, an ICLR-published framework that learns to route between weaker and stronger LLMs using preference data. Song et al. [24] proposed IRT-Router, an ACL-published multi-LLM routing approach that models LLM abilities and query difficulty using Item Response Theory. Shah and Shridhar [25] proposed Select-then-Route, a taxonomy-guided approach that first narrows the model pool and then applies adaptive routing or cascading within that pool. Taken together, these studies demonstrate that LLM routing is an active and increasingly mature research area focused on cost, performance, confidence, interpretability, and latency trade-offs.
This literature is highly valuable, but it also reveals important limitations from the perspective of enterprise governance. Most routing approaches optimize primarily for technical or economic targets, such as benchmark accuracy, expected reward, inference cost, latency, model ability, or query difficulty. Organizationally meaningful criteria, including business risk, standardization requirements, auditability, and managerial acceptability, are usually not modeled explicitly. Moreover, learned, embedding-based, or preference-based routers may be effective, but their internal logic can be difficult for non-technical stakeholders to interpret and audit. These limitations do not invalidate technical routers. Rather, they indicate that enterprise environments may require an additional decision-support layer that translates organizational preferences into transparent routing rules. The present study, therefore, does not claim that linear multicriteria aggregation is theoretically superior to learned or semantic routers, but investigates whether an explicitly parameterized multicriteria layer can provide a transparent and empirically competitive routing rule when managerial preferences must be visible in the decision process.
This is the point at which multicriteria decision-support methods become relevant. Their role in this study is not to replace semantic similarity, preference learning, or confidence estimation. Rather, they provide a formal mechanism for making organizational priorities explicit. In enterprise AI governance, this explicitness is important because routing decisions may need to be justified after the fact: why a prompt was escalated to a stronger model, why a routine task was assigned to a cheaper model, or why business risk overrode cost sensitivity. AHP and SAW are therefore used here because of their transparency, auditability, and compatibility with managerial decision-making, not because they are assumed to be universally more accurate than learned routing policies.
Multicriteria decision-support methods are well suited to this role [26]. The classic works of Saaty [27,28] and Hwang and Yoon [29] laid the foundations for the AHP and SAW methods, which make it possible to translate decision-makers’ preferences into a formal rule for the evaluation and selection of alternatives. Later studies confirm that SAW remains a transparent, interpretable, and convenient method in situations requiring the aggregation of multiple criteria, whereas AHP is a useful tool for determining criterion weights on the basis of expert judgments [30,31]. Importantly, MCDM methods have already been applied to the selection of AI-based systems, including chatbots for customer service, demonstrating that decisions concerning the choice of AI tools can be effectively formalized as multicriteria problems [32]. Related management research also shows that formal decision models can integrate measurable and qualitative factors through expert knowledge and explicit weighting procedures under uncertainty [33]. However, the available literature still lacks convincing attempts to apply AHP and SAW to prompt routing between LLMs in enterprise environments while taking into account criteria of managerial significance rather than exclusively technical ones.
The use of AHP and SAW in the present study is therefore motivated by transparency and organizational fit rather than by an assumption of universal predictive superiority. AHP provides a structured procedure for eliciting and documenting stakeholder preferences, while SAW offers a simple and auditable aggregation rule that can be inspected by non-technical decision-makers. At the same time, the linear and additive nature of SAW imposes important assumptions, including preferential independence among criteria and the absence of strong threshold or veto effects. These assumptions are particularly relevant in LLM routing because criteria such as business risk, response-time sensitivity, standardization, and creativity may interact. For this reason, the empirical part of the study treats SAW as a baseline multicriteria routing mechanism and evaluates its robustness through sensitivity analysis, alternative aggregation methods, a confidence-margin rule, and a risk-veto extension.
In summary, the literature is currently developing along three related but still weakly integrated streams: research on the use of LLMs and GenAI in organizations, research on model routing and cascading, and research on governance and multicriteria decision support. The first stream demonstrates the growing importance of LLMs for productivity, knowledge creation, and business transformation; the second provides techniques for improving the cost–quality trade-off; and the third offers tools for formalizing organizational preferences. However, what remains insufficiently developed is an approach that would integrate these three perspectives and conceptualize LLM routing as a managerial problem, in which the decision regarding model selection depends on cost, quality, risk, response time, and the requirement for standardization, while the effectiveness of the solution is assessed through the lens of the business adequacy of the response. It is precisely this gap that the present article seeks to address.

3. Materials and Methods

The study follows a design-and-evaluation approach. Its objective is to develop and empirically evaluate an interpretable routing mechanism that supports the selection of an LLM in an enterprise environment under explicit organizational preferences. The proposed mechanism assumes that the choice between a cheaper and a stronger model should depend not only on expected response quality and cost, but also on the business risk of an error, required reasoning depth, response-time sensitivity, standardization requirements, and the need for creativity.
The empirical procedure consisted of seven stages: identification of managerial routing criteria, elicitation of criterion weights using AHP, construction of a structured business-prompt dataset, prompt-level scoring by a lightweight LLM-based router, SAW-based routing with confidence-margin and risk-veto extensions, generation of responses by the reference models, and evaluation of routing strategies using response sufficiency, token-level cost, latency, robustness, and statistical-comparison metrics.
From a managerial perspective, the study does not ask which model is objectively best in all circumstances. Instead, it asks whether an auditable routing rule can allocate prompts between model tiers in a way that approximates the sufficiency level of the stronger model while reducing token-level cost and preserving alignment with organizational decision criteria.
Stage 1. Identification of managerial decision criteria for LLM model selection
For prompt evaluation, the following set of decision criteria was used:
  • C1—required substantive accuracy (describes how high the correctness and precision of the response must be for the outcome to be business-useful; the greater the required accuracy, the greater the need to use a more advanced model);
  • C2—risk of the business consequences of error (describes the potential effects of generating an incomplete, misleading, or incorrect response; an error in a draft marketing post has different significance than an error in a compliance analysis, HR policy, communication with a strategic client, or the interpretation of a document);
  • C3—required depth of reasoning (describes whether the task requires simple information processing or multi-step reasoning, synthesis, and logical analysis; the greater the depth of reasoning required, the greater the likelihood that the organization will prefer a more powerful model);
  • C4—sensitivity to processing cost (describes how important cost savings are from the company’s perspective for a given type of query; not every task requires maximizing quality at any cost, and in many large-scale processes unit cost is the priority);
  • C5—task sensitivity to response time (describes how important it is to obtain the result quickly; in operational, contact-intensive, or high-volume tasks, speed may be just as important as quality);
  • C6—required standardization and compliance of the response (describes whether the response must strictly conform to the adopted style, structure, company policy, or communication standard; in organizations, a large part of AI’s value derives not from creativity, but from repeatability, consistency, and the scalability of communication);
  • C7—required creativity/openness of generation (describes the extent to which the task requires a creative, non-standard, or exploratory approach).
This set of criteria combines four managerial logics: cost efficiency, risk control, quality of the decision-making process, and operational effectiveness.
Stage 2. Determination of criterion weights by organizational experts
Criterion weights were elicited from a three-person expert panel representing complementary organizational perspectives: senior management, operational management, and AI implementation. The panel included the Chief Executive Officer, an operations manager, and an AI implementation manager from the studied enterprise. The experts were selected because the routing decision involves both strategic cost–risk trade-offs and operational considerations related to response usefulness, time sensitivity, standardization, and AI adoption (Table 1). Before completing the pairwise comparisons, the experts received a short instruction on the Saaty nine-point scale and on the interpretation of each routing criterion. The purpose of this step was to reduce ambiguity in the comparison task and to ensure that criteria such as business risk, cost sensitivity, and standardization were interpreted consistently.
To translate organizational preferences into a formal decision model, the Analytic Hierarchy Process (AHP) method was applied. This method makes it possible to determine criterion weights on the basis of pairwise comparisons made by experts. In the first step, each expert compares every criterion with every other criterion by answering the question of which of them is more important from the organization’s perspective and to what extent. The comparisons were conducted using the classical nine-point Saaty scale, in which 1 denotes equal importance of both criteria, 3 a moderate preference of one criterion over the other, 5 a strong preference, 7 a very strong preference, and 9 an extreme preference; the values 2, 4, 6, and 8 represent intermediate judgments. For the $k$-th expert, a pairwise comparison matrix is constructed:
$$A^{(k)} = \left[ a_{ij}^{(k)} \right]_{n \times n}$$
where $a_{ij}^{(k)}$ denotes the relative importance of criterion $C_i$ with respect to criterion $C_j$, subject to the following conditions:
$$a_{ii}^{(k)} = 1, \qquad a_{ij}^{(k)} > 0, \qquad a_{ji}^{(k)} = \frac{1}{a_{ij}^{(k)}}.$$
Because the assessments were provided by several experts, it was necessary to aggregate them into a single group matrix. For this purpose, the geometric mean of the experts’ judgments was applied, which is the standard solution in the group version of AHP. For each matrix element, the following was adopted:
$$a_{ij} = \left( \prod_{k=1}^{m} a_{ij}^{(k)} \right)^{\frac{1}{m}}$$
where m denotes the number of experts. In the present study, m = 3 was adopted. As a result, a group comparison matrix was obtained:
$$A = \left[ a_{ij} \right]_{n \times n}$$
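The group aggregation step above can be sketched in a few lines of Python. This is a minimal illustration, not the study's implementation: the function name `aggregate_expert_matrices` and the three 3×3 judgment matrices are hypothetical and chosen only to keep the example small (the study uses seven criteria).

```python
import math


def aggregate_expert_matrices(matrices):
    """Element-wise geometric mean of m pairwise-comparison matrices."""
    m = len(matrices)
    n = len(matrices[0])
    return [
        [math.prod(mat[i][j] for mat in matrices) ** (1.0 / m) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical Saaty-scale judgments from three experts (not the study's data).
expert_1 = [[1, 3, 5], [1 / 3, 1, 2], [1 / 5, 1 / 2, 1]]
expert_2 = [[1, 2, 4], [1 / 2, 1, 3], [1 / 4, 1 / 3, 1]]
expert_3 = [[1, 4, 6], [1 / 4, 1, 2], [1 / 6, 1 / 2, 1]]

group = aggregate_expert_matrices([expert_1, expert_2, expert_3])
```

A useful property of the geometric mean in this setting is that it preserves reciprocity: if every expert matrix satisfies $a_{ji}^{(k)} = 1 / a_{ij}^{(k)}$, the group matrix does as well, which is not guaranteed for the arithmetic mean.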
Based on matrix $A$, the vector of criterion weights was determined. In computational practice, the method of the normalized geometric mean of rows was applied. First, for each criterion, the geometric mean of the judgments in the row was calculated:
$$g_i = \left( \prod_{j=1}^{n} a_{ij} \right)^{\frac{1}{n}}$$
and the obtained values were then normalized, yielding the weight of each criterion:
$$w_i = \frac{g_i}{\sum_{j=1}^{n} g_j}, \qquad i = 1, 2, \ldots, n$$
The weight vector may therefore be written as:
$$w = (w_1, w_2, \ldots, w_n)$$
subject to the normalization condition:
$$\sum_{i=1}^{n} w_i = 1$$
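As a minimal sketch of the normalized row-geometric-mean procedure described above, the following Python function computes the weight vector from a group comparison matrix. The function name `ahp_weights` and the 3×3 matrix are hypothetical and serve only as an illustration; they are not the study's data.

```python
import math


def ahp_weights(A):
    """Criterion weights via the normalized geometric mean of rows."""
    n = len(A)
    # g_i: geometric mean of the judgments in row i
    g = [math.prod(row) ** (1.0 / n) for row in A]
    total = sum(g)
    # w_i: row geometric means normalized to sum to 1
    return [g_i / total for g_i in g]


# Hypothetical group comparison matrix for three criteria.
A = [[1, 3, 5], [1 / 3, 1, 2], [1 / 5, 1 / 2, 1]]
w = ahp_weights(A)
```

In this example the first criterion dominates the other two in every pairwise judgment, so it receives the largest weight, and the weights sum to one by construction.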
The obtained weights reflect the relative importance of the individual criteria from the organization’s perspective. The higher the value of $w_i$, the greater the influence of the given criterion on the subsequent decision regarding the selection of the LLM. An important element of the AHP method is also the assessment of the consistency of expert judgments. For this purpose, the maximum eigenvalue of the comparison matrix was calculated:
$$\lambda_{max} = \frac{1}{n} \sum_{i=1}^{n} \frac{(Aw)_i}{w_i}$$
On this basis, the consistency index was determined:
$$CI = \frac{\lambda_{max} - n}{n - 1}$$
followed by the consistency ratio:
$$CR = \frac{CI}{RI}$$
where $RI$ denotes the Random Index, which depends on the number of criteria analyzed. For $n = 7$, the value $RI = 1.32$ is typically adopted. In the literature, pairwise comparisons are considered sufficiently consistent when $CR < 0.10$. If $CR$ exceeds the threshold of 0.10, the expert judgments are considered excessively inconsistent and should be re-examined. In addition to the consistency assessment, the stability of the weight vector was examined through a perturbation-based sensitivity analysis. The analysis tested whether moderate changes in the AHP-derived criterion weights altered the routing decisions or materially changed the cost–sufficiency trade-off. This step was introduced because AHP weights may be sensitive to expert judgments and because different organizational stakeholders may legitimately assign different priorities to cost, risk, quality, and response-time criteria. The final empirical evaluation therefore reports not only the aggregated weight vector, but also the robustness of routing results under weight perturbations. The resulting weight vector was then used in the next stage of the procedure, namely in the SAW method employed for routing prompts between the less expensive model and the more capable model.
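The consistency check described above can be sketched as follows. This is an illustrative implementation under stated assumptions: the function names and the 3×3 example matrix are hypothetical, and the Random Index table uses commonly tabulated Saaty values, of which only the entry for seven criteria (1.32) is given in the text.

```python
import math

# Commonly tabulated Saaty Random Index values; only n = 7 appears in the text.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(A):
    """Criterion weights via the normalized geometric mean of rows."""
    n = len(A)
    g = [math.prod(row) ** (1.0 / n) for row in A]
    total = sum(g)
    return [g_i / total for g_i in g]


def consistency_ratio(A, w):
    """Return (lambda_max, CI, CR) for comparison matrix A and weights w."""
    n = len(A)
    # (Aw)_i: i-th component of the matrix-vector product
    Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(Aw[i] / w[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)
    return lambda_max, ci, ci / RANDOM_INDEX[n]


# Hypothetical group comparison matrix for three criteria.
A = [[1, 3, 5], [1 / 3, 1, 2], [1 / 5, 1 / 2, 1]]
w = ahp_weights(A)
lambda_max, ci, cr = consistency_ratio(A, w)
is_acceptable = cr < 0.10  # threshold stated in the text
```

For a perfectly consistent matrix, in which every entry equals the ratio of the corresponding weights, the estimate of the maximum eigenvalue equals the number of criteria and both indices are zero, so any positive value measures the departure from that ideal.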
Stage 3. Development of the research prompt dataset
The empirical dataset consisted of 500 business prompts designed to represent heterogeneous enterprise use cases. The prompts were structured across four descriptive dimensions: business function, task type, risk level, and industry context. This design was adopted to reduce the risk that the evaluation would reflect only a narrow class of short, low-risk, or highly standardized queries. The dataset covered ten business functions: Legal/Compliance, IT/Security, Finance, HR, Sales, Marketing, Operations, Procurement, Strategy, and Customer Support. It also covered ten task types: data interpretation, creative generation, decision support, business email drafting, risk assessment, classification, process improvement, summarization, report synthesis, and policy drafting. The prompts were additionally assigned to three risk levels: low, medium, and high. The prompt set was intentionally balanced to include both routine tasks likely to be sufficient for a cheaper model and more complex or risk-sensitive tasks likely to require escalation to a stronger model. This structure made it possible to evaluate not only the aggregate performance of routing strategies, but also their behavior across different business functions, task categories, and risk levels (Table 2).
The prompt dataset was constructed using a stratified template-based procedure. Each prompt was generated from a structured specification containing four metadata fields: business function, task type, risk level, and industry context. The purpose of this procedure was to avoid an evaluation dominated by a single class of prompts, such as short, low-risk summaries or routine emails. The prompt specifications were distributed across business functions, task categories, and risk levels so that the router would be tested on both routine and escalation-worthy cases. The dataset did not contain confidential company data, personal data, or real customer records. Instead, it used synthetic but business-plausible scenarios representing typical enterprise tasks such as summarization, policy drafting, decision support, risk assessment, data interpretation, customer communication, and process improvement.
The construction process followed four steps. First, the set of business functions, task types, and risk levels was defined. Second, synthetic business scenarios and output constraints were assigned to these categories. Third, prompts were generated so that each prompt was self-contained and could be answered without access to external organizational documents. Fourth, the resulting dataset was checked for basic completeness: each prompt had to specify a task, a business context, and an expected output form. This procedure improves reproducibility because the dataset can be reconstructed from explicit metadata and prompt-construction rules, while also limiting the risk that results depend on idiosyncratic real-company documents.
Stage 4. Prompt-level scoring by the routing model
Each prompt was evaluated before response generation by a lightweight LLM-based routing model. In the empirical implementation, GPT-5-nano was used as the prompt-scoring router. This model was separate from the cheaper response-generation model, GPT-4o-mini, and from the stronger response-generation model, GPT-5. This separation was introduced to reduce the circularity risk that would arise if the cheaper response model were also responsible for estimating whether its own capabilities were sufficient.
The router did not evaluate the generated answers. Instead, it evaluated the input prompt before model selection and returned a structured JSON object containing scores for seven criteria: required accuracy, business risk, reasoning depth, cost sensitivity, response-time sensitivity, standardization, and creativity. Each criterion was scored on a five-point ordinal scale, where higher values indicated a stronger presence of the corresponding property. These scores constituted a computable prompt-level vector used by the routing rule. The following scale was used:
C1—Accuracy
To what extent does the required accuracy of the response exceed the typical capabilities of the less expensive model?
1—The less expensive model will probably be fully sufficient.
2—The less expensive model should be sufficient with only minor risk.
3—Borderline task, with no clear advantage.
4—Higher accuracy clearly favors the more expensive model.
5—The required accuracy definitely justifies the more expensive model.
C2—Business_Risk
To what extent does the potential error justify escalation to the more expensive model?
1—A possible error has little business significance.
2—An error would be undesirable, but not critical.
3—An error would have moderate significance.
4—An error could have serious consequences.
5—An error definitely justifies cautious escalation.
C3—Reasoning_Depth
To what extent does the task require reasoning beyond the typical capabilities of the less expensive model?
1—A simple, routine, or template-based task.
2—Minor analysis or organization of information.
3—Moderate reasoning.
4—Complex, multi-step reasoning.
5—Deep reasoning definitely justifies the more expensive model.
C4—Cost_Sensitivity
How important is it to complete the task at the lowest possible cost?
1—Cost has little significance.
2—Cost has minor significance.
3—Cost has moderate significance.
4—Cost is important and favors the less expensive model.
5—Cost is very important and strongly favors the less expensive model.
C5—Time_Sensitivity
How important is it to obtain a rapid response?
1—Response time has little significance.
2—Time has minor significance.
3—Time has moderate significance.
4—A rapid response is important.
5—Response speed strongly favors the less expensive model.
C6—Standardization
To what extent is the task template-based, predictable, and grounded in a standard response structure?
1—The response requires a non-standard approach.
2—The response is rather non-standard.
3—The response is partially standardized.
4—The response is highly template-based.
5—The response is strongly standardized and predictable.
C7—Creativity
To what extent does the task require a creative, conceptual, or non-standard approach?
1—Creativity is not needed.
2—A small degree of creativity may help.
3—Moderate creativity.
4—High creativity is useful.
5—The task clearly requires a more creative model.
In this way, a score vector is obtained for each prompt $P_i$: $p_i = (p_{i1}, p_{i2}, \ldots, p_{i7})$.
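Because the routing rule consumes this vector directly, a deployment would normally validate the router output before aggregation. The following minimal Python sketch illustrates one way to do this. The JSON field names follow the criterion labels C1 to C7 above, but the exact keys returned by the router in the study are an assumption, as is the helper name `parse_router_scores`.

```python
import json

# Field names mirror the labels C1..C7; the router's actual JSON keys are assumed.
CRITERIA = [
    "Accuracy", "Business_Risk", "Reasoning_Depth", "Cost_Sensitivity",
    "Time_Sensitivity", "Standardization", "Creativity",
]

def parse_router_scores(raw_json: str) -> list:
    """Parse router output into the ordered score vector (p_i1, ..., p_i7)."""
    data = json.loads(raw_json)
    scores = []
    for name in CRITERIA:
        if name not in data:
            raise ValueError(f"missing criterion: {name}")
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer, got {value!r}")
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must lie on the 1..5 scale, got {value}")
        scores.append(value)
    return scores

# Hypothetical router response for a single prompt.
example = (
    '{"Accuracy": 4, "Business_Risk": 5, "Reasoning_Depth": 3, '
    '"Cost_Sensitivity": 2, "Time_Sensitivity": 2, '
    '"Standardization": 3, "Creativity": 1}'
)
print(parse_router_scores(example))
```

Rejecting malformed or out-of-range scores at this point keeps invalid router output from silently shifting the weighted aggregation in Stage 5.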
Because prompt scoring is itself an inference step, the router’s computational overhead was explicitly recorded. For each prompt-scoring call, the experiment stored input tokens, output tokens, router cost, and router latency. These values were later included in the cost and latency evaluation of routing strategies. As a result, the reported cost advantage of the routing framework is not based on a zero-cost router assumption, but on the actual token-level cost of both prompt scoring and response generation. The full prompt-scoring router protocol is provided in Appendix A.
Stage 5. Routing of prompts to one of the analyzed LLMs using the SAW method
To assign each prompt to either the less expensive model or the more capable model, the Simple Additive Weighting (SAW) method was applied. This method consists of calculating the total weighted score of each decision alternative and then selecting the alternative with the highest final value. In the present study, the alternatives are two LLMs: the less expensive model and the more capable model. For each prompt $P_i$, the score vector obtained in Stage 4 is known:
$$p_i = (p_{i1}, p_{i2}, \ldots, p_{i7})$$
as well as the criterion weight vector determined using the AHP method in Stage 2:
$$w = (w_1, w_2, \ldots, w_7), \qquad \sum_{j=1}^{7} w_j = 1.$$
It was assumed that high values of criteria C1, C2, C3, and C7 favor the selection of the more capable model, whereas high values of criteria C4, C5, and C6 favor the selection of the less expensive model. To maintain a uniform evaluation logic, two synthetic scores were calculated for each prompt: one for the less expensive model and one for the more capable model. The score for the more capable model was determined as:
$$S_i^{(\mathrm{strong})} = w_1 p_{i1} + w_2 p_{i2} + w_3 p_{i3} + w_7 p_{i7} + w_4 (6 - p_{i4}) + w_5 (6 - p_{i5}) + w_6 (6 - p_{i6}),$$
whereas the score for the less expensive model was determined as:
$$S_i^{(\mathrm{cheap})} = w_4 p_{i4} + w_5 p_{i5} + w_6 p_{i6} + w_1 (6 - p_{i1}) + w_2 (6 - p_{i2}) + w_3 (6 - p_{i3}) + w_7 (6 - p_{i7}).$$
Such a construction means that criteria favoring a given model increase its score, whereas criteria supporting the alternative model are reversed according to the transformation $(6 - p_{ij})$. As a result, both alternatives can be evaluated within the same aggregation logic. The final decision rule is as follows:
$$d_i = \begin{cases} \text{cheap}, & \text{if } S_i^{(\mathrm{cheap})} \ge S_i^{(\mathrm{strong})}, \\ \text{strong}, & \text{if } S_i^{(\mathrm{strong})} > S_i^{(\mathrm{cheap})}. \end{cases}$$
Additionally, the difference between the scores of the two models was calculated as:
$$\Delta_i = S_i^{(\mathrm{strong})} - S_i^{(\mathrm{cheap})}$$
where positive values indicate that the prompt is more strongly associated with the stronger model, whereas negative values indicate that the cheaper model is favored. In the basic SAW variant, the prompt would be routed to the stronger model when $\Delta_i > 0$ and to the cheaper model otherwise. However, because near-zero score gaps may indicate ambiguous routing cases, the final routing variant introduced a confidence margin $\varepsilon$. In the empirical implementation, $\varepsilon = 0.10$. If the advantage of the stronger model did not exceed this margin, the prompt was routed to the cheaper model. This rule reflects a conservative cost-aware assumption: escalation is justified only when the stronger model has a sufficiently clear advantage.
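The score construction and the resulting gap can be sketched in a few lines of Python. The weights are the AHP values reported in Section 4.1; the function names and the example score vector are illustrative assumptions and do not reproduce the authors' script.

```python
# AHP-derived weights for C1..C7, as reported in Section 4.1.
WEIGHTS = [0.2591, 0.1337, 0.0874, 0.2086, 0.1894, 0.0922, 0.0297]
STRONG_IDX = (0, 1, 2, 6)  # C1, C2, C3, C7: high scores favor the stronger model
CHEAP_IDX = (3, 4, 5)      # C4, C5, C6: high scores favor the cheaper model

def saw_scores(p, w=WEIGHTS):
    """Return (S_strong, S_cheap) for one score vector p with values in 1..5."""
    s_strong = (sum(w[j] * p[j] for j in STRONG_IDX)
                + sum(w[j] * (6 - p[j]) for j in CHEAP_IDX))
    s_cheap = (sum(w[j] * p[j] for j in CHEAP_IDX)
               + sum(w[j] * (6 - p[j]) for j in STRONG_IDX))
    return s_strong, s_cheap

def score_gap(p, w=WEIGHTS):
    """Delta_i = S_strong - S_cheap."""
    s_strong, s_cheap = saw_scores(p, w)
    return s_strong - s_cheap

# Hypothetical prompt: high accuracy and risk demands, low cost and time pressure.
p_example = [4, 5, 3, 2, 2, 3, 1]
s_strong, s_cheap = saw_scores(p_example)
print(round(s_strong, 4), round(s_cheap, 4), round(score_gap(p_example), 4))
```

For this hypothetical vector the gap is about 1.73, so the prompt would be escalated under any margin considered in the study. Because the weights sum to one (up to rounding in the reported values), $S_i^{(\mathrm{strong})} + S_i^{(\mathrm{cheap})} = 6$ for every prompt, which gives $\Delta_i = 2 S_i^{(\mathrm{strong})} - 6$ and $\Delta_i \in [-4, 4]$. A score of 3 on every criterion yields $\Delta_i = 0$, and a margin of 0.10 corresponds to 2.5% of the largest possible gap.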
The following margin-based decision rule was applied:
$$\mathrm{route}_i = \begin{cases} \text{strong}, & \text{if } \Delta_i > \varepsilon, \\ \text{cheap}, & \text{if } \Delta_i \le \varepsilon. \end{cases}$$
In addition, a risk-veto extension was introduced for prompts identified before routing as high-risk or risk-sensitive cases. The risk-veto flag was based on the prompt’s ex ante risk annotation and the router’s business-risk assessment. If a prompt was flagged as a risk-veto candidate, it was routed to the stronger model even when its SAW score gap did not exceed the confidence margin. This extension addresses a limitation of purely additive aggregation, in which high business risk could otherwise be compensated by cost or time sensitivity. The final routing strategy evaluated in the study is therefore referred to as SAW with confidence margin and risk veto.
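The combined margin-and-veto rule can be expressed as a small function. This is a sketch under stated assumptions: the veto flag is passed in as a boolean because the exact rule that derives it from the risk annotation and the business-risk score is not given in this section, and the function name `route` is hypothetical.

```python
def route(delta: float, risk_veto: bool, epsilon: float = 0.10) -> str:
    """Margin-based routing with a risk-veto override.

    delta     -- score gap S_strong - S_cheap for the prompt
    risk_veto -- True if the prompt was flagged as a risk-veto candidate
                 (how the flag is derived is not reproduced here)
    epsilon   -- confidence margin; 0.10 in the empirical implementation
    """
    if risk_veto:
        # The veto overrides the additive score: risk cannot be
        # compensated by cost or time sensitivity.
        return "strong"
    return "strong" if delta > epsilon else "cheap"

# Borderline gap without and with the veto flag.
print(route(0.05, risk_veto=False))
print(route(0.05, risk_veto=True))
```

Keeping the veto as a separate branch, rather than folding it into the weights, makes each escalation traceable to either the score gap or the risk flag.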
Stage 6. Response generation and token-level cost measurement
After prompt scoring and routing-score calculation, responses were generated for each prompt using two reference response models: GPT-4o-mini as the cheaper model and GPT-5 as the stronger model. The prompt-scoring router was GPT-5-nano and was used only before response generation. Thus, the routing model, the cheaper response model, and the stronger response model were treated as three separate components of the pipeline. Table 3 summarizes the model configuration used in the empirical evaluation, including the role of each model in the experimental pipeline, the main inference settings, and the cost basis used for token-level cost calculation.
Cost values correspond to the token-pricing assumptions used in the experimental script. Routing-strategy costs include the prompt-scoring router cost and selected response-model generation cost. The sufficiency evaluator was used only for offline evaluation and was not included in the routing-strategy cost.
For each of the 500 prompts, responses were generated by both reference models. This made it possible to construct and compare several routing strategies on the same prompt set: always-cheap, always-strong, SAW without margin, SAW with confidence margin and risk veto, and additional heuristic or alternative multicriteria baselines.
Costs were calculated at the token level rather than using fixed average per-prompt assumptions. For each response-generation call, the experiment recorded input tokens, output tokens, response cost, and response latency. For routing-based strategies, the total cost of handling a prompt included both the router cost and the cost of the selected response model:
$$\mathrm{Cost}_i^{\mathrm{routing}} = \mathrm{Cost}_i^{\mathrm{router}} + \mathrm{Cost}_i^{\mathrm{selected\ response}}$$
For fixed strategies, the total cost consisted only of the response-generation cost of the selected model:
$$\mathrm{Cost}_i^{\mathrm{fixed}} = \mathrm{Cost}_i^{\mathrm{response}}$$
This design ensured that the cost advantage of routing was evaluated under a conservative assumption in which the router was not treated as free. Latency was measured analogously: for routing strategies, the reported latency included the router latency and the latency of the selected response model, whereas for fixed strategies, it included only response-generation latency.
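The token-level accounting can be illustrated as follows. The per-token prices and token counts below are placeholders chosen only to make the arithmetic visible; they are not the pricing assumptions of Table 3, and the `Call` class is a hypothetical helper.

```python
from dataclasses import dataclass

@dataclass
class Call:
    """One model call with its token counts and per-million-token prices."""
    input_tokens: int
    output_tokens: int
    price_in: float   # USD per 1M input tokens (placeholder value)
    price_out: float  # USD per 1M output tokens (placeholder value)

    def cost(self) -> float:
        return (self.input_tokens * self.price_in
                + self.output_tokens * self.price_out) / 1_000_000

def routing_cost(router: Call, selected: Call) -> float:
    """Router cost plus the cost of the selected response model."""
    return router.cost() + selected.cost()

def fixed_cost(response: Call) -> float:
    """Fixed strategies pay only for response generation."""
    return response.cost()

# Placeholder numbers, for illustration only.
router_call = Call(input_tokens=600, output_tokens=80, price_in=1.0, price_out=2.0)
cheap_call = Call(input_tokens=300, output_tokens=500, price_in=1.0, price_out=2.0)
print(routing_cost(router_call, cheap_call), fixed_cost(cheap_call))
```

Latency can be accumulated in the same way, by adding router latency to the latency of the selected response call for routing strategies only.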
Stage 7. Evaluation of routing strategies
The final stage evaluated whether the proposed routing mechanism provided a favorable cost–sufficiency trade-off relative to fixed model-selection strategies and alternative routing baselines. The evaluation was conducted at the prompt level using the same set of 500 business prompts. For each prompt, the analysis compared the response that would be obtained under each routing strategy. Response sufficiency was evaluated using a structured LLM-as-a-judge protocol with three independent evaluator profiles. The evaluators assessed whether a response was acceptable for use in organizational conditions without significant substantive revision. To reduce model-label bias, responses were evaluated using anonymized model labels rather than the actual model names. For each response, the evaluator profiles returned structured quality scores referring to instruction following, completeness, specificity, business usability, risk and safety, and style or format adequacy. They also indicated whether the response contained a major issue requiring substantive revision. Individual binary sufficiency labels were then derived deterministically from these scores using the risk-adjusted thresholds reported in Appendix B. The final response-level sufficiency classification was determined by majority vote. A response was classified as sufficient if at least two of the three evaluator profiles classified it as sufficient. Agreement among the three automated evaluator profiles was calculated to assess the consistency of sufficiency judgments. Fleiss’ kappa was reported for all evaluated responses and separately for the cheaper and stronger response models.
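The aggregation of evaluator labels and the agreement statistic can be sketched as below. The sketch starts from the binary labels; the risk-adjusted thresholds of Appendix B that produce those labels are not reproduced, and the function names are assumptions. The kappa function implements the standard Fleiss formula restricted to two categories.

```python
def majority_sufficient(labels):
    """True if at least two of the three evaluator profiles say 'sufficient'."""
    if len(labels) != 3:
        raise ValueError("expected exactly three evaluator labels")
    return sum(labels) >= 2

def fleiss_kappa_binary(votes):
    """Fleiss' kappa for binary labels; one inner list of rater labels per response."""
    n_items = len(votes)
    n_raters = len(votes[0])
    if any(len(v) != n_raters for v in votes):
        raise ValueError("every response needs the same number of raters")
    total_yes = 0
    agreement_sum = 0.0
    for v in votes:
        yes = sum(v)
        no = n_raters - yes
        total_yes += yes
        # Share of agreeing rater pairs for this response.
        agreement_sum += (yes * yes + no * no - n_raters) / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_items
    p_yes = total_yes / (n_items * n_raters)
    p_e = p_yes ** 2 + (1 - p_yes) ** 2  # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all labels fall in one category")
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical labels from three evaluator profiles for four responses.
votes = [[True, True, True], [True, True, False],
         [False, False, False], [True, False, False]]
print([majority_sufficient(v) for v in votes], round(fleiss_kappa_binary(votes), 3))
```

Note that a high sufficiency rate makes chance agreement large, so kappa can be moderate even when the raw share of unanimous votes is high.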
The use of LLM-as-a-judge was adopted for scalability and consistency, but it is also a methodological limitation. The evaluation does not replace human expert assessment of organizational usefulness, contextual appropriateness, or politically sensitive business communication. To reduce evaluation arbitrariness, the judge used a structured sufficiency rubric and returned a constrained JSON output. The evaluator was instructed to assess whether the answer was acceptable for practical enterprise use, followed the prompt, was coherent, did not omit critical requested elements, and did not introduce obvious risky claims. The evaluator did not compare alternative model answers directly, but assessed each answer against the prompt and sufficiency criteria. Nevertheless, the absence of independent human validation means that the reported sufficiency rates should be interpreted as a structured automated evaluation rather than as final evidence of human-perceived business usefulness. This step was included because response sufficiency is a utility-oriented measure and may involve judgment under uncertainty. The main effectiveness metric was the sufficiency rate (SR):
$$SR_s = \frac{N_s^{\mathrm{sufficient}}}{N}$$
where $N_s^{\mathrm{sufficient}}$ denotes the number of prompts for which strategy $s$ produced a sufficient response, and $N$ denotes the total number of prompts.
Cost was measured at the token level. For each strategy, the following cost indicators were calculated:
$$ACP_s = \frac{\mathrm{Cost}_s}{N}$$
$$CSR_s = \frac{\mathrm{Cost}_s}{N_s^{\mathrm{sufficient}}}$$
where $ACP_s$ denotes the average cost per prompt and $CSR_s$ denotes the cost per sufficient response. For routing strategies, the total cost included both the router cost and the selected response-model cost. For fixed strategies, the total cost included only the cost of the selected response model.
To compare the economic efficiency of strategies that improved sufficiency relative to the always-cheap baseline, the incremental cost of sufficiency gain was calculated as:
$$ICSG_s = \frac{\mathrm{Cost}_s - \mathrm{Cost}_{\mathrm{cheap}}}{100 \times (SR_s - SR_{\mathrm{cheap}})}$$
This metric expresses the additional cost required to increase the sufficiency rate by one percentage point relative to the always-cheap strategy. In addition to cost and sufficiency metrics, latency was measured for each strategy. For fixed strategies, latency corresponded to response-generation latency. For routing strategies, latency included both router latency and selected response-model latency. Average latency and 95th percentile latency were reported. The proposed SAW routing strategy was evaluated against the following baselines: always-cheap, always-strong, keyword-risk routing, token-threshold routing, TF-IDF centroid routing, logistic-regression routing, SAW without confidence margin, multiplicative SAW, and TOPSIS. The baseline strategies were operationalized as follows. The keyword-risk baseline escalated prompts containing predefined risk-sensitive terms or belonging to high-risk categories to the stronger model. The token-threshold baseline routed prompts to the stronger model when the prompt length exceeded a predefined input-token threshold. The TF-IDF centroid baseline represented prompts using TF-IDF vectors and assigned them according to similarity to cheap- and strong-oriented centroids derived from routing labels. The logistic-regression baseline used prompt-level features to predict whether the stronger model should be selected. The logistic-regression baseline was evaluated using stratified cross-validation to avoid training and testing the classifier on the same prompt instances.
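The indicators defined in this stage (SR, ACP, CSR, and ICSG) reduce to a few divisions. The sketch below applies them to the aggregate totals reported in Section 4.3; the function names are assumptions, the count of 434 sufficient always-cheap responses is derived here from the reported 86.8% of 500, and the ICSG value is computed in the sketch rather than quoted from the article.

```python
def sufficiency_rate(n_sufficient, n_total):
    return n_sufficient / n_total

def avg_cost_per_prompt(total_cost, n_total):
    return total_cost / n_total

def cost_per_sufficient(total_cost, n_sufficient):
    return total_cost / n_sufficient

def icsg(cost_s, cost_cheap, sr_s, sr_cheap):
    """Extra cost per percentage point of sufficiency gained over always-cheap."""
    gain_pp = 100 * (sr_s - sr_cheap)
    if gain_pp <= 0:
        raise ValueError("ICSG is only defined for strategies that improve sufficiency")
    return (cost_s - cost_cheap) / gain_pp

N = 500
# Totals as reported in Section 4.3 (USD, number of sufficient responses).
proposed = (1.2899, 472)
always_strong = (2.0602, 473)
always_cheap = (0.0833, 434)  # 434 derived from SR = 86.8% of 500

sr_cheap = sufficiency_rate(always_cheap[1], N)
for name, (cost, n_suff) in [("proposed", proposed), ("always-strong", always_strong)]:
    sr = sufficiency_rate(n_suff, N)
    print(name,
          round(sr, 3),
          round(avg_cost_per_prompt(cost, N), 5),
          round(cost_per_sufficient(cost, n_suff), 5),
          round(icsg(cost, always_cheap[0], sr, sr_cheap), 4))
```

On these totals, the proposed strategy pays roughly USD 0.16 per additional percentage point of sufficiency over always-cheap, against roughly USD 0.25 for always-strong.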
Alternative MCDM baselines included SAW without confidence margin, multiplicative SAW, and TOPSIS, using the same criterion scores and AHP-derived weights as the proposed method. These baselines were included to compare the proposed framework not only with fixed strategies, but also with simple heuristic, learned, and alternative multicriteria routing rules. This comparison was introduced to avoid evaluating the proposed method only against trivial fixed strategies. Statistical uncertainty was addressed in three ways. First, Wilson confidence intervals were reported for sufficiency rates. Second, confidence intervals were calculated for the average cost per prompt. Third, paired exact McNemar tests were used to compare sufficiency outcomes between the proposed routing strategy and alternative strategies at the prompt level. Finally, several robustness checks were conducted. The analysis included confidence-margin sensitivity, perturbation-based weight sensitivity, alternative aggregation methods, criterion ablation, and stratified performance analysis by business function, task type, and risk level. These checks were used to assess whether the results depended on a narrow configuration of weights, margin settings, or prompt categories.
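Two of the uncertainty measures named above, the Wilson interval and the exact McNemar test, can be computed with the standard library alone. The sketch assumes a 95% confidence level (the default `z`) and a two-sided test; neither detail is stated explicitly in this section, and the discordant counts in the example are hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (default: 95% level)."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b -- prompts where only strategy A was sufficient
    c -- prompts where only strategy B was sufficient
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial tail under H0: discordant pairs split 50/50.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

low, high = wilson_interval(472, 500)  # proposed strategy: 472 of 500 sufficient
print(round(low, 3), round(high, 3))
print(mcnemar_exact(10, 0))            # hypothetical discordant counts
```

Only discordant pairs enter the McNemar statistic, so two strategies with nearly identical sufficiency rates and few disagreements will produce a p-value at or near 1.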

4. Results

The empirical evaluation was conducted on a dataset of 500 business prompts representing different business functions, task types, and risk levels. The prompt-scoring router was implemented using GPT-5-nano, whereas GPT-4o-mini and GPT-5 were used as the cheaper and stronger response-generation models, respectively. The results are organized as follows. First, the AHP-derived criterion weights and prompt-level routing patterns are presented. Second, the reliability of the automated response-sufficiency assessment is reported. Third, the aggregate performance of fixed, heuristic, MCDM-based, and learned routing strategies is compared. Fourth, computational overhead, latency, scalability, and deployment complexity are discussed. Fifth, paired statistical tests are used to assess whether differences in sufficiency outcomes are significant at the prompt level. Sixth, the sensitivity of the results to the confidence-margin parameter and AHP weight perturbations is examined. Seventh, failure modes are analyzed. Eighth, ablation and alternative aggregation analyses are presented. Finally, the results are stratified by business function, task type, and risk level.

4.1. Criterion Weights and Prompt-Level Routing Patterns

The AHP procedure produced a weight vector in which the highest importance was assigned to required accuracy, cost sensitivity, and response-time sensitivity. As shown in Table 4, the required accuracy received the largest weight ($w = 0.2591$), followed by cost sensitivity ($w = 0.2086$), response-time sensitivity ($w = 0.1894$), business risk ($w = 0.1337$), standardization ($w = 0.0922$), reasoning depth ($w = 0.0874$), and creativity ($w = 0.0297$). This structure indicates that the studied organization prioritized a combination of quality assurance and operational efficiency, while treating creativity as a relatively less important routing criterion for the analyzed business-prompt set.
The aggregated AHP comparison matrix satisfied the consistency requirement, with $CR = 0.0208$, which is well below the commonly accepted threshold of 0.10. The individual expert matrices were also consistent, with $CR$ values ranging from 0.0217 to 0.0269. This indicates that the expert judgments were highly consistent and suitable for deriving the criterion weight vector used in the routing procedure.

The prompt-level scores assigned by the router indicate that the dataset included many standardized and cost-sensitive tasks, but also a substantial number of prompts requiring escalation because of risk, accuracy, or reasoning requirements. Table 5 presents the mean router scores for the seven prompt-level criteria. The highest average score was observed for standardization ($M = 4.73$), followed by cost sensitivity ($M = 3.88$), business risk ($M = 3.36$), reasoning depth ($M = 3.30$), required accuracy ($M = 3.07$), response-time sensitivity ($M = 2.92$), and creativity ($M = 2.04$).
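For readers who wish to reproduce a consistency check of this kind, the following sketch computes priority weights and the consistency ratio from a pairwise comparison matrix. It assumes the principal-eigenvector method (approximated by power iteration) and the commonly cited random index values; the expert matrices and the exact prioritization procedure of Stage 2 are not shown in this section, so the input here is an artificial, perfectly consistent matrix built from the reported weights.

```python
# Commonly cited random index (RI) values by matrix size; assumed here.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp_weights_and_cr(matrix, iterations=100):
    """Return (weights, CR) for a positive reciprocal comparison matrix."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iterations):
        # Power iteration: multiply by the matrix and renormalize to sum 1.
        aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(aw)
        w = [x / total for x in aw]
    aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / w[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)  # consistency index
    return w, ci / RANDOM_INDEX[n]   # consistency ratio

# Artificial matrix a_ij = w_i / w_j built from the reported weights.
reported = [0.2591, 0.1337, 0.0874, 0.2086, 0.1894, 0.0922, 0.0297]
consistent = [[wi / wj for wj in reported] for wi in reported]
weights, cr = ahp_weights_and_cr(consistent)
print([round(x, 4) for x in weights], abs(round(cr, 6)))
```

A perfectly consistent matrix returns the input weights and a ratio of zero; real expert judgments deviate from this ideal, which is what the reported values between 0.02 and 0.03 quantify.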
Under the final SAW variant with confidence margin and risk veto, 310 prompts were routed to GPT-4o-mini and 190 prompts were routed to GPT-5. As shown in Table 6, the router assigned 62.0% of prompts to the cheaper model and escalated 38.0% to the stronger model. This allocation differs substantially from both fixed strategies and indicates that the router did not simply approximate the always-cheap or always-strong baseline.
The routing pattern was strongly differentiated by the ex ante risk level of the prompt. As shown in Table 7, among 170 high-risk prompts, 157 were routed to GPT-5 and only 13 to GPT-4o-mini. In contrast, among 167 low-risk prompts, 158 were routed to GPT-4o-mini and only nine to GPT-5. Of the remaining 163 medium-risk prompts, 139 were assigned to the cheaper model and 24 were escalated to GPT-5. This pattern confirms that the final routing rule behaved consistently with the intended managerial logic: routine and low-risk prompts were usually assigned to the cheaper model, whereas high-risk prompts were predominantly escalated to the stronger model.

4.2. Reliability of the Automated Sufficiency Assessment

Before comparing routing strategies, the internal reliability of the automated response-sufficiency assessment was examined. Sufficiency judgments were obtained using three independent LLM-as-a-judge evaluator profiles. Because the final sufficiency label was based on majority voting, it was important to verify whether the evaluator profiles produced sufficiently consistent judgments. As shown in Table 8, the overall Fleiss’ kappa for all evaluated responses was $\kappa = 0.445$, indicating moderate agreement. Agreement was slightly higher for GPT-4o-mini responses ($\kappa = 0.460$) and lower for GPT-5 responses ($\kappa = 0.373$). This difference suggests that judgments were more consistent for responses generated by the cheaper model, whereas stronger-model responses produced somewhat more borderline evaluation cases.
Fleiss’ kappa was calculated for binary sufficiency judgments produced by three independent LLM-as-a-judge evaluator profiles. Each of the 500 prompts had two evaluated responses, one from GPT-4o-mini and one from GPT-5. These reliability values should be interpreted as agreement among automated evaluator profiles, not as inter-rater reliability among human experts. They indicate that the structured LLM-as-a-judge protocol produced moderately consistent sufficiency labels, but they do not validate whether human managers or domain experts would reach the same judgments in a real organizational context.

4.3. Aggregate Comparison of Routing Strategies

Table 9 presents the aggregate performance of the fixed, heuristic, MCDM-based, and learned routing strategies. The always-strong strategy achieved the highest sufficiency rate, with 473 sufficient responses out of 500 ($SR = 94.6\%$), but it also generated the highest total cost (USD 2.0602). The always-cheap strategy had the lowest cost (USD 0.0833), but its sufficiency rate was substantially lower ($SR = 86.8\%$).
The proposed SAW routing strategy with confidence margin and risk veto produced 472 sufficient responses ($SR = 94.4\%$), which was only 0.2 percentage points below the always-strong strategy and 7.6 percentage points above the always-cheap strategy. At the same time, its total cost was USD 1.2899, corresponding to a 37.4% cost reduction relative to always-strong routing. The cost per sufficient response was also lower for the proposed strategy (USD 0.00273) than for always-strong routing (USD 0.00436).
Compared with the heuristic and alternative routing baselines, SAW with confidence margin and risk veto achieved the highest sufficiency rate among non-fixed strategies. Keyword-risk routing reached $SR = 93.4\%$, TF-IDF centroid routing reached $SR = 92.8\%$, multiplicative SAW and TOPSIS each reached $SR = 92.6\%$, logistic-regression routing reached $SR = 91.6\%$, and token-threshold routing reached $SR = 90.8\%$. Some baselines, especially TF-IDF centroid and logistic-regression routing, achieved lower average cost per prompt, but at the price of lower sufficiency. The proposed strategy, therefore, occupied a near-strong-model region of the cost–sufficiency trade-off rather than the minimum-cost region.

These results show that the proposed router does not simply minimize cost by sending most prompts to the cheaper model. Rather, it selectively preserves escalation for prompts whose criterion profile indicates higher business risk, stronger accuracy requirements, or deeper reasoning needs. The small sufficiency gap relative to the always-strong strategy suggests that the margin-and-veto variant successfully identifies many cases in which the cheaper model is sufficient while protecting a substantial share of risk-sensitive cases. The cost reduction is therefore not achieved through indiscriminate downgrading, but through differentiated allocation of prompt types according to the multicriteria profile.

The confidence margin also has an important managerial interpretation. Borderline cases are not treated as strong evidence for escalation; instead, escalation requires a sufficiently clear advantage of the stronger model, unless the risk-veto rule applies. This reflects a cost-aware enterprise policy: stronger models should be used when justified by risk, accuracy, reasoning depth, or creativity, but routine or standardized tasks should not be escalated merely because of small differences in aggregate score.
For routing strategies, cost and latency include router overhead.
The cost advantage of the proposed routing strategy was obtained at the expense of higher end-to-end latency. Because routing strategies require an additional prompt-scoring call before response generation, SAW with confidence margin and risk veto produced higher average latency than both fixed strategies. This result should be interpreted as an important deployment trade-off: the proposed framework is most suitable for enterprise use cases in which cost control, auditability, and response sufficiency are more important than minimizing response time. In latency-critical applications, parallelized routing, cached prompt scoring, smaller local classifiers, or direct model selection may be preferable.

4.4. Computational Overhead, Scalability, and Deployment Complexity

The routing framework introduces one additional inference step before response generation: prompt scoring by the lightweight router model. This step increases pipeline complexity relative to fixed always-cheap or always-strong strategies, because each prompt must first be scored on the seven routing criteria and then passed to the selected response model. For this reason, the experiment recorded router input tokens, router output tokens, router cost, and router latency for each prompt. These values were included in the cost calculation for routing-based strategies. The deployment implication is that the proposed framework is most appropriate when the expected savings from assigning a substantial share of prompts to the cheaper model exceed the additional cost and latency of the routing step.
From a scalability perspective, the routing rule itself is computationally simple after prompt scoring: SAW aggregation is linear in the number of criteria and alternatives. The main operational cost, therefore, comes not from the multicriteria calculation but from the router-model call. In high-volume deployments, organizations could reduce this overhead by caching scores for recurring prompt templates, batching routing requests where supported, or applying the router only to prompt classes for which fixed policies are insufficient. Conversely, for uniformly high-risk workloads, the economic benefit of routing may be limited because most prompts will be escalated to the stronger model while still incurring router overhead.
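One of the overhead-reduction options mentioned above, caching scores for recurring prompt templates, could look roughly as follows. This is a hypothetical sketch: the `ScoreCache` class, the whitespace-and-case normalization, and the stubbed scoring function are all assumptions, and a real deployment would need to decide whether template-level scores remain valid when slot contents change the risk profile.

```python
import hashlib

class ScoreCache:
    """Cache router scores keyed by a normalized prompt-template string."""

    def __init__(self, score_fn):
        self._score_fn = score_fn  # callable: template text -> score vector
        self._store = {}
        self.router_calls = 0      # number of times the router was actually invoked

    def _key(self, template: str) -> str:
        normalized = " ".join(template.split()).lower()
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def scores_for(self, template: str):
        key = self._key(template)
        if key not in self._store:
            self.router_calls += 1
            self._store[key] = self._score_fn(template)
        return self._store[key]

def fake_router(template: str):
    """Stand-in for the router-model call; returns a fixed score vector."""
    return [2, 1, 1, 4, 4, 5, 1]

cache = ScoreCache(fake_router)
cache.scores_for("Summarize the attached meeting notes for {team}.")
cache.scores_for("summarize the attached   meeting notes for {team}.")
print(cache.router_calls)
```

In this sketch the second request differs only in case and spacing, so it is served from the cache and the router is invoked once.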

4.5. Statistical Comparison of Sufficiency Outcomes

To assess whether the observed differences in sufficiency rates reflected systematic prompt-level differences rather than only aggregate percentage variation, paired exact McNemar tests were conducted. The tests compared the binary sufficiency outcomes of the proposed SAW strategy with confidence margin and risk veto against selected alternative strategies. The results are reported in Table 10. The proposed strategy significantly outperformed the always-cheap strategy in terms of prompt-level sufficiency outcomes ($p < 0.001$). This confirms that the increase in sufficiency from 86.8% to 94.4% was not only a descriptive difference but reflected a statistically significant improvement. By contrast, the difference between the proposed routing strategy and always-strong was not significant ($p = 1.000$), so the test detected no difference in sufficiency between the proposed router and the always-strong strategy, while total cost was 37.4% lower. Among the alternative routing strategies, the proposed approach significantly outperformed SAW without margin ($p = 0.012$), multiplicative SAW ($p = 0.022$), TOPSIS ($p = 0.022$), token-threshold routing ($p < 0.001$), and logistic-regression routing ($p = 0.003$). The difference relative to TF-IDF centroid routing was close to the conventional significance threshold ($p = 0.057$), whereas the difference relative to keyword-risk routing was not statistically significant ($p = 0.180$). These results suggest that the risk-veto and confidence-margin extension improved the robustness of the SAW-based router, but also that some simpler risk-oriented heuristics can approach its sufficiency performance.
Exact McNemar tests compare paired binary sufficiency outcomes at the prompt level. Discordant pairs indicate cases in which the two compared strategies differed in whether the response was classified as sufficient.

The comparison with recent LLM-routing research further clarifies the position of the proposed framework. Existing routing approaches, such as FrugalGPT-style cascades, learned weak/strong routing, IRT-based routing, and taxonomy-guided routing, show that adaptive model selection can improve the cost–performance trade-off of LLM deployment. FrugalGPT demonstrates the value of cascaded API/model selection for reducing cost while maintaining performance, whereas RouteLLM learns routing policies from preference data to decide when a weaker or stronger model should be used. IRT-Router introduces a more interpretable routing mechanism by modeling both LLM ability and query difficulty, and Select-then-Route extends the routing problem to larger model portfolios by first narrowing the candidate model pool and then applying routing within it. These approaches are highly relevant to the present study, but their primary orientation remains technical: they focus on expected response performance, preference-based routing, model ability, query difficulty, or model-pool selection. By contrast, the framework proposed in this article treats routing as an enterprise decision-support problem. Its main purpose is not to replace learned or semantic routers, but to provide an auditable governance layer in which escalation decisions are explicitly linked to managerial criteria such as business risk, required accuracy, reasoning depth, cost sensitivity, time sensitivity, standardization, and creativity.
In practical deployments, the proposed AHP/SAW layer could therefore complement learned routers: a learned model could estimate task difficulty or expected model performance, while the multicriteria layer could impose explicit organizational constraints related to risk, cost, standardization, and auditability.

4.6. Confidence-Margin and Weight-Sensitivity Analysis

The next analysis examined whether the routing results were sensitive to the value of the confidence-margin parameter ε. Table 11 reports the results for alternative margin values from ε = 0.00 to ε = 0.50. Across values from 0.00 to 0.30, the sufficiency rate remained unchanged at 94.4%, while the share of prompts assigned to GPT-4o-mini increased from 60.0% to 64.2%. As a result, the total cost decreased from USD 1.3308 to USD 1.2466 without loss of sufficiency. Only at ε = 0.50 did the sufficiency rate decline slightly, from 94.4% to 94.2%. These results indicate that the margin mechanism did not create a fragile decision boundary. Moderate increases in the confidence margin shifted additional borderline prompts to the cheaper model, lowering cost while preserving sufficiency. This supports the interpretation that many near-boundary prompts could be handled by the cheaper model without materially reducing response usefulness.
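The role of the margin can be illustrated with a small routing sketch. It assumes, in line with Appendix C, that C1, C2, C3, and C7 favor the stronger model and C4, C5, and C6 favor the cheaper one. Everything else is an assumption made for illustration: the weights are hypothetical (not the study's AHP weights), the linear 1–5 normalization, the veto threshold on C2, and the exact form of the margin rule (escalate only when the stronger-model score leads by more than ε) are plausible readings rather than the author's confirmed implementation.

```python
STRONG_CRITERIA = ("C1", "C2", "C3", "C7")  # favor the stronger model
CHEAP_CRITERIA = ("C4", "C5", "C6")         # favor the cheaper model

# Hypothetical weights for illustration only; NOT the study's AHP weights.
WEIGHTS = {"C1": 0.20, "C2": 0.25, "C3": 0.15, "C4": 0.10,
           "C5": 0.08, "C6": 0.15, "C7": 0.07}


def normalize(score):
    """Map a 1-5 score to [0, 1] (assumed linear normalization)."""
    return (score - 1) / 4.0


def route(scores, epsilon=0.10, veto_level=5):
    """Return 'strong' or 'cheap' for one prompt's criterion scores."""
    # Assumed risk veto: very high business risk escalates regardless of the gap.
    if scores["C2"] >= veto_level:
        return "strong"
    strong = sum(WEIGHTS[c] * normalize(scores[c]) for c in STRONG_CRITERIA)
    cheap = sum(WEIGHTS[c] * normalize(scores[c]) for c in CHEAP_CRITERIA)
    # Escalate only when the stronger model leads by more than the margin.
    return "strong" if strong - cheap > epsilon else "cheap"


def cheap_share(prompts, epsilon):
    """Share of prompts routed to the cheaper model at a given margin."""
    routed = [route(p, epsilon) for p in prompts]
    return routed.count("cheap") / len(routed)


# Toy prompts (made up): routine, borderline, and high-risk.
PROMPTS = [
    {"C1": 1, "C2": 1, "C3": 1, "C4": 4, "C5": 4, "C6": 5, "C7": 1},
    {"C1": 3, "C2": 3, "C3": 3, "C4": 3, "C5": 3, "C6": 3, "C7": 3},
    {"C1": 4, "C2": 5, "C3": 4, "C4": 2, "C5": 2, "C6": 2, "C7": 2},
]

for eps in (0.00, 0.10, 0.30, 0.50):
    print(f"epsilon={eps:.2f} cheap share={cheap_share(PROMPTS, eps):.2f}")
```

Under this reading, raising ε can only move borderline prompts from the stronger to the cheaper model, while vetoed prompts stay escalated at every margin value.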
A separate perturbation-based sensitivity analysis was conducted to test whether moderate changes in the AHP-derived criterion weights would materially change the routing results. The summary of this analysis is shown in Table 12. Across the simulated perturbations, the median sufficiency rate remained 94.4%, and both the 2.5th and 97.5th percentiles were also 94.4%. The average cost per prompt varied only slightly, from USD 0.00254 at the 2.5th percentile to USD 0.00264 at the 97.5th percentile. The cheap-model share varied between 60.6% and 63.0%, while the strong-model share varied between 37.0% and 39.4%. These findings indicate that the proposed routing strategy was not highly sensitive to moderate weight perturbations. This is important from the perspective of organizational adoption because AHP weights may vary across expert panels, departments, or use cases. The results suggest that the main cost–sufficiency conclusion remains stable under plausible changes in criterion weights.
Quantiles were calculated independently for each metric; therefore, cheap- and strong-model shares within the same quantile row should not be interpreted as paired values from a single simulation run.
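A perturbation analysis of this kind, and the caveat about independently computed quantiles, can be sketched as follows. The perturbation scheme (a uniform multiplicative factor followed by renormalization), the base weights, and the toy per-run metrics are assumptions chosen for illustration; the source does not specify the scheme actually used.

```python
import random


def perturb_weights(weights, delta, rng):
    """Scale each weight by a factor in [1 - delta, 1 + delta], then renormalize."""
    scaled = {k: w * rng.uniform(1 - delta, 1 + delta) for k, w in weights.items()}
    total = sum(scaled.values())
    return {k: v / total for k, v in scaled.items()}


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Hypothetical base weights; NOT the study's AHP weights.
BASE = {"C1": 0.20, "C2": 0.25, "C3": 0.15, "C4": 0.10,
        "C5": 0.08, "C6": 0.15, "C7": 0.07}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
runs = [perturb_weights(BASE, 0.20, rng) for _ in range(1000)]

# Toy per-run metrics standing in for cheap- and strong-model shares:
# within every single run they sum to exactly 1.
cheap_like = [w["C4"] + w["C5"] + w["C6"] for w in runs]
strong_like = [1.0 - x for x in cheap_like]

low_cheap = percentile(cheap_like, 2.5)
low_strong = percentile(strong_like, 2.5)
# The two 2.5th percentiles come from different runs, so they need not sum to 1.
print(round(low_cheap + low_strong, 4))
```

This is why the lower quantiles of the cheap- and strong-model shares in the same row need not add up to 100%: each quantile is taken over the full set of simulated runs for its own metric.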

4.7. Failure-Mode Analysis

To better understand the practical implications of routing errors, a qualitative failure-mode analysis was conducted on the basis of the observed routing logic, score-gap behavior, and risk-veto mechanism. The analysis focused on cases in which the routing decision could be problematic from an enterprise perspective: prompts assigned to the cheaper model despite high business risk or strong reasoning requirements, prompts escalated to the stronger model despite routine and standardized structure, and borderline cases with a small SAW score gap (Table 13). These cases are important because aggregate sufficiency rates can hide asymmetric error costs. In enterprise settings, a false cheap allocation may be more harmful than a false strong allocation: the former can produce an insufficient or risky answer, whereas the latter mainly increases cost.
The most important failure mode is the false cheap allocation, because it may reduce response sufficiency in cases where the business consequences of an error are high. The risk-veto rule was introduced specifically to address this asymmetry. However, it does not eliminate all routing risk, because the veto itself depends on the router’s ex ante assessment of business risk. This means that the framework should be monitored using prompt-level audits, especially for high-risk departments such as legal/compliance, HR, finance, customer escalation, and security-related workflows.

4.8. Ablation and Method-Robustness Analysis

The next analysis examined whether the routing results depended specifically on the additive SAW aggregation rule or whether comparable outcomes could be obtained using alternative multicriteria aggregation methods. Table 14 compares the final SAW strategy with confidence margin and risk veto against SAW without margin, multiplicative SAW, and TOPSIS. The proposed strategy achieved the highest sufficiency rate among the tested MCDM variants (SR = 94.4%), while SAW without margin, multiplicative SAW, and TOPSIS each reached SR = 92.6%. The comparison shows that the improvement was not produced by SAW alone, but by the combination of SAW with confidence-margin and risk-veto extensions. The basic SAW variant without margin routed a larger share of prompts to GPT-5 than multiplicative SAW and TOPSIS, but still achieved lower sufficiency than the final strategy. This indicates that the risk-veto mechanism was important for protecting high-risk cases, whereas the confidence margin helped avoid unnecessary escalation in ambiguous cases.
To assess the contribution of individual criteria and criterion groups, an ablation analysis was conducted. Table 15 reports the results of rerunning the routing procedure after removing selected criteria or using reduced criterion sets. The full SAW strategy achieved SR = 94.4%, with an average cost per prompt of USD 0.00258. Removing most individual criteria did not materially reduce sufficiency: variants without accuracy, business risk, reasoning depth, cost sensitivity, or creativity also achieved SR = 94.4%. However, these variants differed in cost and model-selection shares, showing that the criteria influenced how the same sufficiency level was achieved. The most visible sufficiency decrease occurred when time sensitivity was removed: SR declined from 94.4% to 94.0%, although the average cost per prompt also decreased to USD 0.00251. Removing standardization did not reduce sufficiency and even produced SR = 94.6%, but at a substantially higher average cost per prompt (USD 0.00291) and a larger share of GPT-5 selections. This suggests that standardization played an important cost-containment role by helping identify prompts that could be safely assigned to the cheaper model. Reduced criterion sets also produced informative results. The cost-only and cost–time–standardization variants reached SR = 94.6% with costs close to the full strategy, whereas the risk-only variant reached SR = 94.2%. The risk–quality-only variant preserved SR = 94.4%, but at a much higher average cost per prompt (USD 0.00328), because it escalated 55.2% of prompts to GPT-5. Overall, the ablation analysis indicates that multiple criterion configurations can produce high sufficiency, but the full framework provides a more balanced and managerially interpretable cost–sufficiency allocation.

4.9. Stratified Performance Across Business Functions, Task Types, and Risk Levels

The final analysis examined whether the proposed routing strategy behaved consistently across different categories of prompts. Table 16 reports the performance of SAW with confidence margin and risk veto by business function. The sufficiency rate remained high across all business functions, ranging from 92.3% for HR prompts to 98.1% for IT/Security prompts. The router also maintained substantial use of the cheaper model in every business function, with GPT-4o-mini shares ranging from 55.6% in Customer Support to 68.6% in Sales. This indicates that the router did not rely on a single business-domain pattern, but distributed prompts between model tiers across the whole dataset.
Table 17 presents the same analysis by task type. The highest sufficiency rates were obtained for risk assessment (100.0%), business email drafting (98.1%), and process improvement (97.9%). The lowest sufficiency rates were observed for creative generation (89.3%) and summarization (89.4%). These two categories, therefore, appear to be the most difficult for the proposed router under the adopted evaluation protocol. The model-selection shares also differed across task types. Risk assessment tasks were escalated to GPT-5 most frequently (52.0%), whereas creative-generation tasks were most frequently handled by GPT-4o-mini (73.2%). This pattern is consistent with the intended managerial interpretation of the criteria: high-risk tasks were more often escalated, while more standardized or lower-risk tasks were more often assigned to the cheaper model.
Finally, Table 18 compares the proposed strategy with the two fixed strategies by risk level. For low-risk prompts, all three strategies achieved SR = 100.0%, but SAW with confidence margin and risk veto reduced the average cost per prompt relative to always-strong by assigning 94.6% of low-risk prompts to GPT-4o-mini. For medium-risk prompts, the proposed strategy achieved SR = 98.8%, close to always-strong (99.4%), while still assigning 85.3% of prompts to GPT-4o-mini. The high-risk category produced the most important boundary condition. In this group, SAW with confidence margin and risk veto achieved the same sufficiency rate as always-strong (84.7%), but its average cost per prompt was higher because the strategy added router overhead while still escalating most high-risk prompts to GPT-5. This result shows that the proposed router is most economically useful when the prompt stream contains a mix of low-, medium-, and high-risk tasks. For uniformly high-risk prompt streams, direct use of the stronger model may be more cost-efficient because the router has a limited opportunity to assign prompts to the cheaper model.
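The boundary condition for high-risk streams can be made explicit with a simple break-even argument. The notation below is introduced here for illustration and does not appear in the article: $c_r$ is the average router cost per prompt, $c_c$ and $c_s$ are the average response costs of the cheaper and stronger model, and $p$ is the share of prompts routed to the cheaper model. As a simplification, the average response costs are treated as independent of which prompts are routed to which model.

$$C_{\mathrm{route}} = c_r + p\,c_c + (1 - p)\,c_s, \qquad C_{\mathrm{strong}} = c_s$$

$$C_{\mathrm{route}} < C_{\mathrm{strong}} \iff p\,(c_s - c_c) > c_r \iff p > \frac{c_r}{c_s - c_c} \quad (c_s > c_c)$$

Under these assumptions, routing lowers cost only when the cheap-model share exceeds the ratio of router overhead to the per-prompt price gap. In a stream where almost every prompt is escalated, $p$ approaches zero and the overhead $c_r$ is no longer recovered.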
Overall, the stratified results confirm that the proposed routing framework produced its strongest economic advantage in heterogeneous prompt streams where low- and medium-risk tasks could be safely assigned to GPT-4o-mini, while high-risk tasks were escalated to GPT-5. The results also identify important limitations: some task types, especially creative generation and summarization, achieved lower sufficiency than the aggregate average, and high-risk-only use cases may reduce or eliminate the cost advantage of routing because most prompts require escalation. These findings are important for interpreting the framework not as a universal replacement for stronger models, but as a decision-support mechanism for mixed enterprise workloads.

5. Conclusions

5.1. Main Findings

This article proposed and empirically evaluated an interpretable multicriteria decision-support framework for enterprise LLM routing. The framework combines AHP-based elicitation of organizational criterion weights with SAW-based prompt-level routing, extended by a confidence margin and a risk-veto mechanism. Its purpose is to support auditable model selection between a cheaper and a stronger LLM while taking into account organizational preferences related to required accuracy, business risk, reasoning depth, cost sensitivity, response-time sensitivity, standardization, and creativity.
The empirical evaluation on 500 heterogeneous business prompts showed that the proposed routing strategy can achieve a sufficiency level close to the always-strong strategy while substantially reducing token-level cost. SAW with confidence margin and risk veto achieved a sufficiency rate of 94.4%, compared with 94.6% for always-strong and 86.8% for always-cheap. At the same time, it reduced total cost by 37.4% relative to always-strong, with only a 0.2 percentage-point decrease in sufficiency. Paired McNemar tests showed that the proposed router significantly outperformed always-cheap, whereas its difference from always-strong was not statistically significant. This indicates that, for the analyzed prompt stream, a large share of prompts could be routed to GPT-4o-mini without a statistically meaningful loss of response sufficiency.
The results also show that the value of the framework is not limited to comparison with fixed strategies. The proposed routing strategy achieved higher sufficiency than the evaluated heuristic, alternative MCDM, and learned routing baselines, including keyword-risk routing, token-threshold routing, TF-IDF centroid routing, logistic-regression routing, multiplicative SAW, TOPSIS, and SAW without confidence margin. Robustness analyses further indicated that the results were stable under moderate changes in the confidence-margin parameter and perturbations of AHP-derived criterion weights. The ablation analysis showed that different criterion configurations can produce high sufficiency, but the full framework provides a more balanced and managerially interpretable allocation between cost and response sufficiency.

5.2. Theoretical Contribution

The main contribution of the article is not the claim that AHP and SAW are universally superior to semantic, embedding-based, or learned LLM routers. Rather, the contribution lies in showing that enterprise LLM routing can be operationalized as an explicit decision-support process. In this process, organizational preferences are translated into computable prompt-level criteria, model-selection decisions are auditable, and the cost–sufficiency trade-off can be evaluated using token-level costs, response sufficiency, statistical tests, and robustness checks.
This contribution is particularly relevant for enterprise AI governance, where model-selection decisions often need to be justified not only technically, but also economically, operationally, and managerially. The proposed framework demonstrates that routing between LLMs can be treated as a multicriteria organizational decision rather than only as a technical optimization problem. In this sense, the framework complements existing learned, semantic, and cascade-based routers by adding an explicit governance layer that makes routing criteria visible and inspectable.

5.3. Managerial Implications

From a managerial perspective, the findings suggest that enterprises should not treat LLM model selection as a simple binary policy between always using the cheapest model and always using the strongest model. A structured routing mechanism can support a more differentiated allocation of prompts. Routine, standardized, and low-risk tasks can often be assigned to a cheaper model, whereas risk-sensitive, accuracy-critical, or cognitively demanding tasks can be escalated to a stronger one.
The risk-veto mechanism is especially important in this context because it prevents high business risk from being compensated by cost or time sensitivity in a purely additive scoring rule. This is relevant for enterprise settings in which some errors may have consequences that cannot be fully captured by average cost–quality metrics. The framework, therefore, provides not only an economic mechanism for reducing model-use costs, but also a governance mechanism for making escalation decisions more transparent.
The framework is most useful for heterogeneous enterprise prompt streams. The stratified results showed that low- and medium-risk prompts could often be handled by GPT-4o-mini while maintaining high sufficiency, whereas high-risk prompts were predominantly escalated to GPT-5. However, this also identifies an important boundary condition: in uniformly high-risk workloads, routing may offer less economic advantage, because most prompts require escalation, and the additional router overhead may reduce or eliminate cost savings. In such contexts, direct use of the stronger model or a more conservative routing policy may be preferable.

5.4. Limitations

Several limitations should be acknowledged. First, the study was conducted in a single organizational setting and on a synthetic but business-oriented prompt dataset. Although the dataset was stratified by business function, task type, risk level, and industry context, it cannot represent the full diversity of enterprise prompt streams. Real organizations differ in risk tolerance, regulatory exposure, communication standards, domain vocabulary, and acceptable levels of response uncertainty. The results should therefore be interpreted as evidence from a proof-of-concept enterprise routing scenario rather than as a universally generalizable benchmark.
Second, the study used a two-tier model setup involving one cheaper response model and one stronger response model. This design makes the routing problem interpretable and allows a clear analysis of cost–sufficiency trade-offs, but it simplifies real enterprise model portfolios. In practice, organizations may select among several proprietary, open-source, local, domain-specific, or cross-vendor models with different context windows, latency profiles, security constraints, and deployment costs. The present study, therefore, validates the decision framework in a controlled two-model setting, not the general superiority of the selected models across all enterprise contexts.
Third, response sufficiency was assessed using a structured LLM-as-a-judge protocol rather than independent human expert evaluation. This enabled scalable and consistent assessment across 500 prompts and two response models, but it remains a limitation because LLM evaluators may reproduce model-specific biases, overlook organizational nuance, or fail to capture the political and contextual acceptability of business communication. Future work should validate the sufficiency results with blind human evaluators and report inter-rater reliability.
Fourth, the AHP weights reflect the judgments of a limited expert panel. AHP is useful because it makes organizational preferences explicit, but it is also subjective and sensitive to the composition of the expert group, the interpretation of criteria, and the consistency of pairwise comparisons. Different organizations may assign substantially different weights to cost, risk, reasoning depth, standardization, or response-time sensitivity. For this reason, AHP-derived weights should not be treated as universal constants; they require organizational calibration and periodic review.
Fifth, the SAW aggregation rule is additive and partly compensatory. This means that a high value on one criterion can offset a low value on another criterion. In LLM routing, such compensation may be problematic when business risk or compliance sensitivity should not be offset by cost or response-time advantages. The risk-veto rule partly addresses this limitation by introducing a non-compensatory escalation mechanism, but it does not remove all interactions among criteria. Future work should compare SAW with non-compensatory MCDM methods, rule-based safety constraints, and hybrid learned-governance routers.
Sixth, the evaluation did not include open-source or cross-vendor model portfolios. The findings, therefore, concern the proposed routing logic rather than a comprehensive comparison of model providers. Additional studies should test whether the same framework remains effective when the alternatives include local open-source models, cross-vendor APIs, domain-specific models, and models with different privacy or deployment constraints.

5.5. Future Research

Future research should extend the evaluation across multiple organizations, industries, languages, and model portfolios. This would make it possible to assess whether the observed cost–sufficiency trade-off generalizes beyond the present context. Another important direction is the development of hybrid routing architectures that combine semantic or learned routing with an explicit managerial decision layer. Such architectures could use embeddings, reward models, or confidence estimators to capture semantic difficulty, while retaining AHP/SAW-style transparency for risk, cost, and governance constraints.
Future studies should also investigate dynamic adaptation mechanisms. Instead of treating organizational weights as fixed, routing systems could update them in response to concept drift, changes in prompt distribution, model updates, user feedback, or observed sufficiency failures. Finally, future work should extend the framework from two-model routing to multi-model enterprise portfolios. In real deployments, model selection may involve several alternatives with different costs, capabilities, context windows, and governance profiles. Extending the proposed approach to multi-alternative MCDM routing would provide a richer representation of enterprise LLM architecture.

5.6. Final Conclusions

Overall, the study shows that enterprise LLM routing can be designed as an interpretable, auditable, and economically informed decision-support process. By combining explicit organizational preferences, prompt-level scoring, token-level cost measurement, sufficiency evaluation, and robustness analysis, the proposed framework provides a transparent basis for allocating prompts between model tiers. Its value lies not in replacing technical routing methods, but in complementing them with a managerial layer that makes routing decisions understandable and governable in enterprise settings.

Funding

This research received no external funding.

Institutional Review Board Statement

In Poland, the requirement to obtain approval from a Bioethics Committee applies to medical experiments involving human subjects, as defined in the Act of 5 December 1996 on the professions of physician and dentist (Journal of Laws 1997 No. 28, item 152, as amended), in particular Articles 21 and 29. The present study does not constitute a medical experiment within the meaning of this Act. It is a non-interventional study conducted in an organizational context, involving adult participants acting in an expert capacity. The study did not involve any medical procedures, interventions, or the collection of health-related or other sensitive personal data. Moreover, in accordance with Regulation (EU) 2016/679 (General Data Protection Regulation, GDPR), the study did not involve the processing of personal data, as no personally identifiable information was collected and participation was fully anonymous. Therefore, under the applicable Polish legal framework governing medical experiments and data protection, formal approval from a Bioethics Committee was not required for this type of study.

Informed Consent Statement

Participation in the study was voluntary, and participants were informed about the purpose of the study and their right to withdraw at any time. No written consent was required, as the study was anonymous and did not involve the collection of personally identifiable or sensitive data.

Data Availability Statement

The prompt dataset, routing evaluations, and aggregated assessment results are available from the corresponding author upon reasonable request. The data are not publicly available due to organizational confidentiality considerations.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Prompt-Scoring Router Protocol

The prompt-scoring router was instructed to evaluate each input prompt before response generation and to return structured JSON scores for the seven routing criteria. The router did not evaluate generated answers; it evaluated only the prompt and its routing-relevant characteristics.
You are a routing agent in an enterprise AI system.
Your task is to assess whether a given prompt requires escalation to a stronger model, or can be handled by a cheaper model.
Evaluate the prompt from the perspective of a routing decision. Important principles:
1. Do not inflate scores only because the task sounds professional.
2. Typical operational, communicative, and template-based tasks should not automatically receive high C1/C2/C3 scores.
3. High standardization (C6) favors the cheaper model.
4. High cost sensitivity (C4) and high time sensitivity (C5) favor the cheaper model.
5. Assess the need for escalation, not the general importance of the task.
6. Do not compare generated answers. Evaluate only the input prompt before response generation.
Scales (1–5):
C1_Accuracy—required accuracy relative to the expected capabilities of the cheaper model
1 = the cheaper model is very likely to be fully sufficient
2 = the cheaper model should be sufficient with only minor risk
3 = borderline task, with no clear model advantage
4 = higher accuracy clearly favors the stronger model
5 = the required accuracy definitely justifies escalation to the stronger model
C2_Business_Risk—potential business consequences of an error
1 = a possible error has little business significance
2 = an error would be undesirable, but not critical
3 = an error would have moderate business significance
4 = an error could have serious consequences
5 = the potential consequences of an error definitely justify cautious escalation
C3_Reasoning_Depth—required depth of reasoning
1 = simple, routine, or template-based task
2 = minor analysis or organization of information
3 = moderate reasoning
4 = complex, multi-step reasoning
5 = deep reasoning clearly justifies escalation to the stronger model
C4_Cost_Sensitivity—importance of minimizing processing cost
1 = cost has little significance
2 = cost has minor significance
3 = cost has moderate significance
4 = cost is important and favors the cheaper model
5 = cost is very important and strongly favors the cheaper model
C5_Time_Sensitivity—importance of obtaining a rapid response
1 = response time has little significance
2 = time has minor significance
3 = time has moderate significance
4 = a rapid response is important
5 = response speed strongly favors the cheaper model
C6_Standardization—degree to which the task is template-based, predictable, or grounded in a standard response structure
1 = the response requires a non-standard approach
2 = the response is rather non-standard
3 = the response is partially standardized
4 = the response is highly template-based
5 = the response is strongly standardized and predictable
C7_Creativity—need for a creative, conceptual, or non-standard approach
1 = creativity is not needed
2 = a small degree of creativity may help
3 = moderate creativity is useful
4 = high creativity is useful
5 = the task clearly requires a more creative or exploratory model
Guidelines:
- Simple summaries, routine emails, lists, and standard messages should usually receive lower C1/C2/C3 scores and higher C4/C5/C6 scores.
- Legal, strategic, compliance-related, security-related, financial, or otherwise high-risk tasks should usually receive higher C1/C2/C3 scores.
- Do not assign multiple criteria a score of 5 without a clear justification.
- Return only valid JSON. Do not include markdown, explanations outside JSON, or code fences.
Required JSON format:
{
  "C1_Accuracy": {
    "score": 1,
    "justification": "…"
  },
  "C2_Business_Risk": {
    "score": 1,
    "justification": "…"
  },
  "C3_Reasoning_Depth": {
    "score": 1,
    "justification": "…"
  },
  "C4_Cost_Sensitivity": {
    "score": 1,
    "justification": "…"
  },
  "C5_Time_Sensitivity": {
    "score": 1,
    "justification": "…"
  },
  "C6_Standardization": {
    "score": 1,
    "justification": "…"
  },
  "C7_Creativity": {
    "score": 1,
    "justification": "…"
  },
  "overall_comment": "…"
}
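Because the router is asked to return only valid JSON, a downstream component would typically validate the reply before any scores are aggregated. The following is a minimal sketch of such a check, assuming the key names from the required format above; the function name `parse_router_output` and the decision to reject out-of-range or non-integer scores are illustrative choices, not part of the reported protocol.

```python
import json

CRITERIA = (
    "C1_Accuracy", "C2_Business_Risk", "C3_Reasoning_Depth",
    "C4_Cost_Sensitivity", "C5_Time_Sensitivity",
    "C6_Standardization", "C7_Creativity",
)


def parse_router_output(raw):
    """Parse the router's JSON reply and return {criterion: score}.

    Raises ValueError if the reply is not valid JSON, a criterion is
    missing, or a score is not an integer between 1 and 5.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("router output must be a JSON object")
    scores = {}
    for name in CRITERIA:
        entry = data.get(name)
        if not isinstance(entry, dict):
            raise ValueError(f"missing criterion: {name}")
        score = entry.get("score")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int):
            raise ValueError(f"non-integer score for {name}")
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range for {name}: {score}")
        scores[name] = score
    return scores


# Example reply built programmatically (illustrative values only).
reply = {name: {"score": 2, "justification": "routine task"} for name in CRITERIA}
reply["overall_comment"] = "cheaper model likely sufficient"
print(parse_router_output(json.dumps(reply)))
```

Rejecting malformed replies early keeps the routing decision auditable, since every routed prompt then has a complete set of seven scores.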

Appendix B. LLM-as-a-Judge Sufficiency-Evaluation Protocol

Response sufficiency was evaluated using a structured LLM-as-a-judge protocol. Each generated answer was evaluated independently by three evaluator profiles. The evaluator did not receive the real model name and was instructed not to compare models. The task was to determine whether the answer was acceptable for organizational use without significant substantive revision.
The three evaluator profiles were defined as follows:
R1: You are a pragmatic business manager. Focus on whether the answer can be used immediately in operations.
R2: You are a quality and compliance reviewer. Be strict about missing constraints, risk, and unsupported claims.
R3: You are an AI adoption manager. Focus on usefulness, completeness, specificity, and whether revisions would be needed.
The general judge instruction was:
You are an independent evaluator of enterprise LLM outputs.
Use a strict but realistic business standard. Do not reward answers that are merely fluent.
A sufficient answer must be practically usable with no significant substantive revision.
Important:
- Generic answers should be penalized.
- Missing required elements should be penalized.
- High-risk tasks require higher precision, caution, and completeness.
- A response that looks professional but avoids the actual task is insufficient.
- A response may be concise, but it must still answer the prompt fully.
- Do not compare models. Evaluate only the given answer.
Return only a JSON object. No markdown. No text outside JSON.
For each response, the following evaluation input was supplied:
Return the evaluation as a JSON object only.
Evaluator profile:
{rater_profile}
Evaluate the answer using these dimensions from 1 to 5:
1 = very poor
2 = weak
3 = acceptable but requires noticeable revision
4 = good and usable with minor edits
5 = very strong and ready to use
JSON keys required:
- instruction_following_1_5
- completeness_1_5
- specificity_1_5
- business_usability_1_5
- risk_safety_1_5
- style_format_1_5
- major_issue
- main_issue
- comment
Definitions:
instruction_following_1_5: Did the answer follow all important prompt instructions?
completeness_1_5: Did it cover all required elements?
specificity_1_5: Is it concrete and tailored, not generic?
business_usability_1_5: Could it be used in business without significant revision?
risk_safety_1_5: Is it safe, cautious, and appropriate for the stated risk level?
style_format_1_5: Is the format and tone appropriate?
major_issue = 1 if the answer has a serious problem requiring substantive revision; otherwise 0.
main_issue = one of:
- none
- incomplete
- not_following_prompt
- too_generic
- factual_or_reasoning_problem
- risk_or_compliance_problem
- style_or_format_problem
- other
comment = short justification, max 25 words.
Business function:
{business_function}
Task type:
{task_type}
Risk level:
{risk_level}
PROMPT:
{prompt_text}
ANSWER TO EVALUATE:
{answer_text}
The judge returned the following JSON structure:
{
  "instruction_following_1_5": 1,
  "completeness_1_5": 1,
  "specificity_1_5": 1,
  "business_usability_1_5": 1,
  "risk_safety_1_5": 1,
  "style_format_1_5": 1,
  "major_issue": 0,
  "main_issue": "none",
  "comment": "…"
}
The binary sufficiency label was then derived deterministically from the judge scores. The core quality score was calculated as:
$$\mathrm{AvgCore} = \frac{\mathrm{InstructionFollowing} + \mathrm{Completeness} + \mathrm{Specificity} + \mathrm{BusinessUsability} + \mathrm{RiskSafety}}{5}$$
The style and format score was recorded but was not included in the core quality average. A response was classified as non-sufficient if any of the following conditions held:
$$\mathrm{MajorIssue} = 1$$
or
$$\min(\mathrm{InstructionFollowing}, \mathrm{Completeness}, \mathrm{Specificity}, \mathrm{BusinessUsability}, \mathrm{RiskSafety}) \le 2.$$
For high-risk prompts, the response was classified as sufficient only if:
$$\mathrm{RiskSafety} \ge 4, \quad \mathrm{BusinessUsability} \ge 4, \quad \text{and} \quad \mathrm{AvgCore} \ge 3.8.$$
For medium-risk prompts, the response was classified as sufficient only if:
$$\mathrm{BusinessUsability} \ge 4 \quad \text{and} \quad \mathrm{AvgCore} \ge 3.6.$$
For low-risk prompts, the response was classified as sufficient only if:
$$\mathrm{BusinessUsability} \ge 3 \quad \text{and} \quad \mathrm{AvgCore} \ge 3.4.$$
The final response-level sufficiency label was determined by majority vote across the three evaluator profiles. A response was classified as sufficient if at least two out of the three evaluator profiles classified it as sufficient.
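The deterministic labeling and the majority vote can be expressed compactly in code. The sketch below assumes that the threshold comparisons are read as "at least" (for the sufficiency conditions) and "at most 2" (for the minimum-score exclusion), and that risk levels are encoded as the strings "high", "medium", and "low"; these encodings and the function names are hypothetical.

```python
CORE_KEYS = (
    "instruction_following_1_5", "completeness_1_5", "specificity_1_5",
    "business_usability_1_5", "risk_safety_1_5",
)

# Minimum values per risk level (assumed to be inclusive lower bounds).
THRESHOLDS = {
    "high": {"business": 4, "risk": 4, "avg": 3.8},
    "medium": {"business": 4, "risk": None, "avg": 3.6},
    "low": {"business": 3, "risk": None, "avg": 3.4},
}


def is_sufficient(judgement, risk_level):
    """Derive one evaluator's binary sufficiency label from its JSON scores."""
    core = [judgement[k] for k in CORE_KEYS]
    # Exclusion rules: a flagged major issue or any core score of 2 or lower.
    if judgement["major_issue"] == 1 or min(core) <= 2:
        return False
    limits = THRESHOLDS[risk_level]
    if judgement["business_usability_1_5"] < limits["business"]:
        return False
    if limits["risk"] is not None and judgement["risk_safety_1_5"] < limits["risk"]:
        return False
    avg_core = sum(core) / len(core)  # style/format score is excluded
    # Small tolerance guards against floating-point rounding at the boundary.
    return avg_core >= limits["avg"] - 1e-9


def majority_sufficient(judgements, risk_level):
    """Response-level label: at least two of three evaluator profiles agree."""
    votes = sum(is_sufficient(j, risk_level) for j in judgements)
    return votes >= 2


# Illustrative judgement (made-up scores) for a medium-risk prompt.
example = {
    "instruction_following_1_5": 4, "completeness_1_5": 3, "specificity_1_5": 3,
    "business_usability_1_5": 4, "risk_safety_1_5": 4,
    "style_format_1_5": 5, "major_issue": 0,
}
print(is_sufficient(example, "medium"), is_sufficient(example, "high"))
```

Because the thresholds tighten with risk level, the same set of judge scores can yield a sufficient label for a medium-risk prompt and a non-sufficient label for a high-risk one.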

Appendix C. Reproducibility Checklist

The following elements were used to improve the reproducibility of the empirical procedure:
  • Prompt dataset: 500 prompts with metadata fields for prompt ID, business function, task type, risk level, industry context, and prompt text.
  • Prompt-scoring criteria: seven criteria C1–C7, scored on a five-point scale.
  • AHP procedure: pairwise comparisons by organizational experts, geometric aggregation of expert judgments, consistency-ratio calculation, and normalized group weights.
  • Routing rule: SAW aggregation with criteria C1, C2, C3, and C7 favoring the stronger model, and C4, C5, and C6 favoring the cheaper model.
  • Decision extensions: confidence margin ε = 0.10 and risk-veto rule for high-risk cases.
  • Models: GPT-5-nano as prompt-scoring router, GPT-4o-mini as cheaper response model, and GPT-5 as stronger response model.
  • Cost accounting: response-model and router-model costs calculated from recorded input and output token counts.
  • Latency accounting: router latency and response latency recorded separately.
  • Evaluation: structured LLM-as-a-judge sufficiency assessment with constrained JSON output.
  • Main output files: prompt scores, cheaper-model responses, stronger-model responses, strategy-level metrics, stratified metrics, sensitivity analysis, and prompt-level routing outcomes.
The router and evaluator prompts should be reported in full to enable replication. The reported results should be interpreted in light of the absence of independent human validation and the limited two-model response setup.
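The routing rule in the checklist can be illustrated with a short sketch. Only the weights (Table 4), the split of criteria between the two models, and ε = 0.10 come from the article. The min-max normalization of the five-point scores, the exact form of the margin test, and the veto threshold on C2 are assumptions made here for illustration. The margin direction was chosen so that a larger ε keeps more prompts on the cheaper model, which matches the trend in Table 11.

```python
# AHP-derived weights from Table 4.
WEIGHTS = {"C1": 0.2591, "C2": 0.1337, "C3": 0.0874, "C4": 0.2086,
           "C5": 0.1894, "C6": 0.0922, "C7": 0.0297}
FAVOR_STRONG = ("C1", "C2", "C3", "C7")
FAVOR_CHEAP = ("C4", "C5", "C6")


def normalize(score):
    """Map a 1-5 score to [0, 1] (assumed normalization)."""
    return (score - 1) / 4


def saw_scores(scores):
    """Weighted additive score for each alternative: (cheap, strong)."""
    strong = sum(WEIGHTS[c] * normalize(scores[c]) for c in FAVOR_STRONG)
    cheap = sum(WEIGHTS[c] * normalize(scores[c]) for c in FAVOR_CHEAP)
    return cheap, strong


def route(scores, epsilon=0.10, veto_risk=5):
    """Return 'cheap' or 'strong'; veto_risk is a hypothetical C2 threshold."""
    # Non-compensatory risk veto: escalate regardless of the SAW scores.
    if scores["C2"] >= veto_risk:
        return "strong"
    cheap, strong = saw_scores(scores)
    # Escalate only when the stronger model wins by more than the margin.
    return "strong" if strong - cheap > epsilon else "cheap"
```

Under these assumptions, a prompt with C1 = 2 and all other criteria at 1 stays on the cheaper model at ε = 0.10 (score gap of about 0.065) but would be escalated at ε = 0.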

References

  1. Kumar, P. Large Language Models (LLMs): Survey, Technical Frameworks, and Future Challenges. Artif. Intell. Rev. 2024, 57, 260.
  2. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A Comprehensive Overview of Large Language Models. ACM Trans. Intell. Syst. Technol. 2025, 16, 1–72.
  3. Wamba-Taguimdje, S.-L.; Fosso Wamba, S.; Kala Kamdjoug, J.R.; Tchatchouang Wanko, C.E. Influence of Artificial Intelligence (AI) on Firm Performance: The Business Value of AI-Based Transformation Projects. Bus. Process Manag. J. 2020, 26, 1893–1924.
  4. Sestino, A.; De Mauro, A. Leveraging Artificial Intelligence in Business: Implications, Applications and Methods. Technol. Anal. Strateg. Manag. 2022, 34, 16–29.
  5. Le Dinh, T.; Vu, M.-C.; Tran, G.T.C. Artificial Intelligence in SMEs: Enhancing Business Functions through Technologies and Applications. Information 2025, 16, 415.
  6. Kanbach, D.K.; Heiduk, L.; Blueher, G.; Schreiter, M.; Lahmann, A. The GenAI Is out of the Bottle: Generative Artificial Intelligence from a Business Model Innovation Perspective. Rev. Manag. Sci. 2024, 18, 1189–1220.
  7. Romeo, E.; Lacko, J. Adoption and Integration of AI in Organizations: A Systematic Review of Challenges and Drivers towards Future Directions of Research. Kybernetes 2026, 55, 1286–1307.
  8. Sánchez, M.A. Exploring Value Creation of Generative Artificial Intelligence in Organizations: A Systematic Review. Strateg. Bus. Res. 2025, 1, 100015.
  9. Brynjolfsson, E.; Li, D.; Raymond, L. Generative AI at Work. Q. J. Econ. 2025, 140, 889–942.
  10. Noy, S.; Zhang, W. Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence. Science 2023, 381, 187–192.
  11. Dell’Acqua, F.; McFowland, E., III; Mollick, E.; Lifshitz, H.; Kellogg, K.C.; Rajendran, S.; Krayer, L.; Candelon, F.; Lakhani, K.R. Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of Artificial Intelligence on Knowledge Worker Productivity and Quality. Organ. Sci. 2026, 37, 403–423.
  12. Papagiannidis, E.; Mikalef, P.; Conboy, K. Responsible Artificial Intelligence Governance: A Review and Research Framework. J. Strateg. Inf. Syst. 2025, 34, 101885.
  13. Schneider, J.; Abraham, R.; Meske, C.; Vom Brocke, J. Artificial Intelligence Governance for Businesses. Inf. Syst. Manag. 2023, 40, 229–249.
  14. Vidgof, M.; Bachhofner, S.; Mendling, J. Large Language Models for Business Process Management: Opportunities and Challenges. In Proceedings of the International Conference on Business Process Management, Utrecht, The Netherlands, 11–15 September 2023; pp. 107–123.
  15. Bernardi, M.L.; Casciani, A.; Cimitile, M.; Marrella, A. Conversing with Business Process-Aware Large Language Models: The BPLLM Framework. J. Intell. Inf. Syst. 2024, 62, 1607–1629.
  16. Kourani, H.; Berti, A.; Schuster, D.; van der Aalst, W.M.P. Process Modeling with Large Language Models. In Proceedings of the International Conference on Business Process Modeling, Development and Support, Limassol, Cyprus, 3–4 June 2024; pp. 229–244.
  17. Apaydin, K.; Zisgen, Y. Local Large Language Models for Business Process Modeling. In Proceedings of the International Conference on Process Mining, Lyngby, Denmark, 14–18 October 2024; pp. 605–609.
  18. Kourani, H.; Berti, A.; Schuster, D.; van der Aalst, W.M.P. Evaluating Large Language Models on Business Process Modeling: Framework, Benchmark, and Self-Improvement Analysis. Softw. Syst. Model. 2025, 1–36.
  19. Kampik, T.; Warmuth, C.; Rebmann, A.; Agam, R.; Egger, L.N.P.; Gerber, A.; Hoffart, J.; Kolk, J.; Herzig, P.; Decker, G.; et al. Large Process Models: A Vision for Business Process Management in the Age of Generative AI. KI-Künstliche Intell. 2025, 39, 81–95.
  20. Yue, M.; Zhao, J.; Zhang, M.; Du, L.; Yao, Z. Large Language Model Cascades with Mixture of Thought Representations for Cost-Efficient Reasoning. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; Volume 2024, pp. 21691–21728.
  21. Chen, L.; Zaharia, M.; Zou, J. How Is ChatGPT’s Behavior Changing over Time? Harv. Data Sci. Rev. 2024, 6.
  22. Šakota, M.; Peyrard, M.; West, R. Fly-Swat or Cannon? Cost-Effective Language Model Choice via Meta-Modeling. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, Merida, Mexico, 4–8 March 2024; pp. 606–615.
  23. Ong, I.; Almahairi, A.; Wu, V.; Chiang, W.-L.; Wu, T.; Gonzalez, J.E.; Kadous, M.; Stoica, I. RouteLLM: Learning to Route LLMs from Preference Data. In Proceedings of the International Conference on Learning Representations, Singapore, 24–28 April 2025; Volume 2025, pp. 34433–34448.
  24. Song, W.; Huang, Z.; Cheng, C.; Gao, W.; Xu, B.; Zhao, G.; Wang, F.; Wu, R. IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 15629–15644.
  25. Shah, S.; Shridhar, K. Select-Then-Route: Taxonomy Guided Routing for LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, Suzhou, China, 4–9 November 2025; pp. 425–441.
  26. Kazimieras Zavadskas, E.; Antucheviciene, J.; Chatterjee, P. Multiple-Criteria Decision-Making (MCDM) Techniques for Business Processes Information Management. Information 2018, 10, 4.
  27. Saaty, T.L. A Scaling Method for Priorities in Hierarchical Structures. J. Math. Psychol. 1977, 15, 234–281.
  28. Saaty, T.L. Decision Making with the Analytic Hierarchy Process. Sci. Iran. 2008, 1, 83–98.
  29. Hwang, C.-L.; Yoon, K. Multiple Attribute Decision Making: Methods and Applications a State-of-the-Art Survey; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012.
  30. Kaliszewski, I.; Podkopaev, D. Simple Additive Weighting—A Metamodel for Multiple Criteria Decision Analysis Methods. Expert Syst. Appl. 2016, 54, 155–161.
  31. Ciardiello, F.; Genovese, A. A Comparison between TOPSIS and SAW Methods. Ann. Oper. Res. 2023, 325, 967–994.
  32. Chakrabortty, R.K.; Abdel-Basset, M.; Ali, A.M. A Multi-Criteria Decision Analysis Model for Selecting an Optimum Customer Service Chatbot under Uncertainty. Decis. Anal. J. 2023, 6, 100168.
  33. Nowak, M.; Mierzwiak, R.; Butlewski, M. Occupational Risk Assessment with Grey System Theory. Cent. Eur. J. Oper. Res. 2020, 28, 717–732.
Table 1. Expert panel involved in AHP-based criterion weighting.

| Expert | Organizational Role | Perspective Represented in AHP | Relevance to Routing Decisions |
|---|---|---|---|
| E1 | Chief Executive Officer | Strategic management and cost–risk trade-offs | Defines organizational tolerance for cost, risk, and escalation to stronger models |
| E2 | Operations Manager | Operational usefulness, response time, and standardization | Assesses whether model outputs are sufficient for routine business processes |
| E3 | AI Implementation Manager | AI adoption, model capabilities, and implementation feasibility | Evaluates technical feasibility, model differentiation, and practical deployment constraints |

Table 2. Structure of the business-prompt dataset.

| Dimension | Category | Number of Prompts |
|---|---|---|
| Business function | Legal/Compliance | 53 |
| | IT/Security | 53 |
| | Finance | 52 |
| | HR | 52 |
| | Sales | 51 |
| | Marketing | 50 |
| | Operations | 50 |
| | Procurement | 47 |
| | Strategy | 47 |
| | Customer Support | 45 |
| Task type | Data interpretation | 58 |
| | Creative generation | 56 |
| | Decision support | 56 |
| | Business email | 52 |
| | Risk assessment | 50 |
| | Classification | 50 |
| | Process improvement | 47 |
| | Summarization | 47 |
| | Report synthesis | 44 |
| | Policy drafting | 40 |
| Risk level | High | 170 |
| | Low | 167 |
| | Medium | 163 |

Table 3. Model configuration used in the empirical evaluation.

| Component | Model | Role in the Study | Max Output Tokens | Reasoning Setting | Cost Basis |
|---|---|---|---|---|---|
| Prompt-scoring router | GPT-5-nano | Scores each prompt on C1–C7 before routing and returns structured JSON scores | API/script default; not explicitly overridden in the notebook | Not explicitly set in the notebook for prompt scoring | USD 0.05/1 M input tokens; USD 0.40/1 M output tokens |
| Cheaper response model | GPT-4o-mini | Generates responses for the cheaper-model alternative | 2000 | Not used | USD 0.15/1 M input tokens; USD 0.60/1 M output tokens |
| Stronger response model | GPT-5 | Generates responses for the stronger-model alternative | 3500 | Minimal reasoning effort | USD 1.25/1 M input tokens; USD 10.00/1 M output tokens |
| Sufficiency evaluator | GPT-5-nano | Evaluates response sufficiency in the LLM-as-a-judge protocol using three evaluator profiles | 900 | Minimal reasoning effort | Evaluation-only component; judge cost was not included in routing-strategy cost |

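The per-token prices in Table 3 translate into per-request costs as sketched below. The prices are taken from the table; the dictionary keys, function names, and token counts in the usage note are hypothetical and only illustrate how router overhead can be added to the response cost.

```python
# USD per 1 M tokens as (input, output), from Table 3.
PRICES = {
    "gpt-5-nano": (0.05, 0.40),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-5": (1.25, 10.00),
}


def call_cost(model, input_tokens, output_tokens):
    """Cost in USD of a single call, from recorded token counts."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def routed_cost(response_model, router_usage, response_usage):
    """Router overhead plus response cost; usages are (input, output) pairs."""
    return (call_cost("gpt-5-nano", *router_usage)
            + call_cost(response_model, *response_usage))
```

As an illustration with made-up token counts, a GPT-5 response with 1000 input and 500 output tokens costs USD 0.00625, of which USD 0.005 comes from output tokens, which is why output length dominates the cost of the stronger model.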
Table 4. AHP-derived weights of routing criteria.

| Criterion | Weight | Weight [%] |
|---|---|---|
| C1 Accuracy | 0.2591 | 25.91 |
| C4 Cost sensitivity | 0.2086 | 20.86 |
| C5 Time sensitivity | 0.1894 | 18.94 |
| C2 Business risk | 0.1337 | 13.37 |
| C6 Standardization | 0.0922 | 9.22 |
| C3 Reasoning depth | 0.0874 | 8.74 |
| C7 Creativity | 0.0297 | 2.97 |

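Weights of this kind are typically obtained from reciprocal pairwise-comparison matrices. The sketch below shows one common way to do this, assuming element-wise geometric-mean aggregation across experts, row geometric-mean prioritization, and Saaty's random-index values for the consistency ratio. The article does not state which prioritization method was used, and the experts' matrices are not reproduced here, so the functions are illustrative only.

```python
from math import prod

# Saaty's random consistency index for matrix sizes 3 to 7.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32}


def aggregate_experts(matrices):
    """Element-wise geometric mean of the experts' pairwise matrices."""
    k = len(matrices)
    n = len(matrices[0])
    return [[prod(m[i][j] for m in matrices) ** (1 / k) for j in range(n)]
            for i in range(n)]


def ahp_weights(matrix):
    """Row geometric-mean priorities, normalized to sum to 1."""
    n = len(matrix)
    row_gm = [prod(row) ** (1 / n) for row in matrix]
    total = sum(row_gm)
    return [g / total for g in row_gm]


def consistency_ratio(matrix, weights):
    """CR = CI / RI, with lambda_max estimated from (A w)_i / w_i."""
    n = len(matrix)
    lambda_max = sum(
        sum(matrix[i][j] * weights[j] for j in range(n)) / weights[i]
        for i in range(n)
    ) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]
```

For a perfectly consistent matrix such as [[1, 2, 4], [1/2, 1, 2], [1/4, 1/2, 1]], the weights are 4/7, 2/7, and 1/7 and the consistency ratio is zero; a ratio below 0.10 is the conventional acceptance threshold.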
Table 5. Mean prompt-level scores assigned by the routing model.

| Criterion | Mean Score |
|---|---|
| C6 Standardization | 4.73 |
| C4 Cost sensitivity | 3.88 |
| C2 Business risk | 3.36 |
| C3 Reasoning depth | 3.30 |
| C1 Accuracy | 3.07 |
| C5 Time sensitivity | 2.92 |
| C7 Creativity | 2.04 |

Table 6. Prompt-level routing structure under SAW with confidence margin and risk veto.

| Selected Model | Number of Prompts | Share [%] |
|---|---|---|
| GPT-4o-mini | 310 | 62.0 |
| GPT-5 | 190 | 38.0 |
| Total | 500 | 100.0 |

Table 7. Routing decisions by prompt risk level.

| Risk Level | GPT-4o-Mini | GPT-5 | Total |
|---|---|---|---|
| Low | 158 | 9 | 167 |
| Medium | 139 | 24 | 163 |
| High | 13 | 157 | 170 |
| Total | 310 | 190 | 500 |

Table 8. Agreement among automated LLM-as-a-judge evaluator profiles.

| Scope | Model | Fleiss’ Kappa | Number of Evaluated Items |
|---|---|---|---|
| All responses | All models | 0.445 | 1000 |
| By model | GPT-4o-mini | 0.460 | 500 |
| By model | GPT-5 | 0.373 | 500 |

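For reference, Fleiss' kappa for a fixed number of raters per item can be computed as sketched below. The input layout (one row per response, one column per label, entries counting how many of the three evaluator profiles chose that label) is an assumption about how the judge outputs would be tabulated; the formula itself is the standard one.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = raters assigning item i to category j.

    Every row must sum to the same number of raters. The statistic is
    undefined (division by zero) when all ratings fall in one category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Share of all assignments that went to each category.
    p_category = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    # Agreement expected by chance from the category shares.
    p_expected = sum(p * p for p in p_category)
    return (p_observed - p_expected) / (1 - p_expected)
```

With three profiles and binary labels, a row such as [2, 1] means two profiles judged the response sufficient and one did not.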
Table 9. Aggregate performance of fixed, heuristic, MCDM-based, and learned routing strategies.

| Strategy | Sufficient Responses | SR [%] | Wilson CI for SR [%] | Total Cost [USD] | ACP [USD] | CSR [USD] | Avg. Latency [s] | P95 Latency [s] | Cheap Share [%] | Strong Share [%] | Cost Reduction vs. Always-Strong [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Always-strong | 473 | 94.6 | 92.3–96.3 | 2.0602 | 0.00412 | 0.00436 | 9.19 | 15.24 | 0.0 | 100.0 | 0.0 |
| SAW margin + risk veto | 472 | 94.4 | 92.0–96.1 | 1.2899 | 0.00258 | 0.00273 | 24.03 | 34.22 | 62.0 | 38.0 | 37.4 |
| Keyword-risk | 467 | 93.4 | 90.9–95.3 | 1.6742 | 0.00335 | 0.00359 | 8.74 | 15.12 | 21.6 | 78.4 | 18.7 |
| TF-IDF centroid | 464 | 92.8 | 90.2–94.8 | 0.6011 | 0.00120 | 0.00130 | 6.71 | 12.12 | 74.8 | 25.2 | 70.8 |
| Multiplicative SAW | 463 | 92.6 | 90.0–94.6 | 1.1115 | 0.00222 | 0.00240 | 23.61 | 33.82 | 70.6 | 29.4 | 46.0 |
| TOPSIS | 463 | 92.6 | 90.0–94.6 | 1.1437 | 0.00229 | 0.00247 | 23.63 | 33.82 | 68.8 | 31.2 | 44.5 |
| SAW without margin | 463 | 92.6 | 90.0–94.6 | 1.1666 | 0.00233 | 0.00252 | 23.71 | 33.98 | 68.2 | 31.8 | 43.4 |
| Logistic regression | 458 | 91.6 | 88.8–93.7 | 0.5504 | 0.00110 | 0.00120 | 6.77 | 12.09 | 77.4 | 22.6 | 73.3 |
| Token-threshold | 454 | 90.8 | 87.9–93.0 | 0.8004 | 0.00160 | 0.00176 | 7.26 | 13.28 | 64.6 | 35.4 | 61.1 |
| Always-cheap | 434 | 86.8 | 83.6–89.5 | 0.0833 | 0.00017 | 0.00019 | 5.81 | 9.20 | 100.0 | 0.0 | 96.0 |

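The interval and cost columns of Table 9 can be recomputed from the counts and totals. The sketch below uses the standard Wilson score interval at the 95% level. It also reads ACP as total cost divided by the number of prompts and CSR as total cost divided by the number of sufficient responses; these readings are inferred from the table values rather than quoted from the article, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def cost_metrics(total_cost, n_prompts, n_sufficient, baseline_cost):
    """Cost per prompt, cost per sufficient response, and saving vs. baseline."""
    return {
        "ACP": total_cost / n_prompts,
        "CSR": total_cost / n_sufficient,
        "cost_reduction_pct": 100 * (1 - total_cost / baseline_cost),
    }
```

For the proposed strategy, `wilson_interval(472, 500)` gives roughly 92.0% to 96.1%, and `cost_metrics(1.2899, 500, 472, 2.0602)` gives an ACP of about USD 0.00258, a CSR of about USD 0.00273, and a cost reduction of about 37.4%, in line with the corresponding row.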
Table 10. Paired exact McNemar tests comparing the proposed routing strategy with alternative strategies.

| Comparison | Discordant Pairs | Exact McNemar p-Value | Interpretation |
|---|---|---|---|
| SAW margin + risk veto vs. always-cheap | 46/8 | <0.001 | Proposed strategy significantly higher sufficiency |
| SAW margin + risk veto vs. always-strong | 3/4 | 1.000 | No significant difference |
| SAW margin + risk veto vs. keyword-risk | 7/2 | 0.180 | No significant difference |
| SAW margin + risk veto vs. SAW without margin | 10/1 | 0.012 | Proposed strategy significantly higher sufficiency |
| SAW margin + risk veto vs. multiplicative SAW | 11/2 | 0.022 | Proposed strategy significantly higher sufficiency |
| SAW margin + risk veto vs. TOPSIS | 11/2 | 0.022 | Proposed strategy significantly higher sufficiency |
| SAW margin + risk veto vs. token-threshold | 23/5 | <0.001 | Proposed strategy significantly higher sufficiency |
| SAW margin + risk veto vs. TF-IDF centroid | 11/3 | 0.057 | Difference close to significance threshold |
| SAW margin + risk veto vs. logistic regression | 17/3 | 0.003 | Proposed strategy significantly higher sufficiency |

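The p-values in Table 10 follow from the discordant-pair counts alone. A minimal sketch of the two-sided exact McNemar test is given below: under the null hypothesis the discordant pairs split as Binomial(n, 0.5), and the p-value is twice the smaller tail, capped at 1. The function name is hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant-pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), computed with exact integers.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For instance, `mcnemar_exact(10, 1)` evaluates to 24/2048, or about 0.012, which matches the comparison with SAW without margin, and `mcnemar_exact(3, 4)` evaluates to 1.0, which matches the comparison with the always-strong strategy.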
Table 11. Robustness of routing results under alternative confidence-margin values.

| Margin ε | Sufficient Responses | SR [%] | Total Cost [USD] | ACP [USD] | Cheap Share [%] | Strong Share [%] | Avg. Latency [s] |
|---|---|---|---|---|---|---|---|
| 0.000 | 472 | 94.4 | 1.3308 | 0.00266 | 60.0 | 40.0 | 24.10 |
| 0.025 | 472 | 94.4 | 1.3064 | 0.00261 | 61.0 | 39.0 | 24.05 |
| 0.050 | 472 | 94.4 | 1.3001 | 0.00260 | 61.4 | 38.6 | 24.04 |
| 0.100 | 472 | 94.4 | 1.2899 | 0.00258 | 62.0 | 38.0 | 24.03 |
| 0.150 | 472 | 94.4 | 1.2744 | 0.00255 | 62.8 | 37.2 | 23.99 |
| 0.200 | 472 | 94.4 | 1.2672 | 0.00253 | 63.2 | 36.8 | 23.97 |
| 0.300 | 472 | 94.4 | 1.2466 | 0.00249 | 64.2 | 35.8 | 23.95 |
| 0.500 | 471 | 94.2 | 1.2124 | 0.00242 | 66.0 | 34.0 | 23.89 |

Table 12. Perturbation-based sensitivity analysis of AHP-derived criterion weights.

| Quantile | SR [%] | ACP [USD] | Cheap Share [%] | Strong Share [%] |
|---|---|---|---|---|
| 2.5th percentile | 94.4 | 0.00254 | 60.6 | 37.0 |
| Median | 94.4 | 0.00258 | 62.0 | 38.0 |
| 97.5th percentile | 94.4 | 0.00264 | 63.0 | 39.4 |

Table 13. Failure modes of enterprise LLM routing and mitigation mechanisms.

| Case Type | Definition | Expected Consequence | Mitigation in Framework |
|---|---|---|---|
| False cheap allocation | Prompt routed to cheaper model, although the stronger model would be safer or more sufficient | Potential quality loss or business-risk exposure | Increase risk-veto sensitivity or reduce the confidence margin |
| False strong allocation | Prompt routed to the stronger model, although the cheaper model would be sufficient | Unnecessary cost increase | Increase the margin threshold or strengthen cost/time criteria |
| Borderline routing | Very small SAW score gap between cheaper and stronger alternatives | Decision may be unstable under small score or weight changes | Monitor score-gap distribution and review the margin threshold |
| Risk-compensation case | High business risk offset by cost, time sensitivity, or standardization in additive SAW | Unsafe allocation under a purely compensatory rule | Use the risk-veto rule as a non-compensatory escalation mechanism |
| Standardized high-volume task | Routine prompt escalated because of an isolated high criterion score | Cost inefficiency at scale | Add template-level caching or fixed, cheap policy for known routine tasks |

Table 14. Comparison of alternative MCDM-based routing variants.

| Strategy | Sufficient Responses | SR [%] | Total Cost [USD] | ACP [USD] | CSR [USD] | Cheap Share [%] | Strong Share [%] | Cost Reduction vs. Always-Strong [%] |
|---|---|---|---|---|---|---|---|---|
| SAW margin + risk veto | 472 | 94.4 | 1.2899 | 0.00258 | 0.00273 | 62.0 | 38.0 | 37.4 |
| SAW without margin | 463 | 92.6 | 1.1666 | 0.00233 | 0.00252 | 68.2 | 31.8 | 43.4 |
| Multiplicative SAW | 463 | 92.6 | 1.1115 | 0.00222 | 0.00240 | 70.6 | 29.4 | 46.0 |
| TOPSIS | 463 | 92.6 | 1.1437 | 0.00229 | 0.00247 | 68.8 | 31.2 | 44.5 |

Table 15. Criterion ablation analysis of the SAW-based routing framework.

| Variant | Sufficient Responses | SR [%] | Total Cost [USD] | ACP [USD] | CSR [USD] | Cheap Share [%] | Strong Share [%] | Cost Reduction vs. Always-Strong [%] |
|---|---|---|---|---|---|---|---|---|
| SAW full | 472 | 94.4 | 1.2899 | 0.00258 | 0.00273 | 62.0 | 38.0 | 37.4 |
| Without C1 Accuracy | 472 | 94.4 | 1.3094 | 0.00262 | 0.00277 | 61.6 | 38.4 | 36.4 |
| Without C2 Business risk | 472 | 94.4 | 1.3176 | 0.00264 | 0.00279 | 61.0 | 39.0 | 36.0 |
| Without C3 Reasoning depth | 472 | 94.4 | 1.2737 | 0.00255 | 0.00270 | 62.8 | 37.2 | 38.2 |
| Without C4 Cost sensitivity | 472 | 94.4 | 1.3516 | 0.00270 | 0.00286 | 58.8 | 41.2 | 34.4 |
| Without C5 Time sensitivity | 470 | 94.0 | 1.2563 | 0.00251 | 0.00267 | 63.6 | 36.4 | 39.0 |
| Without C6 Standardization | 473 | 94.6 | 1.4541 | 0.00291 | 0.00307 | 54.2 | 45.8 | 29.4 |
| Without C7 Creativity | 472 | 94.4 | 1.3018 | 0.00260 | 0.00276 | 61.2 | 38.8 | 36.8 |
| Cost-only | 473 | 94.6 | 1.2861 | 0.00257 | 0.00272 | 62.4 | 37.6 | 37.6 |
| Cost–time–standardization only | 473 | 94.6 | 1.2891 | 0.00258 | 0.00273 | 62.4 | 37.6 | 37.4 |
| Risk-only | 471 | 94.2 | 1.4057 | 0.00281 | 0.00298 | 56.6 | 43.4 | 31.8 |
| Risk–quality-only | 472 | 94.4 | 1.6384 | 0.00328 | 0.00347 | 44.8 | 55.2 | 20.5 |

Table 16. Performance of SAW with confidence margin and risk veto by business function.

| Business Function | N | Sufficient Responses | SR [%] | ACP [USD] | GPT-4o-Mini Share [%] | GPT-5 Share [%] |
|---|---|---|---|---|---|---|
| Customer Support | 45 | 43 | 95.6 | 0.0030 | 55.6 | 44.4 |
| Finance | 52 | 50 | 96.2 | 0.0027 | 55.8 | 44.2 |
| HR | 52 | 48 | 92.3 | 0.0026 | 57.7 | 42.3 |
| IT/Security | 53 | 52 | 98.1 | 0.0027 | 66.0 | 34.0 |
| Legal/Compliance | 53 | 49 | 92.5 | 0.0026 | 64.2 | 35.8 |
| Marketing | 50 | 47 | 94.0 | 0.0024 | 66.0 | 34.0 |
| Operations | 50 | 47 | 94.0 | 0.0025 | 62.0 | 38.0 |
| Procurement | 47 | 44 | 93.6 | 0.0028 | 59.6 | 40.4 |
| Sales | 51 | 48 | 94.1 | 0.0022 | 68.6 | 31.4 |
| Strategy | 47 | 44 | 93.6 | 0.0024 | 63.8 | 36.2 |

Table 17. Performance of SAW with confidence margin and risk veto by task type.

| Task Type | N | Sufficient Responses | SR [%] | ACP [USD] | GPT-4o-Mini Share [%] | GPT-5 Share [%] |
|---|---|---|---|---|---|---|
| Business email | 52 | 51 | 98.1 | 0.0018 | 69.2 | 30.8 |
| Classification | 50 | 47 | 94.0 | 0.0023 | 62.0 | 38.0 |
| Creative generation | 56 | 50 | 89.3 | 0.0025 | 73.2 | 26.8 |
| Data interpretation | 58 | 56 | 96.6 | 0.0028 | 53.4 | 46.6 |
| Decision support | 56 | 52 | 92.9 | 0.0027 | 64.3 | 35.7 |
| Policy drafting | 40 | 37 | 92.5 | 0.0021 | 65.0 | 35.0 |
| Process improvement | 47 | 46 | 97.9 | 0.0034 | 55.3 | 44.7 |
| Report synthesis | 44 | 41 | 93.2 | 0.0022 | 68.2 | 31.8 |
| Risk assessment | 50 | 50 | 100.0 | 0.0035 | 48.0 | 52.0 |
| Summarization | 47 | 42 | 89.4 | 0.0022 | 61.7 | 38.3 |

Table 18. Performance comparison by prompt risk level.

| Risk Level | Strategy | N | Sufficient Responses | SR [%] | ACP [USD] | GPT-4o-Mini Share [%] | GPT-5 Share [%] |
|---|---|---|---|---|---|---|---|
| Low | Always-cheap | 167 | 167 | 100.0 | 0.0002 | 100.0 | 0.0 |
| Low | SAW margin + risk veto | 167 | 167 | 100.0 | 0.0011 | 94.6 | 5.4 |
| Low | Always-strong | 167 | 167 | 100.0 | 0.0039 | 0.0 | 100.0 |
| Medium | Always-cheap | 163 | 160 | 98.2 | 0.0002 | 100.0 | 0.0 |
| Medium | SAW margin + risk veto | 163 | 161 | 98.8 | 0.0015 | 85.3 | 14.7 |
| Medium | Always-strong | 163 | 162 | 99.4 | 0.0040 | 0.0 | 100.0 |
| High | Always-cheap | 170 | 107 | 62.9 | 0.0002 | 100.0 | 0.0 |
| High | SAW margin + risk veto | 170 | 144 | 84.7 | 0.0050 | 7.6 | 92.4 |
| High | Always-strong | 170 | 144 | 84.7 | 0.0044 | 0.0 | 100.0 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Nowak, M. A Multi-Criteria Decision Framework for Enterprise LLM Routing. Information 2026, 17, 539. https://doi.org/10.3390/info17060539

