1. Introduction
The dynamic development of large language models (LLMs) is transforming the way organizations carry out knowledge-based work, communication, information analysis, and decision-support processes. In enterprise environments, these models are increasingly used not only for content generation but also for document summarization, the preparation of managerial reports, customer service support, the drafting of business communication, the analysis of descriptive data, and the performance of tasks requiring synthesis and reasoning. Consequently, the practical problem is no longer limited to the question of whether organizations should use LLMs, but rather concerns how they can use them effectively, responsibly, and economically across different types of business tasks. In particular, enterprises face the need to determine when a less expensive and faster model is sufficient, and when the use of a more advanced, more costly model is justified because it may offer higher-quality responses.
The importance of this problem is growing alongside the expansion of the range of available models and the increasing differentiation of their parameters in terms of cost, response time, reasoning capability, and the quality of generated outputs. In business practice, users rarely operate under conditions in which only a single model is available. Much more often, they have access to a portfolio of models among which choices must be made depending on the nature of the query. This creates a need for prompt-routing mechanisms, understood as rules for directing queries to different models in a manner consistent with organizational priorities. These priorities are not exclusively technical in nature. They encompass trade-offs among response quality, processing cost, speed of operation, the level of error risk, the need for standardization, and the business consequences of an inappropriate response.
Existing approaches to the problem of LLM model selection and routing, however, remain dominated by a technical perspective. The primary focus is most often placed on quality benchmarks, computational efficiency, latency reduction, inference cost optimization, or improvements in system parameters. Although these aspects are important, they do not fully reflect the realities of managing the use of artificial intelligence in organizations. Enterprises do not assess model outputs solely through the prism of average technical quality, but also from the standpoint of their operational usefulness, acceptable level of risk, compliance with organizational requirements, and the economic justification of deployment. From a managerial perspective, the key question, therefore, becomes not so much which model is objectively the best, but rather which model is more appropriate for a specific task under a given profile of organizational preferences.
At this point, a research gap becomes apparent. Existing LLM routing research has developed several technically advanced approaches, including cascading, confidence-based routing, embedding-based similarity matching, learned routing policies, and cost-aware model selection. These methods are valuable for improving the cost–quality trade-off, but they usually treat routing as a predominantly technical optimization problem. Less attention has been paid to routing as an organizational decision-support problem in which the choice of model must be explainable, auditable, and aligned with managerial preferences concerning risk tolerance, response sufficiency, cost constraints, standardization requirements, and operational speed. In enterprise settings, the relevant question is therefore not only whether a router can maximize benchmark performance, but also whether its decisions can be justified in terms that are understandable to organizational stakeholders. This article addresses this gap by proposing an interpretable multicriteria routing framework that translates managerial criteria into computable prompt-level scores and model-selection rules. Under organizational conditions, the fundamental issue is not whether a model generates the best possible response in absolute terms, but whether the response is sufficient for practical application without incurring unjustified costs.
The novelty of this study should therefore be understood in relation to two adjacent but distinct streams of work. The first stream develops technical LLM routers, cascades, learned routing policies, and benchmark-oriented cost–quality optimization methods. These approaches are important because they show that not all prompts require the most capable model and that adaptive routing can reduce inference cost. However, they typically define the routing problem primarily in terms of predictive performance, expected quality, model confidence, or cost. The second stream concerns enterprise AI governance and managerial decision support, where decisions must be explainable, auditable, and aligned with organizational risk tolerance. The present study contributes to the intersection of these streams. It does not claim to outperform learned semantic routers in all benchmark settings. Instead, it proposes an interpretable multicriteria governance layer in which model-selection decisions are explicitly linked to organizational criteria such as business risk, required accuracy, reasoning depth, cost sensitivity, time sensitivity, standardization, and creativity.
The aim of this article is to propose and empirically evaluate a multicriteria decision-support framework for enterprise LLM routing. The framework combines AHP-based elicitation of organizational criterion weights with SAW-based prompt-level model selection. Its purpose is not to replace machine-learning-based semantic routers, but to provide an interpretable and auditable routing layer for enterprise contexts in which model choice must reflect explicit organizational preferences. The empirical evaluation examines whether such a router can approximate the sufficiency level of a stronger model while reducing token-level costs relative to an always-strong strategy and improving response sufficiency relative to an always-cheap strategy.
The contribution of the article is threefold. First, it reframes enterprise LLM routing as a multicriteria organizational decision problem rather than only a technical optimization problem. In this framing, routing is driven not only by expected response quality and cost, but also by explicit managerial preferences concerning risk, response-time sensitivity, standardization, and creativity. Second, it operationalizes these preferences as prompt-level routing criteria and combines expert-derived AHP weights with SAW-based model selection, including a confidence margin and a risk-veto mechanism that reduces the compensatory weakness of simple additive aggregation. Third, it provides an empirical proof-of-concept evaluation on a stratified business-prompt dataset and compares the proposed routing rule with fixed strategies and several heuristic or alternative multicriteria baselines. The contribution is therefore not a universal claim that AHP/SAW is superior to learned routers, but evidence that an auditable multicriteria layer can support cost-aware and managerially interpretable model allocation in enterprise settings.
The empirical part of the study is guided by three research questions. RQ1: Can an interpretable multicriteria router achieve response sufficiency close to the always-strong strategy while reducing token-level costs? RQ2: Does the proposed routing strategy outperform simple heuristic and alternative MCDM baselines in terms of the cost–sufficiency trade-off? RQ3: Are the routing decisions stable under changes in criterion weights, confidence-margin settings, and alternative aggregation methods?
The remainder of the article is structured as follows. The next section presents a review of the literature on the use of LLMs in organizations, model routing, and multicriteria decision support. This is followed by a presentation of the research methodology and the construction of the proposed decision model. The subsequent section discusses the results of the empirical study, while the conclusion offers a discussion of the findings, managerial implications, study limitations, and directions for future research.
2. Literature Review
Large language models (LLMs) are currently conceptualized in the literature as the next phase in the development of AI systems capable of performing a wide range of linguistic, analytical, and creative tasks, with their significance extending beyond classical NLP applications and increasingly encompassing organizational and managerial contexts [1,2]. Studies on the business value of AI emphasize that the impact of these technologies on firm performance does not arise solely from their technical parameters, but from their capacity to reconfigure processes, support decision-making, and enhance organizational efficiency [3,4,5]. More recent literature reviews also stress that generative AI is becoming an important component of business model innovation, value creation, and organizational transformation, while at the same time giving rise to challenges related to implementation, governance, and the assessment of usefulness in business practice [6,7,8].
The significance of this issue is reinforced by empirical studies showing that generative AI can improve the productivity of knowledge workers, although these effects vary markedly across tasks and user groups. Brynjolfsson, Li, and Raymond [9] demonstrated that the use of generative AI in customer service increases productivity by approximately 15% on average, whereas Noy and Zhang [10] identified significant gains in both productivity and quality in writing tasks. In turn, Dell’Acqua et al. [11] describe the phenomenon of the “jagged technological frontier,” indicating that AI may improve performance in some tasks while worsening it in others, even when those tasks appear superficially similar. It is precisely this unevenness of effects that is particularly important from the enterprise perspective: it suggests that not every prompt and not every task should be handled by the same model or by the same AI usage strategy [9].
In parallel, a stream of research has been developing that emphasizes the importance of governance, accountability, and the organizational embeddedness of AI systems. Papagiannidis, Mikalef, and Conboy [12] argue that responsible AI governance requires not only a set of general principles, but also procedural, relational, and structural practices that make it possible to operationalize oversight over the technology. Schneider, Abraham, Meske, and vom Brocke [13], in turn, conceptualize AI governance for businesses as a problem of governing data, models, and AI systems through explicit decisions about who governs what and how. This perspective is directly relevant to enterprise LLM routing because model-selection decisions also require clear responsibility, auditability, and governance mechanisms.
Vidgof, Bachhofner, and Mendling [14] were among the first to systematically describe the opportunities and challenges of using LLMs in business process management (BPM), indicating that the potential of these models spans many stages of the process lifecycle, while simultaneously requiring new rules of application. Subsequent studies have presented more specific applications: Bernardi et al. [15] proposed the BPLLM framework for process-aware decision support; Kourani et al. [16] developed an approach for generating process models from textual descriptions; Apaydin and Zisgen [17] investigated the use of local language models for process modeling; and Kourani et al. [18] introduced a benchmark and self-improvement analysis of models in process-modeling tasks. Kampik et al. [19], in turn, formulated a broader vision of “large process models,” in which LLMs are intended to support contextual modeling and the improvement of business processes. The common denominator of these studies is clear: the effectiveness of LLMs in enterprise applications is task-dependent, and model selection should be linked to the type of task, the required level of reliability, and the organizational context of use.
The literature closest to the problem addressed in this article concerns the routing and cascading of language models. Yue et al. [20], in work published at ICLR, developed an LLM cascade for cost-efficient reasoning that uses answer consistency from weaker models as a signal for deciding whether escalation to a stronger model is necessary. Chen, Zaharia, and Zou [21], in their FrugalGPT study published in TMLR, showed that cascaded or adaptive compositions of LLM APIs can improve the cost–performance trade-off. Šakota, Peyrard, and West [22] proposed FORC, a meta-model-based approach to cost-effective language model choice across multiple tasks. Ong et al. [23] introduced RouteLLM, a framework published at ICLR that learns to route between weaker and stronger LLMs using preference data. Song et al. [24] proposed IRT-Router, a multi-LLM routing approach published at ACL that models LLM abilities and query difficulty using Item Response Theory. Shah and Shridhar [25] proposed Select-then-Route, a taxonomy-guided approach that first narrows the model pool and then applies adaptive routing or cascading within that pool. Taken together, these studies demonstrate that LLM routing is an active and increasingly mature research area focused on cost, performance, confidence, interpretability, and latency trade-offs.
This literature is highly valuable, but it also reveals important limitations from the perspective of enterprise governance. Most routing approaches optimize primarily for technical or economic targets, such as benchmark accuracy, expected reward, inference cost, latency, model ability, or query difficulty. Organizationally meaningful criteria, including business risk, standardization requirements, auditability, and managerial acceptability, are usually not modeled explicitly. Moreover, learned, embedding-based, or preference-based routers may be effective, but their internal logic can be difficult for non-technical stakeholders to interpret and audit. These limitations do not invalidate technical routers. Rather, they indicate that enterprise environments may require an additional decision-support layer that translates organizational preferences into transparent routing rules. The present study, therefore, does not claim that linear multicriteria aggregation is theoretically superior to learned or semantic routers, but investigates whether an explicitly parameterized multicriteria layer can provide a transparent and empirically competitive routing rule when managerial preferences must be visible in the decision process.
This is the point at which multicriteria decision-support methods become relevant. Their role in this study is not to replace semantic similarity, preference learning, or confidence estimation. Rather, they provide a formal mechanism for making organizational priorities explicit. In enterprise AI governance, this explicitness is important because routing decisions may need to be justified after the fact: why a prompt was escalated to a stronger model, why a routine task was assigned to a cheaper model, or why business risk overrode cost sensitivity. AHP and SAW are therefore used here because of their transparency, auditability, and compatibility with managerial decision-making, not because they are assumed to be universally more accurate than learned routing policies.
Multicriteria decision-support methods are particularly useful in this setting [26]. The classic works of Saaty [27,28] and Hwang and Yoon [29] laid the foundations for the AHP and SAW methods, which make it possible to translate decision-makers’ preferences into a formal rule for the evaluation and selection of alternatives. Later studies confirm that SAW remains a transparent, interpretable, and convenient method in situations requiring the aggregation of multiple criteria, whereas AHP is a useful tool for determining criterion weights on the basis of expert judgments [30,31]. Importantly, MCDM methods have already been applied to the selection of AI-based systems, including chatbots for customer service, demonstrating that decisions concerning the choice of AI tools can be effectively formalized as multicriteria problems [32]. Related management research also shows that formal decision models can integrate measurable and qualitative factors through expert knowledge and explicit weighting procedures under uncertainty [33]. However, the available literature still lacks convincing attempts to apply AHP and SAW to prompt routing between LLMs in enterprise environments while taking into account criteria of managerial significance rather than exclusively technical ones.
The use of AHP and SAW in the present study is therefore motivated by transparency and organizational fit rather than by an assumption of universal predictive superiority. AHP provides a structured procedure for eliciting and documenting stakeholder preferences, while SAW offers a simple and auditable aggregation rule that can be inspected by non-technical decision-makers. At the same time, the linear and additive nature of SAW imposes important assumptions, including preferential independence among criteria and the absence of strong threshold or veto effects. These assumptions are particularly relevant in LLM routing because criteria such as business risk, response-time sensitivity, standardization, and creativity may interact. For this reason, the empirical part of the study treats SAW as a baseline multicriteria routing mechanism and evaluates its robustness through sensitivity analysis, alternative aggregation methods, a confidence-margin rule, and a risk-veto extension.
In summary, the literature is currently developing along three related but still weakly integrated streams: research on the use of LLMs and GenAI in organizations, research on model routing and cascading, and research on governance and multicriteria decision support. The first stream demonstrates the growing importance of LLMs for productivity, knowledge creation, and business transformation; the second provides techniques for improving the cost–quality trade-off; and the third offers tools for formalizing organizational preferences. However, what remains insufficiently developed is an approach that would integrate these three perspectives and conceptualize LLM routing as a managerial problem, in which the decision regarding model selection depends on cost, quality, risk, response time, and the requirement for standardization, while the effectiveness of the solution is assessed through the lens of the business adequacy of the response. It is precisely this gap that the present article seeks to address.
3. Materials and Methods
The study follows a design-and-evaluation approach. Its objective is to develop and empirically evaluate an interpretable routing mechanism that supports the selection of an LLM in an enterprise environment under explicit organizational preferences. The proposed mechanism assumes that the choice between a cheaper and a stronger model should depend not only on expected response quality and cost, but also on the business risk of an error, required reasoning depth, response-time sensitivity, standardization requirements, and the need for creativity.
The empirical procedure consisted of seven stages: identification of managerial routing criteria, elicitation of criterion weights using AHP, construction of a structured business-prompt dataset, prompt-level scoring by a lightweight LLM-based router, SAW-based routing with confidence-margin and risk-veto extensions, generation of responses by the reference models, and evaluation of routing strategies using response sufficiency, token-level cost, latency, robustness, and statistical-comparison metrics.
From a managerial perspective, the study does not ask which model is objectively best in all circumstances. Instead, it asks whether an auditable routing rule can allocate prompts between model tiers in a way that approximates the sufficiency level of the stronger model while reducing token-level cost and preserving alignment with organizational decision criteria.
Stage 1. Identification of managerial decision criteria for LLM model selection
For prompt evaluation, the following set of decision criteria was used:
C1—required substantive accuracy (describes how high the correctness and precision of the response must be for the outcome to be business-useful; the greater the required accuracy, the greater the need to use a more advanced model);
C2—risk of the business consequences of error (describes the potential effects of generating an incomplete, misleading, or incorrect response; an error in a draft marketing post has different significance than an error in a compliance analysis, HR policy, communication with a strategic client, or the interpretation of a document);
C3—required depth of reasoning (describes whether the task requires simple information processing or multi-step reasoning, synthesis, and logical analysis; the greater the depth of reasoning required, the greater the likelihood that the organization will prefer a more powerful model);
C4—sensitivity to processing cost (describes how important cost savings are from the company’s perspective for a given type of query; not every task requires maximizing quality at any cost, and in many large-scale processes unit cost is the priority);
C5—task sensitivity to response time (describes how important it is to obtain the result quickly; in operational, contact-intensive, or high-volume tasks, speed may be just as important as quality);
C6—required standardization and compliance of the response (describes whether the response must strictly conform to the adopted style, structure, company policy, or communication standard; in organizations, a large part of AI’s value derives not from creativity, but from repeatability, consistency, and the scalability of communication);
C7—required creativity/openness of generation (describes the extent to which the task requires a creative, non-standard, or exploratory approach).
This set of criteria combines four managerial logics: cost efficiency, risk control, quality of the decision-making process, and operational effectiveness.
Stage 2. Determination of criterion weights by organizational experts
Criterion weights were elicited from a three-person expert panel representing complementary organizational perspectives: senior management, operational management, and AI implementation. The panel included the Chief Executive Officer, an operations manager, and an AI implementation manager from the studied enterprise. The experts were selected because the routing decision involves both strategic cost–risk trade-offs and operational considerations related to response usefulness, time sensitivity, standardization, and AI adoption (Table 1). Before completing the pairwise comparisons, the experts received a short instruction on the Saaty nine-point scale and on the interpretation of each routing criterion. The purpose of this step was to reduce ambiguity in the comparison task and to ensure that criteria such as business risk, cost sensitivity, and standardization were interpreted consistently.
To translate organizational preferences into a formal decision model, the Analytic Hierarchy Process (AHP) method was applied. This method makes it possible to determine criterion weights on the basis of pairwise comparisons made by experts. In the first step, each expert compares every criterion with every other criterion by answering the question of which of them is more important from the organization’s perspective and to what extent. The comparisons were conducted using the classical nine-point Saaty scale, in which 1 denotes equal importance of both criteria, 3 a moderate preference of one criterion over the other, 5 a strong preference, 7 a very strong preference, and 9 an extreme preference; the values 2, 4, 6, and 8 represent intermediate judgments. For the $k$-th expert, a pairwise comparison matrix is constructed:
$$A^{(k)} = \left[a_{ij}^{(k)}\right]_{n \times n}$$
where $a_{ij}^{(k)}$ denotes the relative importance of criterion $C_i$ with respect to criterion $C_j$, subject to the following conditions:
$$a_{ii}^{(k)} = 1, \qquad a_{ji}^{(k)} = \frac{1}{a_{ij}^{(k)}}, \qquad a_{ij}^{(k)} > 0$$
Because the assessments were provided by several experts, it was necessary to aggregate them into a single group matrix. For this purpose, the geometric mean of the experts’ judgments was applied, which is the standard solution in the group version of AHP. For each matrix element, the following was adopted:
$$a_{ij}^{G} = \left(\prod_{k=1}^{K} a_{ij}^{(k)}\right)^{1/K}$$
where $K$ denotes the number of experts. In the present study, $K = 3$ was adopted. As a result, a group comparison matrix was obtained:
$$A^{G} = \left[a_{ij}^{G}\right]_{n \times n}$$
Based on matrix $A^{G}$, the vector of criterion weights was determined. In computational practice, the method of the normalized geometric mean of rows was applied. First, for each criterion, the geometric mean of the judgments in the row was calculated:
$$g_i = \left(\prod_{j=1}^{n} a_{ij}^{G}\right)^{1/n}$$
and the obtained values were then normalized, yielding the weight of each criterion:
$$w_i = \frac{g_i}{\sum_{l=1}^{n} g_l}$$
The weight vector may therefore be written as:
$$w = (w_1, w_2, \dots, w_n)$$
subject to the normalization condition:
$$\sum_{i=1}^{n} w_i = 1, \qquad w_i > 0$$
The obtained weights reflect the relative importance of the individual criteria from the organization’s perspective. The higher the value of $w_i$, the greater the influence of the given criterion on the subsequent decision regarding the selection of the LLM. An important element of the AHP method is also the assessment of the consistency of expert judgments. For this purpose, the maximum eigenvalue of the comparison matrix was calculated:
$$\lambda_{\max} = \frac{1}{n} \sum_{i=1}^{n} \frac{(A^{G} w)_i}{w_i}$$
On this basis, the consistency index was determined:
$$CI = \frac{\lambda_{\max} - n}{n - 1}$$
followed by the consistency ratio:
$$CR = \frac{CI}{RI}$$
where RI denotes the Random Index, which depends on the number of criteria analyzed. For $n = 7$, the value $RI = 1.32$ is typically adopted. In the literature, pairwise comparisons are considered sufficiently consistent when $CR \leq 0.10$. If the value of $CR$ exceeds the threshold of 0.10, this indicates that the expert judgments are characterized by excessive inconsistency and should be re-examined. In addition to the consistency assessment, the stability of the weight vector was examined through a perturbation-based sensitivity analysis. The analysis tested whether moderate changes in the AHP-derived criterion weights altered the routing decisions or materially changed the cost–sufficiency trade-off. This step was introduced because AHP weights may be sensitive to expert judgments and because different organizational stakeholders may legitimately assign different priorities to cost, risk, quality, and response-time criteria. The final empirical evaluation therefore reports not only the aggregated weight vector, but also the robustness of routing results under weight perturbations. In the present study, the resulting weight vector was subsequently used in the next stage of the procedure, namely in the SAW method employed for routing prompts between the less expensive model and the more capable model.
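The weight-elicitation steps described above (geometric-mean aggregation of expert matrices, normalized row geometric means, and the consistency ratio) can be illustrated with a minimal Python sketch. The function names and the three 3×3 expert matrices are hypothetical and chosen only to keep the example short; they are not the study's seven-criterion judgments. The Random Index table holds the commonly cited Saaty values, and the maximum eigenvalue is approximated as the mean of the ratios between the entries of the matrix-weight product and the weights, which is one common approximation and an assumption here.

```python
import math

# Commonly cited Saaty Random Index values for matrix sizes 3..9.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def aggregate_experts(matrices):
    """Element-wise geometric mean of the experts' pairwise comparison matrices."""
    k = len(matrices)
    n = len(matrices[0])
    return [
        [math.prod(m[i][j] for m in matrices) ** (1.0 / k) for j in range(n)]
        for i in range(n)
    ]


def row_geometric_mean_weights(matrix):
    """Normalized geometric mean of rows; the returned weights sum to 1."""
    n = len(matrix)
    row_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(row_means)
    return [g / total for g in row_means]


def consistency_ratio(matrix, weights):
    """Approximate lambda_max as the mean of (A w)_i / w_i, then derive CI and CR."""
    n = len(matrix)
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]


# Hypothetical judgments of three experts on three criteria (illustration only).
expert_matrices = [
    [[1, 3, 5], [1 / 3, 1, 2], [1 / 5, 1 / 2, 1]],
    [[1, 2, 4], [1 / 2, 1, 2], [1 / 4, 1 / 2, 1]],
    [[1, 3, 6], [1 / 3, 1, 2], [1 / 6, 1 / 2, 1]],
]

group_matrix = aggregate_experts(expert_matrices)
weights = row_geometric_mean_weights(group_matrix)
cr = consistency_ratio(group_matrix, weights)

print("weights:", [round(w, 3) for w in weights])
print("consistent enough:", cr <= 0.10)
```

With seven criteria the same functions apply unchanged; only the input matrices grow to 7×7 and the Random Index lookup uses the entry for that size.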
Stage 3. Development of the research prompt dataset
The empirical dataset consisted of 500 business prompts designed to represent heterogeneous enterprise use cases. The prompts were structured across four descriptive dimensions: business function, task type, risk level, and industry context. This design was adopted to reduce the risk that the evaluation would reflect only a narrow class of short, low-risk, or highly standardized queries. The dataset covered ten business functions: Legal/Compliance, IT/Security, Finance, HR, Sales, Marketing, Operations, Procurement, Strategy, and Customer Support. It also covered ten task types: data interpretation, creative generation, decision support, business email drafting, risk assessment, classification, process improvement, summarization, report synthesis, and policy drafting. The prompts were additionally assigned to three risk levels: low, medium, and high. The prompt set was intentionally balanced to include both routine tasks for which a cheaper model is likely to be sufficient and more complex or risk-sensitive tasks likely to require escalation to a stronger model. This structure made it possible to evaluate not only the aggregate performance of routing strategies, but also their behavior across different business functions, task categories, and risk levels (Table 2).
The prompt dataset was constructed using a stratified template-based procedure. Each prompt was generated from a structured specification containing four metadata fields: business function, task type, risk level, and industry context. The purpose of this procedure was to avoid an evaluation dominated by a single class of prompts, such as short, low-risk summaries or routine emails. The prompt specifications were distributed across business functions, task categories, and risk levels so that the router would be tested on both routine and escalation-worthy cases. The dataset did not contain confidential company data, personal data, or real customer records. Instead, it used synthetic but business-plausible scenarios representing typical enterprise tasks such as summarization, policy drafting, decision support, risk assessment, data interpretation, customer communication, and process improvement.
The construction process followed four steps. First, the set of business functions, task types, and risk levels was defined. Second, synthetic business scenarios and output constraints were assigned to these categories. Third, prompts were generated so that each prompt was self-contained and could be answered without access to external organizational documents. Fourth, the resulting dataset was checked for basic completeness: each prompt had to specify a task, a business context, and an expected output form. This procedure improves reproducibility because the dataset can be reconstructed from explicit metadata and prompt-construction rules, while also limiting the risk that results depend on idiosyncratic real-company documents.
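The stratified, template-based construction described above can be sketched in Python as follows. The business functions, task types, and risk levels are taken from the dataset description; everything else is an assumption made for illustration: the industry contexts, the allocation of five prompts per function and task-type cell, the cycling of risk levels, and the prompt template are hypothetical and do not reproduce the study's actual scenarios or wording.

```python
from collections import Counter
from itertools import cycle, product

# Category lists taken from the dataset description in the text.
BUSINESS_FUNCTIONS = [
    "Legal/Compliance", "IT/Security", "Finance", "HR", "Sales",
    "Marketing", "Operations", "Procurement", "Strategy", "Customer Support",
]
TASK_TYPES = [
    "data interpretation", "creative generation", "decision support",
    "business email drafting", "risk assessment", "classification",
    "process improvement", "summarization", "report synthesis", "policy drafting",
]
RISK_LEVELS = ["low", "medium", "high"]
# Hypothetical industry contexts; the text does not list the actual ones.
INDUSTRIES = ["retail", "manufacturing", "logistics", "banking"]

# Assumption: 10 functions x 10 task types x 5 prompts per cell = 500 prompts.
PROMPTS_PER_CELL = 5


def build_specifications():
    """Return one metadata record per prompt, stratified by function and task type."""
    risks = cycle(RISK_LEVELS)
    industries = cycle(INDUSTRIES)
    specs = []
    for function, task in product(BUSINESS_FUNCTIONS, TASK_TYPES):
        for _ in range(PROMPTS_PER_CELL):
            specs.append({
                "business_function": function,
                "task_type": task,
                "risk_level": next(risks),
                "industry_context": next(industries),
            })
    return specs


def render_prompt(spec):
    """Fill a hypothetical template stating the task, the context, and the output form."""
    return (
        f"You support the {spec['business_function']} function of a "
        f"{spec['industry_context']} company. Task type: {spec['task_type']} "
        f"(risk level: {spec['risk_level']}). Use only the information given here "
        f"and return the result as a short structured memo."
    )


def is_complete(spec):
    """Basic completeness check: every metadata field is present and non-empty."""
    required = ("business_function", "task_type", "risk_level", "industry_context")
    return all(spec.get(field) for field in required)


specifications = build_specifications()
prompts = [render_prompt(spec) for spec in specifications]
function_counts = Counter(spec["business_function"] for spec in specifications)
risk_counts = Counter(spec["risk_level"] for spec in specifications)

print(len(prompts), "prompts;", dict(risk_counts))
```

Because every prompt is derived from an explicit metadata record, the same records can later be used to break down routing results by business function, task type, and risk level.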
Stage 4. Prompt-level scoring by the routing model
Each prompt was evaluated before response generation by a lightweight LLM-based routing model. In the empirical implementation, GPT-5-nano was used as the prompt-scoring router. This model was separate from the cheaper response-generation model, GPT-4o-mini, and from the stronger response-generation model, GPT-5. This separation was introduced to reduce the circularity risk that would arise if the cheaper response model were also responsible for estimating whether its own capabilities were sufficient.
The router did not evaluate the generated answers. Instead, it evaluated the input prompt before model selection and returned a structured JSON object containing scores for seven criteria: required accuracy, business risk, reasoning depth, cost sensitivity, response-time sensitivity, standardization, and creativity. Each criterion was scored on a five-point ordinal scale, where higher values indicated a stronger presence of the corresponding property. These scores constituted a computable prompt-level vector used by the routing rule. The following scale was used:
C1—Accuracy
To what extent does the required accuracy of the response exceed the typical capabilities of the less expensive model?
1—The less expensive model will probably be fully sufficient.
2—The less expensive model should be sufficient with only minor risk.
3—Borderline task, with no clear advantage.
4—Higher accuracy clearly favors the more expensive model.
5—The required accuracy definitely justifies the more expensive model.
C2—Business_Risk
To what extent does the potential error justify escalation to the more expensive model?
1—A possible error has little business significance.
2—An error would be undesirable, but not critical.
3—An error would have moderate significance.
4—An error could have serious consequences.
5—An error definitely justifies cautious escalation.
C3—Reasoning_Depth
To what extent does the task require reasoning beyond the typical capabilities of the less expensive model?
1—A simple, routine, or template-based task.
2—Minor analysis or organization of information.
3—Moderate reasoning.
4—Complex, multi-step reasoning.
5—Deep reasoning definitely justifies the more expensive model.
C4—Cost_Sensitivity
How important is it to complete the task at the lowest possible cost?
1—Cost has little significance.
2—Cost has minor significance.
3—Cost has moderate significance.
4—Cost is important and favors the less expensive model.
5—Cost is very important and strongly favors the less expensive model.
C5—Time_Sensitivity
How important is it to obtain a rapid response?
1—Response time has little significance.
2—Time has minor significance.
3—Time has moderate significance.
4—A rapid response is important.
5—Response speed strongly favors the less expensive model.
C6—Standardization
To what extent is the task template-based, predictable, and grounded in a standard response structure?
1—The response requires a non-standard approach.
2—The response is rather non-standard.
3—The response is partially standardized.
4—The response is highly template-based.
5—The response is strongly standardized and predictable.
C7—Creativity
To what extent does the task require a creative, conceptual, or non-standard approach?
1—Creativity is not needed.
2—A small degree of creativity may help.
3—Moderate creativity.
4—High creativity is useful.
5—The task clearly requires a more creative model.
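In this way, a score vector is obtained for each prompt $p_i$: $x_i = (x_{i1}, x_{i2}, \dots, x_{i7})$, where $x_{ij} \in \{1, \dots, 5\}$ is the score assigned to criterion $C_j$.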
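The conversion of the router's structured output into a score vector can be illustrated with a small validation step. This is a minimal sketch only: the JSON key names are assumptions (this section names the seven criteria but not the schema), and `parse_router_scores` is a hypothetical helper rather than part of the experimental script.

```python
import json

# Assumed key names, ordered C1..C7; the actual router protocol is not reproduced here.
CRITERIA = (
    "accuracy", "business_risk", "reasoning_depth", "cost_sensitivity",
    "time_sensitivity", "standardization", "creativity",
)

def parse_router_scores(raw_json: str) -> list[int]:
    """Validate the router's JSON reply and return the scores in C1..C7 order."""
    payload = json.loads(raw_json)
    scores = []
    for key in CRITERIA:
        if key not in payload:
            raise ValueError(f"missing criterion: {key}")
        value = payload[key]
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {value!r}")
        scores.append(value)
    return scores

# Made-up router reply used only to exercise the parser.
example_reply = json.dumps({
    "accuracy": 4, "business_risk": 5, "reasoning_depth": 3, "cost_sensitivity": 2,
    "time_sensitivity": 2, "standardization": 1, "creativity": 2,
})
print(parse_router_scores(example_reply))
```

Rejecting out-of-range or missing values at this point keeps malformed router replies from silently entering the aggregation step.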
Because prompt scoring is itself an inference step, the router’s computational overhead was explicitly recorded. For each prompt-scoring call, the experiment stored input tokens, output tokens, router cost, and router latency. These values were later included in the cost and latency evaluation of routing strategies. As a result, the reported cost advantage of the routing framework is not based on a zero-cost router assumption, but on the actual token-level cost of both prompt scoring and response generation. The full prompt-scoring router protocol is provided in Appendix A.
Stage 5. Routing of prompts to one of the analyzed LLMs using the SAW method
To assign each prompt to either the less expensive model or the more capable model, the Simple Additive Weighting (SAW) method was applied. This method consists of calculating the total weighted score of each decision alternative and then selecting the alternative with the highest final value. In the present study, the alternatives are two LLMs: the less expensive model and the more capable model. For each prompt $p_i$, the score vector obtained in Stage 4 is known:
$$x_i = (x_{i1}, x_{i2}, \dots, x_{i7}),$$
as well as the criterion weight vector determined using the AHP method in Stage 2:
$$w = (w_1, w_2, \dots, w_7).$$
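It was assumed that high values of criteria C1, C2, C3, and C7 favor the selection of the more capable model, whereas high values of criteria C4, C5, and C6 favor the selection of the less expensive model. To maintain a uniform evaluation logic, two synthetic scores were calculated for each prompt: one for the less expensive model and one for the more capable model. With $x_{ij}$ denoting the score of prompt $p_i$ on criterion $C_j$ and $w_j$ the corresponding AHP weight, the score for the more capable model was determined as:
$$S_i^{\text{strong}} = \sum_{j \in \{1,2,3,7\}} w_j x_{ij} + \sum_{j \in \{4,5,6\}} w_j (6 - x_{ij}),$$
whereas the score for the less expensive model was determined as:
$$S_i^{\text{cheap}} = \sum_{j \in \{4,5,6\}} w_j x_{ij} + \sum_{j \in \{1,2,3,7\}} w_j (6 - x_{ij}).$$
Such a construction means that criteria favoring a given model increase its score, whereas criteria supporting the alternative model are reversed on the five-point scale according to the transformation $x'_{ij} = 6 - x_{ij}$. As a result, both alternatives can be evaluated within the same aggregation logic. The final decision rule is as follows:
$$M(p_i) = \begin{cases} \text{more capable model}, & \text{if } S_i^{\text{strong}} > S_i^{\text{cheap}}, \\ \text{less expensive model}, & \text{otherwise.} \end{cases}$$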
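The two-score construction described above can be sketched in a few lines. The sketch assumes that reversal on the five-point scale is computed as 6 minus the score, and the weights are hypothetical placeholders, not the AHP weights from Stage 2; the function names are illustrative.

```python
# Zero-based positions of the criteria in the score vector (C1..C7).
FAVORS_STRONG = (0, 1, 2, 6)   # C1, C2, C3, C7
FAVORS_CHEAP = (3, 4, 5)       # C4, C5, C6

def reverse(score: int, scale_max: int = 5) -> int:
    """Reverse a score on a 1..scale_max scale (assumed transformation: 6 - x)."""
    return scale_max + 1 - score

def saw_scores(x: list[int], w: list[float]) -> tuple[float, float]:
    """Return (strong_score, cheap_score) for one prompt."""
    strong = (sum(w[j] * x[j] for j in FAVORS_STRONG)
              + sum(w[j] * reverse(x[j]) for j in FAVORS_CHEAP))
    cheap = (sum(w[j] * x[j] for j in FAVORS_CHEAP)
             + sum(w[j] * reverse(x[j]) for j in FAVORS_STRONG))
    return strong, cheap

def route_basic(x: list[int], w: list[float]) -> str:
    """Basic SAW rule: pick the stronger model only if its score is strictly higher."""
    strong, cheap = saw_scores(x, w)
    return "strong" if strong > cheap else "cheap"

# Hypothetical weights for illustration only (they sum to 1).
weights = [0.25, 0.20, 0.15, 0.15, 0.10, 0.05, 0.10]
scores = [4, 5, 3, 2, 2, 1, 2]
print(saw_scores(scores, weights), route_basic(scores, weights))
```

Under this reversal and weights that sum to 1, the two scores always add up to 6, so the decision depends only on which side of 3 the strong-model score falls.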
Additionally, the difference between the scores of the two models was calculated as:
$$\Delta_i = S_i^{\text{strong}} - S_i^{\text{cheap}},$$
where $S_i^{\text{strong}}$ and $S_i^{\text{cheap}}$ are the SAW scores of the stronger and the cheaper model for prompt $p_i$. Positive values indicate that the prompt is more strongly associated with the stronger model, whereas negative values indicate that the cheaper model is favored. In the basic SAW variant, the prompt would be routed to the stronger model when $\Delta_i > 0$ and to the cheaper model otherwise. However, because near-zero score gaps may indicate ambiguous routing cases, the final routing variant introduced a confidence margin $\delta$. In the empirical implementation, a single fixed value of $\delta$ was used. If the advantage of the stronger model did not exceed this margin, the prompt was routed to the cheaper model. This rule reflects a conservative cost-aware assumption: escalation is justified only when the stronger model has a sufficiently clear advantage.
The following margin-based decision rule was applied:
$$M_\delta(p_i) = \begin{cases} \text{stronger model}, & \text{if } \Delta_i > \delta, \\ \text{cheaper model}, & \text{otherwise.} \end{cases}$$
In addition, a risk-veto extension was introduced for prompts identified before routing as high-risk or risk-sensitive cases. The risk-veto flag was based on the prompt’s ex ante risk annotation and the router’s business-risk assessment. If a prompt was marked as a risk-veto candidate, the routing mechanism escalated it to the stronger model unless the case was already clearly assigned to the stronger model by the SAW score. This extension addresses the limitation of purely additive aggregation, in which high business risk could otherwise be compensated by cost or time sensitivity. The final routing strategy evaluated in the study is therefore referred to as SAW with confidence margin and risk veto.
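The combined rule (confidence margin plus risk veto) can be sketched as follows. This is an illustrative reading of the text above, not the study's script: the risk-veto flag is passed in as a boolean because its exact derivation is not specified in this section, and the margin value is an arbitrary placeholder.

```python
def route_with_margin_and_veto(gap: float, margin: float, risk_veto: bool) -> str:
    """Route one prompt from its score gap (strong score minus cheap score)."""
    if gap > margin:
        return "strong"   # clear SAW advantage for the stronger model
    if risk_veto:
        return "strong"   # escalation overrides the additive score
    return "cheap"        # ambiguous or cheap-favoring cases stay with the cheaper model

# Arbitrary placeholder margin, not the value used in the study.
EXAMPLE_MARGIN = 0.5

for gap, veto in [(1.8, False), (0.3, False), (0.3, True), (-1.2, True)]:
    print(gap, veto, route_with_margin_and_veto(gap, EXAMPLE_MARGIN, veto))
```

Checking the veto after the margin test mirrors the wording of the rule: the veto only changes the outcome for prompts that the SAW score would otherwise send to the cheaper model.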
Stage 6. Response generation and token-level cost measurement
After prompt scoring and routing-score calculation, responses were generated for each prompt using two reference response models: GPT-4o-mini as the cheaper model and GPT-5 as the stronger model. The prompt-scoring router was GPT-5-nano and was used only before response generation. Thus, the routing model, the cheaper response model, and the stronger response model were treated as three separate components of the pipeline.
Table 3 summarizes the model configuration used in the empirical evaluation, including the role of each model in the experimental pipeline, the main inference settings, and the cost basis used for token-level cost calculation.
Cost values correspond to the token-pricing assumptions used in the experimental script. Routing-strategy costs include the prompt-scoring router cost and selected response-model generation cost. The sufficiency evaluator was used only for offline evaluation and was not included in the routing-strategy cost.
For each of the 500 prompts, responses were generated by both reference models. This made it possible to construct and compare several routing strategies on the same prompt set: always-cheap, always-strong, SAW without margin, SAW with confidence margin and risk veto, and additional heuristic or alternative multicriteria baselines.
Costs were calculated at the token level rather than using fixed average per-prompt assumptions. For each response-generation call, the experiment recorded input tokens, output tokens, response cost, and response latency. For routing-based strategies, the total cost of handling a prompt included both the router cost and the cost of the selected response model:
$$C_i^{\text{total}} = C_i^{\text{router}} + C_i^{\text{response}}.$$
For fixed strategies, the total cost consisted only of the response-generation cost of the selected model:
$$C_i^{\text{total}} = C_i^{\text{response}}.$$
This design ensured that the cost advantage of routing was evaluated under a conservative assumption in which the router was not treated as free. Latency was measured analogously: for routing strategies, the reported latency included the router latency and the latency of the selected response model, whereas for fixed strategies, it included only response-generation latency.
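The token-level accounting described in this stage can be sketched as follows. The prices and token counts are placeholders chosen for illustration, not the pricing assumptions of the experimental script, and the names `CallRecord`, `call_cost`, `routed_totals`, and `fixed_totals` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CallRecord:
    input_tokens: int
    output_tokens: int
    latency_s: float

def call_cost(call: CallRecord, usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Token-level cost of one call, given prices per million tokens."""
    return (call.input_tokens * usd_per_m_input
            + call.output_tokens * usd_per_m_output) / 1_000_000

def routed_totals(router: CallRecord, response: CallRecord,
                  router_prices: tuple[float, float],
                  response_prices: tuple[float, float]) -> tuple[float, float]:
    """Cost and latency for a routing strategy: router plus selected response model."""
    cost = call_cost(router, *router_prices) + call_cost(response, *response_prices)
    return cost, router.latency_s + response.latency_s

def fixed_totals(response: CallRecord,
                 response_prices: tuple[float, float]) -> tuple[float, float]:
    """Cost and latency for a fixed strategy: response generation only."""
    return call_cost(response, *response_prices), response.latency_s

# Placeholder prices (USD per million input, output tokens) and made-up call records.
ROUTER_PRICES = (0.10, 0.40)
RESPONSE_PRICES = (1.00, 4.00)
router_call = CallRecord(input_tokens=600, output_tokens=80, latency_s=0.9)
response_call = CallRecord(input_tokens=400, output_tokens=700, latency_s=6.5)

print(routed_totals(router_call, response_call, ROUTER_PRICES, RESPONSE_PRICES))
print(fixed_totals(response_call, RESPONSE_PRICES))
```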
Stage 7. Evaluation of routing strategies
The final stage evaluated whether the proposed routing mechanism provided a favorable cost–sufficiency trade-off relative to fixed model-selection strategies and alternative routing baselines. The evaluation was conducted at the prompt level using the same set of 500 business prompts. For each prompt, the analysis compared the response that would be obtained under each routing strategy.
Response sufficiency was evaluated using a structured LLM-as-a-judge protocol with three independent evaluator profiles. The evaluators assessed whether a response was acceptable for use in organizational conditions without significant substantive revision. To reduce model-label bias, responses were evaluated using anonymized model labels rather than the actual model names. For each response, the evaluator profiles returned structured quality scores referring to instruction following, completeness, specificity, business usability, risk and safety, and style or format adequacy. They also indicated whether the response contained a major issue requiring substantive revision. Individual binary sufficiency labels were then derived deterministically from these scores using the risk-adjusted thresholds reported in Appendix B. The final response-level sufficiency classification was determined by majority vote. A response was classified as sufficient if at least two of the three evaluator profiles classified it as sufficient. Agreement among the three automated evaluator profiles was calculated to assess the consistency of sufficiency judgments. Fleiss’ kappa was reported for all evaluated responses and separately for the cheaper and stronger response models.
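The majority vote and the agreement statistic can be illustrated with a short sketch. It implements the standard Fleiss' kappa formula for binary labels with a constant number of raters per item; the example labels are made up and the function names are hypothetical.

```python
def majority_sufficient(labels: list[bool]) -> bool:
    """A response is sufficient if more than half of the evaluator profiles say so."""
    return sum(labels) * 2 > len(labels)

def fleiss_kappa(ratings: list[list[bool]]) -> float:
    """Fleiss' kappa for binary labels with the same number of raters per item."""
    n_items, n_raters = len(ratings), len(ratings[0])
    # Per-item counts for the two categories: (sufficient, insufficient).
    counts = [(sum(r), n_raters - sum(r)) for r in ratings]
    # Observed agreement, averaged over items.
    p_bar = sum(
        (pos * pos + neg * neg - n_raters) / (n_raters * (n_raters - 1))
        for pos, neg in counts
    ) / n_items
    # Chance agreement from the overall category proportions.
    p_pos = sum(pos for pos, _ in counts) / (n_items * n_raters)
    p_e = p_pos ** 2 + (1 - p_pos) ** 2
    if p_e == 1.0:
        return float("nan")  # undefined when only one label was ever used
    return (p_bar - p_e) / (1 - p_e)

# Made-up labels from three evaluator profiles for four responses.
ratings = [
    [True, True, True],
    [False, False, False],
    [True, True, False],
    [False, False, True],
]
print([majority_sufficient(r) for r in ratings])
print(fleiss_kappa(ratings))
```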
The use of LLM-as-a-judge was adopted for scalability and consistency, but it is also a methodological limitation. The evaluation does not replace human expert assessment of organizational usefulness, contextual appropriateness, or politically sensitive business communication. To reduce evaluation arbitrariness, the judge used a structured sufficiency rubric and returned a constrained JSON output. The evaluator was instructed to assess whether the answer was acceptable for practical enterprise use, followed the prompt, was coherent, did not omit critical requested elements, and did not introduce obvious risky claims. The evaluator did not compare alternative model answers directly, but assessed each answer against the prompt and sufficiency criteria. Nevertheless, the absence of independent human validation means that the reported sufficiency rates should be interpreted as a structured automated evaluation rather than as final evidence of human-perceived business usefulness. This step was included because response sufficiency is a utility-oriented measure and may involve judgment under uncertainty. The main effectiveness metric was the sufficiency rate (SR):
$$SR_s = \frac{N_s^{\text{suff}}}{N},$$
where $N_s^{\text{suff}}$ denotes the number of prompts for which strategy $s$ produced a sufficient response, and $N$ denotes the total number of prompts.
Cost was measured at the token level. For each strategy, the following cost indicators were calculated:
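$$\bar{C}_s = \frac{1}{N} \sum_{i=1}^{N} C_{i,s}^{\text{total}}, \qquad CPS_s = \frac{\sum_{i=1}^{N} C_{i,s}^{\text{total}}}{N_s^{\text{suff}}},$$
where $\bar{C}_s$ denotes the average cost per prompt and $CPS_s$ denotes the cost per sufficient response. Here $C_{i,s}^{\text{total}}$ is the total cost of handling prompt $i$ under strategy $s$, $N$ is the total number of prompts, and $N_s^{\text{suff}}$ is the number of prompts for which strategy $s$ produced a sufficient response. For routing strategies, the total cost included both the router cost and the selected response-model cost. For fixed strategies, the total cost included only the cost of the selected response model.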
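The effectiveness and cost indicators can be computed per strategy from prompt-level records, as in the sketch below. The input values are made up for illustration and `strategy_metrics` is a hypothetical helper name.

```python
def strategy_metrics(costs: list[float], sufficient: list[bool]) -> dict[str, float]:
    """Sufficiency rate, average cost per prompt, and cost per sufficient response."""
    n = len(costs)
    n_suff = sum(sufficient)
    total = sum(costs)
    return {
        "sufficiency_rate": n_suff / n,
        "avg_cost_per_prompt": total / n,
        # Undefined if no response was sufficient; reported here as infinity.
        "cost_per_sufficient": total / n_suff if n_suff else float("inf"),
    }

# Made-up per-prompt totals (USD) and majority-vote sufficiency labels for one strategy.
example_costs = [0.002, 0.004, 0.002, 0.004]
example_labels = [True, True, False, True]
print(strategy_metrics(example_costs, example_labels))
```

Because insufficient responses still incur cost, the cost per sufficient response is never lower than the average cost per prompt.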
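To compare the economic efficiency of strategies that improved sufficiency relative to the always-cheap baseline, the incremental cost of sufficiency gain was calculated as:
$$ICSG_s = \frac{\bar{C}_s - \bar{C}_{\text{cheap}}}{SR_s - SR_{\text{cheap}}},$$
where $\bar{C}_s$ and $SR_s$ are the average cost per prompt and the sufficiency rate (in percentage points) of strategy $s$, and the subscript "cheap" refers to the always-cheap strategy.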
This metric expresses the additional cost required to increase the sufficiency rate by one percentage point relative to the always-cheap strategy.
In addition to cost and sufficiency metrics, latency was measured for each strategy. For fixed strategies, latency corresponded to response-generation latency. For routing strategies, latency included both router latency and selected response-model latency. Average latency and 95th percentile latency were reported.
The proposed SAW routing strategy was evaluated against the following baselines: always-cheap, always-strong, keyword-risk routing, token-threshold routing, TF-IDF centroid routing, logistic-regression routing, SAW without confidence margin, multiplicative SAW, and TOPSIS. The baseline strategies were operationalized as follows. The keyword-risk baseline escalated prompts containing predefined risk-sensitive terms or belonging to high-risk categories to the stronger model. The token-threshold baseline routed prompts to the stronger model when the prompt length exceeded a predefined input-token threshold. The TF-IDF centroid baseline represented prompts using TF-IDF vectors and assigned them according to similarity to cheap- and strong-oriented centroids derived from routing labels. The logistic-regression baseline used prompt-level features to predict whether the stronger model should be selected. The logistic-regression baseline was evaluated using stratified cross-validation to avoid training and testing the classifier on the same prompt instances.
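The two simplest heuristic baselines can be sketched as follows. The term list and the threshold are illustrative assumptions, since this section does not give the predefined values, and whitespace splitting is only a crude stand-in for a real tokenizer.

```python
# Illustrative values only; the study's term list and threshold are not given here.
RISK_TERMS = ("contract", "legal", "compliance", "termination")
TOKEN_THRESHOLD = 200

def keyword_risk_route(prompt: str, high_risk_category: bool = False) -> str:
    """Escalate if the prompt is in a high-risk category or mentions a risk term."""
    text = prompt.lower()
    if high_risk_category or any(term in text for term in RISK_TERMS):
        return "strong"
    return "cheap"

def token_threshold_route(prompt: str, threshold: int = TOKEN_THRESHOLD) -> str:
    """Escalate long prompts; length is approximated by whitespace-separated words."""
    return "strong" if len(prompt.split()) > threshold else "cheap"

print(keyword_risk_route("Draft a reply about the contract renewal terms."))
print(keyword_risk_route("Summarize the meeting notes in three bullets."))
print(token_threshold_route("word " * 250))
```

Neither heuristic uses the criterion scores or the AHP weights, which is what makes them useful reference points for the multicriteria rule.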
Alternative MCDM baselines included SAW without confidence margin, multiplicative SAW, and TOPSIS, using the same criterion scores and AHP-derived weights as the proposed method. These baselines were included to compare the proposed framework not only with fixed strategies, but also with simple heuristic, learned, and alternative multicriteria routing rules. This comparison was introduced to avoid evaluating the proposed method only against trivial fixed strategies.
Statistical uncertainty was addressed in three ways. First, Wilson confidence intervals were reported for sufficiency rates. Second, confidence intervals were calculated for the average cost per prompt. Third, paired exact McNemar tests were used to compare sufficiency outcomes between the proposed routing strategy and alternative strategies at the prompt level.
Finally, several robustness checks were conducted. The analysis included confidence-margin sensitivity, perturbation-based weight sensitivity, alternative aggregation methods, criterion ablation, and stratified performance analysis by business function, task type, and risk level. These checks were used to assess whether the results depended on a narrow configuration of weights, margin settings, or prompt categories.
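Two of the uncertainty measures named above, the Wilson interval and the exact McNemar test, can be computed with the standard library alone. The sketch below uses the textbook formulas; the counts are made up for illustration and are not results from the study.

```python
from math import comb, sqrt
from statistics import NormalDist

def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts b and c.

    b: prompts sufficient under strategy A only; c: sufficient under strategy B only.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Made-up counts for illustration only.
print(wilson_interval(successes=50, n=100))
print(mcnemar_exact(b=5, c=0))
```

The McNemar test uses only discordant prompts, that is, prompts on which the two strategies differ in sufficiency, which is why it suits paired comparisons on the same prompt set.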