2.1. Wildfire Risk Assessment Task and AHP Modeling Structure
To explore the potential of LLMs in supporting structured decision-making processes, this study designs a wildfire risk assessment task as a representative multi-criteria decision analysis problem. Wildfire risk assessment is a typical high-stakes, high-complexity decision environment, encompassing a wide range of environmental, climatic, and spatial factors. This task formulation enables the construction of a hierarchical indicator system, benchmarking against expert-defined baselines, and the application of semantically rich reasoning, all of which serve as critical dimensions for evaluating the capabilities of LLMs in complex, structured assessment contexts.
To replicate the logical structure of AHP while alleviating its stringent mathematical constraints, we adopt an AHP-inspired hierarchical weighting framework implemented through LLM prompting and construct a two-level hierarchical evaluation model [12]. At the first level, LLMs estimate the relative importance of the four primary criteria. At the second level, they evaluate the importance of sub-criteria within each primary criterion. The final weight of each sub-criterion is calculated by multiplying its local weight by the normalized global weight of its parent criterion. This design preserves the hierarchical reasoning process of traditional AHP while replacing pairwise comparison matrices with direct importance scoring. Consequently, the framework does not include pairwise comparison matrices or the associated consistency ratio checks used in classical AHP.
Based on this framework, we define four primary wildfire risk evaluation dimensions: forest structure, topography, environment, and climate. Each category contains three to five sub-criteria; the sixteen sub-criteria are (1) species composition, (2) development stage, (3) stand crown closure, (4) aspect, (5) slope, (6) elevation, (7) topographic wetness index, (8) distance from settlement, (9) distance from agriculture, (10) distance from road, (11) distance from river, (12) population density, (13) temperature, (14) precipitation, (15) wind speed, and (16) solar radiation. For instance, forest structure encompasses species composition, development stage, and stand crown closure; environment includes distance from settlements, roads, agriculture, and rivers.
2.2. Design Framework for LLM-Augmented Weighting Strategies
In this study, we adopted five LLMs with diverse training corpora and architectural features: ChatGPT-4o (OpenAI, San Francisco, CA, USA), Gemini-2.0 (Google LLC, Mountain View, CA, USA), Baichuan-3 (Beijing Baichuan Intelligence Technology Co., Ltd., Beijing, China), Kimi-1.5 (Beijing Moonshot AI Technology Co., Ltd., Beijing, China), and ChatGLM-4 (Beijing Zhipu Huazhang Technology Co., Ltd., Beijing, China). These models were selected to represent diverse architectures, training corpora, and deployment ecosystems, enabling evaluation of whether the behavior of LLM-augmented decision-making strategies remains consistent across heterogeneous platforms. All models were accessed through their official interfaces, and identical prompt instructions were applied to ensure a fair comparison [26]. For brevity, the models are hereafter referred to by their short names without version numbers unless otherwise specified.
To simulate the situation in which experts consult multiple studies when forming judgments, the prompts provide the models with indicator systems and corresponding weight information reported in several wildfire risk assessment studies. The LLMs then synthesize the indicator classifications and weight information from different sources to generate importance scores for the evaluation criteria. The resulting rankings are subsequently compared with the expert-derived weight rankings to assess the level of agreement between LLM-assisted decision outputs and expert judgments under different prompting strategies.
Studies have shown that prompt engineering exerts a substantial influence on the output of LLMs [27,28,29]. As shown in Table 1, to comprehensively evaluate the effectiveness of various LLM-augmented AHP strategies, we designed a controlled experimental framework evaluating four representative decision-making protocols: Direct LLM Scoring (DLS), Multi-Model Debate Scoring (MDS), Full-Document Prompting (FDP), and Indicator-Guided Prompting (IGP).
These four methods span a spectrum, from open-ended to highly structured prompting, from single-agent to multi-agent reasoning, and from context-light to context-rich inputs. Each method follows a distinct prompting design: DLS, simple instruction-based input with unordered factor lists; MDS, collaborative prompting across multiple LLMs with iterative discussion; FDP, domain-specific literature appended to the prompt as context; IGP, structured input based on extracted indicators and RAG-style formatting. To reduce lexical variability and ensure consistency, all prompts were constructed in standardized English and passed through formatting normalization steps prior to model input. These methods provide a comprehensive foundation for analyzing trade-offs in LLM-enabled decision support.
2.2.1. Direct LLM Scoring (DLS)
Among the four representative decision-making protocols, Direct LLM Scoring (DLS) decomposes the decision problem into a standard AHP hierarchy, i.e., goal, primary criteria, and sub-criteria, and embeds these components into a prompt template for zero-shot inference by LLMs. Our DLS framework retains the core principles of baseline scoring while adopting a hierarchical architecture to enhance interpretability and enable nuanced comparisons. In the initial phase, we established an evaluation schema comprising four factor groups: forest structure, topographic characteristics, environmental settings, and climatic parameters. In the DLS strategy, weight extraction was based solely on the numerical scores generated by the LLMs, and not on any reasoning text. Each LLM was instructed to assign an importance score from 1 to 9 to each group, with higher values indicating greater inferred influence on wildfire initiation and spread. This coarse-level scoring was designed to capture the models’ conceptualization of fire dynamics and reveal their inherent prioritization patterns.
Building upon these primary assessments, we then implemented a secondary, more granular scoring procedure within each principal category. As shown in Figure 2, the “forest structure” category was subdivided into three exemplar sub-factors, i.e., species composition, developmental stage, and stand crown closure, each of which the LLMs subsequently rated on the same nine-point scale. This two-tiered strategy was designed to extract detailed attributions of importance at the sub-factor level, thereby illuminating the models’ internal reasoning with greater specificity.
To facilitate meaningful cross-level synthesis and direct benchmarking against the established baseline, all raw scores underwent a systematic normalization process. Specifically, the initial category-level scores were first converted into normalized weight coefficients whose sum equals unity, ensuring that each dimension’s relative contribution could be directly compared. Concurrently, the sub-factor scores within each category were standardized, such that their intra-category sum likewise conformed to a unity constraint. The ultimate composite importance measure for each sub-factor was obtained by multiplying its standardized sub-factor score by the corresponding category weight. This hierarchical normalization scheme guaranteed coherence across scales and provided a rigorous basis for evaluating the congruence between LLM-derived weightings and baseline expert judgments in the context of wildfire risk assessment.
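The hierarchical normalization described above can be sketched as follows. The 1–9 scores below are illustrative placeholders, not outputs of any of the evaluated models; only the normalization and composition logic mirrors the text.

```python
# Minimal sketch of the two-tier normalization: category scores and
# intra-category sub-factor scores each sum to one, and composite
# importance is local weight times parent category weight.

def normalize(scores):
    """Scale raw 1-9 scores so they sum to one."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

# Hypothetical category-level scores (1-9 scale).
category_scores = {"forest_structure": 8, "topography": 6, "environment": 5, "climate": 7}

# Hypothetical sub-factor scores within the forest-structure category.
sub_scores = {"species_composition": 7, "development_stage": 5, "stand_crown_closure": 8}

category_weights = normalize(category_scores)  # sums to 1 across categories
local_weights = normalize(sub_scores)          # sums to 1 within the category

# Composite importance for each sub-factor.
composite = {k: v * category_weights["forest_structure"] for k, v in local_weights.items()}
```

By construction, the composite weights of the sub-factors within a category sum to that category's global weight, which is what guarantees coherence across the two scoring levels.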
Table 2 presents an illustrative example of the Direct LLM Scoring (DLS) approach. For completeness, the prompt designs for the remaining methods are included in the Prompt section of the Supporting Information.
2.2.2. Multi-Model Debate Scoring (MDS)
Multi-Model Debate Scoring (MDS) addresses DLS’s limitations by treating multiple LLMs as autonomous reasoning agents, each scoring and justifying the decision independently. These agents then engage in iterative, structured debate rounds, during which they exchange viewpoints, critique peer outputs, and progressively refine their responses. The MDS approach seeks to emulate the collaborative decision-making processes of human expert panels by leveraging complementary model capabilities. Through iterative rounds of “position articulation, debate interaction, consensus refinement”, the method aims to enhance the objectivity, rationality, and comprehensiveness of model-based scoring outcomes. This deliberative process facilitates the emergence of a consensus ranking that is more robust and better aligned with expert judgment.
In the MDS method, the five adopted LLMs (i.e., ChatGPT, ChatGLM, Baichuan, Kimi, and Gemini) are treated as “virtual experts” that participate in scoring the importance of wildfire risk assessment indicators. Each model first independently scores the four primary dimensions of wildfire risk (forest structure, topography, environmental conditions, and climate variables) on a scale from 1 to 9, providing reasoning and justifications for its scores.
The scoring process follows a two-tier structure. The entire hierarchical scoring workflow, along with the evolution of model assessments across both levels, is summarized in Figure 3. The first tier involves scoring the importance of the four primary dimensions, while the second tier involves scoring the specific variables within each dimension. During the first-tier scoring, each model initially provides its own scores and reasoning. Subsequently, through multiple rounds of discussion, the models engage in cross-questioning and exchange viewpoints regarding the rationale behind their scores.
In each subsequent round, every LLM first explained the rationale for its own ratings, and these justifications were then provided as additional input prompts to peer models. Under this structured feedback mechanism, each model refined its scores based on the arguments generated by other models. Model consistency was measured using the Pearson coefficient, with 100% indicating that the output scores of all participating LLMs were completely identical. The debate proceeded iteratively until a predefined convergence criterion was satisfied. A 95% convergence threshold was adopted as a practical stopping rule to indicate near-consensus and to avoid unnecessary additional debate rounds once score updates became marginal. Specifically, the process was terminated when the change in aggregated scores between two consecutive rounds fell below the threshold, suggesting that further interaction was unlikely to produce meaningful revision of the final ranking. This rule was intended to balance deliberative refinement against the added computational cost of repeated model interaction. Nevertheless, convergence should not be interpreted as equivalent to correctness, as repeated interaction may also introduce convergence pressure and lead to artificial consensus. Therefore, the MDS mechanism is treated here as an exploratory debate-based scoring strategy rather than as a guaranteed path to improved expert alignment. A sensitivity analysis with alternative convergence thresholds may provide a useful robustness check for the MDS strategy and is left for future work.
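One possible operationalization of this stopping rule is sketched below, expressing round-to-round stability as a Pearson correlation between consecutive aggregated score vectors and stopping once it reaches the 95% threshold. The score vectors and the exact form of the rule are illustrative assumptions, not recorded values from the debate runs.

```python
# Sketch of an MDS convergence check over aggregated criterion scores.
from statistics import mean

def pearson(x, y):
    """Plain Pearson correlation between two equal-length score vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def has_converged(prev_round, curr_round, threshold=0.95):
    """Stop debating when consecutive rounds are near-identical,
    i.e., their Pearson correlation meets the threshold."""
    return pearson(prev_round, curr_round) >= threshold

# Hypothetical aggregated scores for the four primary criteria.
round_1 = [8.0, 6.2, 5.4, 7.1]
round_2 = [7.8, 6.3, 5.5, 7.0]
converged = has_converged(round_1, round_2)
```

A production version would also guard against constant score vectors (zero variance) and cap the number of rounds, consistent with the cost argument in the text.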
2.2.3. Full-Document Prompting (FDP)
Full-Document Prompting (FDP) is designed to strengthen the decision-making capabilities of LLMs in complex environmental assessment tasks by embedding task-relevant scientific literature directly into the contextual prompt. Building upon the DLS framework, FDP augments the base scoring template with comprehensive domain-specific background materials, including peer-reviewed research articles, technical reports, and authoritative datasets, comprising a total of 11 research papers [30,31,32,33,34,35,36,37,38,39,40]. These materials are integrated in full rather than as isolated excerpts, ensuring that the LLM is exposed to the complete methodological context, definitions, and empirical findings relevant to wildfire risk assessment.
The core premise of FDP is that providing LLMs with rich, authentic, and scientifically validated contextual information can anchor their reasoning in established domain knowledge. By grounding the evaluation process in authoritative sources, FDP is intended to improve the scientific basis, transparency, and interpretability of model-generated judgments in wildfire risk assessment. However, because the literature is incorporated in largely unfiltered full-text form, the resulting context may also include information that is not equally relevant to the scoring task, thereby increasing the risk of interpretive distraction.
For the FDP strategy, the literature content was provided as a continuous input through the hosted web interfaces of the evaluated models, without manual summarization, segmentation, or truncation. Because the exact token length and platform-side context handling were not systematically archived, their effects cannot be fully quantified retrospectively and are acknowledged here as a limitation.
2.2.4. Indicator-Guided Prompting (IGP)
Indicator-Guided Prompting (IGP) adopts a structured knowledge interface inspired by retrieval-augmented workflows. The approach begins with a domain-specific indicator extraction module that identifies high-consensus core risk factors in the existing literature and assigns them initial weights; these structured indicators and approximate weight references are extracted offline from validated literature sources. The indicators are then embedded into the prompt context alongside the evaluation task as explicit, semantically anchored cues, which transforms the LLM’s implicit knowledge into a static, high-relevance context window and guides its reasoning toward domain-aligned interpretations. In our IGP method, this structured knowledge interface for the AHP-based scoring task is operationalized through indicator extraction, filtering, and prompt-level organization.
Compared with the FDP approach, which provides full-document content with limited prior structuring, the IGP method emphasizes an explicit knowledge-preparation pipeline, including indicator extraction and prompt structuring.
(1) Indicator extraction: Relevant criteria and sub-criteria with high domain consensus are first extracted from existing literature. Where available, expert-assigned weights or relative importance intervals are also included to form a set of candidate knowledge elements.
(2) Prompt structuring: Extracted indicators are organized and embedded into the input prompt alongside a task-specific instruction, such as: “Based on the retrieved expert knowledge, please score the importance of the following wildfire risk criteria using a 1–9 AHP scale.”
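The two steps above can be sketched as a small prompt-assembly routine. The indicator names and reference weights below are invented examples, and the instruction wording simply mirrors the sample instruction in the text; the actual IGP templates are given in the Supporting Information.

```python
# Minimal sketch of IGP prompt assembly from pre-extracted indicators.

# Hypothetical indicators extracted offline from the literature.
indicators = [
    {"criterion": "Climate", "sub_criterion": "Temperature", "ref_weight": 0.12},
    {"criterion": "Climate", "sub_criterion": "Wind speed", "ref_weight": 0.10},
    {"criterion": "Topography", "sub_criterion": "Slope", "ref_weight": 0.09},
]

def build_igp_prompt(indicators):
    """Render indicators as explicit cues, then append the scoring instruction."""
    lines = ["Retrieved expert knowledge (criterion | sub-criterion | reference weight):"]
    for ind in indicators:
        lines.append(f"- {ind['criterion']} | {ind['sub_criterion']} | {ind['ref_weight']:.2f}")
    lines.append(
        "Based on the retrieved expert knowledge, please score the importance "
        "of the following wildfire risk criteria using a 1-9 AHP scale."
    )
    return "\n".join(lines)

prompt = build_igp_prompt(indicators)
```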
We extracted key information from the AHP indicator–weight tables reported in the collected literature. Specifically, for each paper, we parsed the fields of main criterion, sub-criterion, main weight, and sub-weight, and standardized them into a six-column structure: paper_id, paper_title, main criterion, sub-criterion, main weight, and sub-weight. If a paper provided only sub-criterion weights, the main criterion and main weight fields were left blank. Among the 11 reviewed papers [30,31,32,33,34,35,36,37,38,39,40], seven (IDs: 02, 03, 04, 06, 07, 09, and 10) contained explicit indicator weights, yielding a total of 83 structured records. A reproducible data file is provided in the Supporting Information, serving as the standardized input and reference for the subsequent LLM–AHP comparative analysis. An example record is shown in Table 3.
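The six-column standardization can be illustrated as below. The paper titles and weight values are hypothetical placeholders, not records from the released data file; the sketch only shows the column structure and the blank-field convention for papers that report sub-criterion weights alone.

```python
# Sketch of the standardized six-column record structure.
import csv
import io

FIELDS = ["paper_id", "paper_title", "main_criterion", "sub_criterion",
          "main_weight", "sub_weight"]

raw_rows = [
    # A paper reporting weights at both hierarchy levels.
    ("02", "Example study A", "Climate", "Temperature", "0.35", "0.12"),
    # A paper reporting only sub-criterion weights: main fields left blank.
    ("07", "Example study B", "", "Slope", "", "0.09"),
]

# Write and re-read as CSV to mimic the reproducible data file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(FIELDS)
writer.writerows(raw_rows)
records = list(csv.DictReader(io.StringIO(buffer.getvalue())))
```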
2.3. Experimental Settings
Within each experimental strategy, identical prompt instructions were applied across models to standardize the input conditions as much as possible and improve the fairness of cross-model comparison, while different strategies adopted distinct prompt designs according to their methodological requirements. The detailed experimental design is illustrated in Table 4, and the prompt templates are provided in the Supporting Information. Because all models were accessed as hosted services rather than locally deployed systems, inference parameters such as temperature, decoding strategy, and sampling configuration followed the default settings of the corresponding platforms. Detailed hardware configurations were therefore not directly accessible in the present experimental setting. Further reproducibility details, including model versions, access mode, output usage, and the score processing pipeline, are summarized in Appendix A.
Under each experimental strategy, the models generated importance scores for the evaluation criteria according to the hierarchical indicator structure, from which the corresponding weights and ranking results were derived. Except for the MDS strategy, which involved iterative multi-round interaction before reaching a final consensus output, each model produced one final result under each strategy. For method-level comparison, the results from the five LLMs under the same strategy were summarized as mean values with corresponding standard deviations (mean ± SD). Here, the standard deviation reflects variability across models within the same strategy rather than repeated-run variability of an individual model.
For strategies involving contextual knowledge, including FDP and IGP, the relevant wildfire risk assessment literature was incorporated directly into the prompt context to provide structured domain information that supports the models in evaluating criterion importance.
The expert baseline used in this study was derived from a published wildfire risk assessment study that reports a complete hierarchical indicator system and corresponding expert-assigned weights. This study was selected as the benchmark because it provides explicit weights for all sub-criteria together with a clearly defined hierarchical structure, thereby enabling a transparent comparison between expert-derived rankings and LLM-generated results. The baseline weights were extracted directly from the reported results of the reference study and converted into ranking orders according to their relative importance. It should be noted that the expert baseline was derived from a single published source. Therefore, the reported results should be interpreted as agreement with that specific reference baseline rather than robustness across multiple plausible expert judgment settings. If a different but equally valid expert-derived baseline were adopted as the benchmark, the comparative outcomes might change accordingly.
A one-to-one mapping was established between the 16 sub-criteria used in the present study and those reported in the benchmark study [12]. Since the reference provides explicit weights for each sub-criterion, no secondary estimation or interpolation was required during the extraction process. The resulting rankings serve as the expert baseline against which the outputs of different LLM-assisted strategies are compared. For transparency and auditability, the final baseline weights and rankings for all sub-criteria are shown in Table 5 and Table 6.
2.4. Evaluation Metrics
All four methods were evaluated under identical decision scenarios with rigorously standardized input conditions to ensure fair comparisons. Performance assessment focused on two primary dimensions: accuracy and output performance. In this study, score refers to the direct numeric assessment assigned by an LLM to a criterion or sub-criterion, weight refers to the normalized importance derived from these scores, and rank denotes the ordering of criteria based on their weights. A criterion refers to a first-level evaluation factor, whereas a sub-criterion refers to a lower-level factor under a given criterion. Accuracy was assessed based on the agreement between LLM-generated results and the expert baseline, as measured by the adopted correlation metrics. Output performance was assessed at both the strategy level and the individual model level.
(1) Accuracy Metrics:
Pearson’s correlation coefficient (R), Spearman’s rank correlation coefficient (ρ), and Kendall’s tau (τ) were used to evaluate the agreement between LLM-generated results and the expert baseline derived from the reference study. Although the comparative interpretation in this study involves ranking structures, the underlying outputs are continuous normalized weights rather than purely ordinal labels; Pearson’s R is therefore treated as the primary summary metric. Under this formulation, R captures the overall linear association between model-assigned and expert-assigned weights, whereas Spearman’s ρ and Kendall’s τ are more directly informative for ordinal structure. Taken together, these metrics allow the evaluation framework to reflect both numerical agreement in weighting structure and ordinal agreement in ranking relations. In the present study, the main comparative conclusions are directionally consistent across all three metrics. The comparisons are based primarily on correlation metrics and descriptive statistics, consistent with the descriptive-comparative design of the evaluation framework. Since each model–strategy condition contributed a single final output rather than a distribution of repeated within-model trials, formal significance testing and interval estimation were not incorporated into the present analysis.
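The three metrics can be computed on a pair of weight vectors as sketched below; `scipy.stats` provides equivalent functions (`pearsonr`, `spearmanr`, `kendalltau`), but the sketch stays dependency-free. The expert and model weight vectors are hypothetical examples, and the tie-free ranking logic assumes all weights are distinct, as in the illustrative data.

```python
# Self-contained agreement metrics on weight vectors (no ties assumed).
from statistics import mean

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    """Rank weights in descending order (1 = most important)."""
    order = sorted(range(len(x)), key=lambda i: -x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    # Spearman's rho is Pearson's R applied to the rank vectors.
    return pearson_r(ranks(x), ranks(y))

def kendall_tau(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (rx[i] - rx[j]) * (ry[i] - ry[j])
            concordant += s > 0
            discordant += s < 0
    return (concordant - discordant) / (n * (n - 1) / 2)

expert = [0.30, 0.25, 0.20, 0.15, 0.10]  # hypothetical baseline weights
model = [0.28, 0.22, 0.24, 0.14, 0.12]   # hypothetical LLM-derived weights
```

On this example, the model swaps the ranks of the second and third criteria, so R remains high while τ drops below 1, illustrating why the numerical and ordinal views are reported together.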
(2) Output performance:
For method-level performance comparison, the results obtained from the tested LLMs under the same strategy were aggregated, with higher average values indicating stronger overall agreement with the expert baseline. Except for the MDS strategy, which involved iterative multi-round interaction, each model generated a single result under each strategy. Therefore, the reported mean values represent aggregation across models within the same strategy rather than averages over repeated runs of an individual model. To identify the best-performing evaluation model within each method, we applied a maximum-correlation criterion, selecting the LLM that achieved the highest correlation value among all tested models. This dual approach ensures a balanced evaluation of strategy-level comparative performance and individual model capability.
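The two aggregation views can be expressed in a few lines. The per-model correlation values below are invented for illustration and are not results from the study.

```python
# Sketch of strategy-level aggregation and the maximum-correlation criterion.

# Hypothetical correlations with the expert baseline for one strategy.
model_correlations = {
    "ChatGPT": 0.91,
    "Gemini": 0.87,
    "Baichuan": 0.84,
    "Kimi": 0.88,
    "ChatGLM": 0.86,
}

# Strategy-level view: mean across the five models.
strategy_mean = sum(model_correlations.values()) / len(model_correlations)

# Model-level view: the best-performing model under this strategy.
best_model = max(model_correlations, key=model_correlations.get)
```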