2.1. Wildfire Risk Assessment Task and AHP Modeling Structure
To explore the potential of LLMs in supporting structured decision-making processes, this study designs a wildfire risk assessment task as a representative multi-criteria decision analysis problem. Wildfire risk assessment is a typical high-stakes, high-complexity decision environment, encompassing a wide range of environmental, climatic, and spatial factors. This task formulation enables the construction of a hierarchical indicator system, benchmarking against expert-defined baselines, and the application of semantically rich reasoning, all of which serve as critical dimensions for evaluating the capabilities of LLMs in complex, structured assessment contexts.
To replicate the logical structure of AHP while alleviating its stringent mathematical constraints, we adopt an AHP-inspired hierarchical weighting framework implemented through LLM prompting and construct a two-level hierarchical evaluation model [12]. At the first level, LLMs estimate the relative importance of the four primary criteria. At the second level, they evaluate the importance of sub-criteria within each primary criterion. The final weight of each sub-criterion is calculated by multiplying its local weight by the normalized global weight of its parent criterion. This design preserves the hierarchical reasoning process of traditional AHP while replacing pairwise comparison matrices with direct importance scoring. Consequently, the framework does not include pairwise comparison matrices or the associated consistency ratio checks used in classical AHP.
Based on this framework, we define four primary wildfire risk evaluation dimensions: forest structure, topography, environment, and climate. Each category contains three to five sub-criteria; the sixteen sub-criteria are (1) species composition, (2) development stage, (3) stand crown closure, (4) aspect, (5) slope, (6) elevation, (7) topographic wetness index, (8) distance from settlement, (9) distance from agriculture, (10) distance from road, (11) distance from river, (12) population density, (13) temperature, (14) precipitation, (15) wind speed, and (16) solar radiation. For instance, forest structure encompasses species composition, development stage, and stand crown closure; environment includes distance from settlements, roads, agriculture, and rivers.
2.2. Design Framework for LLM-Augmented Weighting Strategies
In this study, we adopted five LLMs with diverse training corpora and architectural features: ChatGPT-4o (OpenAI, San Francisco, CA, USA), Gemini-2.0 (Google LLC, Mountain View, CA, USA), Baichuan-3 (Beijing Baichuan Intelligence Technology Co., Ltd., Beijing, China), Kimi-1.5 (Beijing Moonshot AI Technology Co., Ltd., Beijing, China), and ChatGLM-4 (Beijing Zhipu Huazhang Technology Co., Ltd., Beijing, China). These models were selected to represent diverse architectures, training corpora, and deployment ecosystems, enabling evaluation of whether the behavior of LLM-augmented decision-making strategies remains consistent across heterogeneous platforms. All models were accessed through their official interfaces, and identical prompt instructions were applied to ensure a fair comparison [26]. For brevity, the models are hereafter referred to by their short names without version numbers unless otherwise specified.
To simulate the situation in which experts consult multiple studies when forming judgments, the prompts provide the models with indicator systems and corresponding weight information reported in several wildfire risk assessment studies. The LLMs then synthesize the indicator classifications and weight information from different sources to generate importance scores for the evaluation criteria. The resulting rankings are subsequently compared with the expert-derived weight rankings to assess the level of agreement between LLM-assisted decision outputs and expert judgments under different prompting strategies.
Studies have shown that prompt engineering exerts a substantial influence on the output of LLMs [27,28,29]. As shown in Table 1, to comprehensively evaluate the effectiveness of various LLM-augmented AHP strategies, we designed a controlled experimental framework evaluating four representative decision-making protocols: Direct LLM Scoring (DLS), Multi-Model Debate Scoring (MDS), Full-Document Prompting (FDP), and Indicator-Guided Prompting (IGP).
These four methods span a spectrum, from open-ended to highly structured prompting, from single-agent to multi-agent reasoning, and from context-light to context-rich inputs. Each method follows a distinct prompting design: DLS, simple instruction-based input with unordered factor lists; MDS, collaborative prompting across multiple LLMs with iterative discussion; FDP, domain-specific literature appended to the prompt as context; IGP, structured input based on extracted indicators and RAG-style formatting. To reduce lexical variability and ensure consistency, all prompts were constructed in standardized English and passed through formatting normalization steps prior to model input. These methods provide a comprehensive foundation for analyzing trade-offs in LLM-enabled decision support.
2.2.1. Direct LLM Scoring (DLS)
Among the four representative decision-making protocols, Direct LLM Scoring (DLS) decomposes the decision problem into a standard AHP hierarchy, i.e., goal, primary criteria, and sub-criteria, and embeds these components into a prompt template for zero-shot inference by LLMs. Our DLS framework retains the core principles of baseline scoring while adopting a hierarchical architecture to enhance interpretability and enable nuanced comparisons. In the initial phase, we established an evaluation schema comprising four factor groups: forest structure, topographic characteristics, environmental settings, and climatic parameters. In the DLS strategy, weight extraction was based solely on the numerical scores generated by the LLMs, and not on any reasoning text. Each LLM was instructed to assign an importance score from 1 to 9 to each group, with higher values indicating greater inferred influence on wildfire initiation and spread. This coarse-level scoring was designed to capture the models’ conceptualization of fire dynamics and reveal their inherent prioritization patterns.
Building upon these primary assessments, we then implemented a secondary, more granular scoring procedure within each principal category. As shown in Figure 2, the “forest structure” category was subdivided into three exemplar sub-factors, i.e., species composition, developmental stage, and stand crown closure, each of which the LLMs subsequently rated on the same nine-point scale. This two-tiered strategy was designed to extract detailed attributions of importance at the sub-factor level, thereby illuminating the models’ internal reasoning with greater specificity.
To facilitate meaningful cross-level synthesis and direct benchmarking against the established baseline, all raw scores underwent a systematic normalization process. Specifically, the initial category-level scores were first converted into normalized weight coefficients whose sum equals unity, ensuring that each dimension’s relative contribution could be directly compared. Concurrently, the sub-factor scores within each category were standardized, such that their intra-category sum likewise conformed to a unity constraint. The ultimate composite importance measure for each sub-factor was obtained by multiplying its standardized sub-factor score by the corresponding category weight. This hierarchical normalization scheme guaranteed coherence across scales and provided a rigorous basis for evaluating the congruence between LLM-derived weightings and baseline expert judgments in the context of wildfire risk assessment.
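The hierarchical normalization described above can be sketched as follows. The 1–9 scores below are illustrative placeholders, not outputs of any of the evaluated models; only the normalization and composition logic mirrors the text.

```python
# Minimal sketch of the two-tier normalization: category scores and
# intra-category sub-factor scores each sum to one, and composite
# importance is local weight times parent category weight.

def normalize(scores):
    """Scale raw 1-9 scores so they sum to one."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

# Hypothetical category-level scores (1-9 scale).
category_scores = {"forest_structure": 8, "topography": 6, "environment": 5, "climate": 7}

# Hypothetical sub-factor scores within the forest-structure category.
sub_scores = {"species_composition": 7, "development_stage": 5, "stand_crown_closure": 8}

category_weights = normalize(category_scores)  # sums to 1 across categories
local_weights = normalize(sub_scores)          # sums to 1 within the category

# Composite importance for each sub-factor.
composite = {k: v * category_weights["forest_structure"] for k, v in local_weights.items()}
```

By construction, the composite weights of the sub-factors within a category sum to that category's global weight, which is what guarantees coherence across the two scoring levels.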
Table 2 presents an illustrative example of the Direct LLM Scoring (DLS) approach. For completeness, the prompt designs for the remaining methods are included in the Prompt section of the Supporting Information.
2.2.2. Multi-Model Debate Scoring (MDS)
Multi-Model Debate Scoring (MDS) addresses DLS’s limitations by treating multiple LLMs as autonomous reasoning agents, each scoring and justifying the decision independently. These agents then engage in iterative, structured debate rounds, during which they exchange viewpoints, critique peer outputs, and progressively refine their responses. The MDS approach seeks to emulate the collaborative decision-making processes of human expert panels by leveraging complementary model capabilities. Through iterative rounds of “position articulation, debate interaction, consensus refinement”, the method aims to enhance the objectivity, rationality, and comprehensiveness of model-based scoring outcomes. This deliberative process facilitates the emergence of a consensus ranking that is more robust and better aligned with expert judgment.
In the MDS method, the five adopted LLMs (i.e., ChatGPT, ChatGLM, Baichuan, Kimi, and Gemini) are treated as “virtual experts” that participate in scoring the importance of wildfire risk assessment indicators. Each model first independently scores the four primary dimensions of wildfire risk (forest structure, topography, environmental conditions, and climate variables) on a scale from 1 to 9, providing reasoning and justifications for its scores.
The scoring process follows a two-tier structure. The entire hierarchical scoring workflow, along with the evolution of model assessments across both levels, is summarized in Figure 3. The first tier involves scoring the importance of the four primary dimensions, while the second tier involves scoring the specific variables within each dimension. During the first-tier scoring, each model initially provides its own scores and reasoning. Subsequently, through multiple rounds of discussion, the models engage in cross-questioning and exchange viewpoints regarding the rationale behind their scores.
In each subsequent round, every LLM first explained the rationale for its own ratings, and these justifications were then provided as additional input prompts to peer models. Under this structured feedback mechanism, each model refined its scores based on the arguments generated by other models. Model consistency was measured using the Pearson coefficient, with 100% indicating that the output scores of all participating LLMs were completely identical. The debate proceeded iteratively until a predefined convergence criterion was satisfied. A 95% convergence threshold was adopted as a practical stopping rule to indicate near-consensus and to avoid unnecessary additional debate rounds once score updates became marginal. Specifically, the process was terminated when the change in aggregated scores between two consecutive rounds fell below the threshold, suggesting that further interaction was unlikely to produce meaningful revision of the final ranking. This rule was intended to balance deliberative refinement against the added computational cost of repeated model interaction. Nevertheless, convergence should not be interpreted as equivalent to correctness, as repeated interaction may also introduce convergence pressure and lead to artificial consensus. Therefore, the MDS mechanism is treated here as an exploratory debate-based scoring strategy rather than as a guaranteed path to improved expert alignment. A sensitivity analysis with alternative convergence thresholds may provide a useful robustness check for the MDS strategy and is left for future work.
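One possible operationalization of this stopping rule is sketched below, expressing round-to-round stability as a Pearson correlation between consecutive aggregated score vectors and stopping once it reaches the 95% threshold. The score vectors and the exact form of the rule are illustrative assumptions, not recorded values from the debate runs.

```python
# Sketch of an MDS convergence check over aggregated criterion scores.
from statistics import mean

def pearson(x, y):
    """Plain Pearson correlation between two equal-length score vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def has_converged(prev_round, curr_round, threshold=0.95):
    """Stop debating when consecutive rounds are near-identical,
    i.e., their Pearson correlation meets the threshold."""
    return pearson(prev_round, curr_round) >= threshold

# Hypothetical aggregated scores for the four primary criteria.
round_1 = [8.0, 6.2, 5.4, 7.1]
round_2 = [7.8, 6.3, 5.5, 7.0]
converged = has_converged(round_1, round_2)
```

A production version would also guard against constant score vectors (zero variance) and cap the number of rounds, consistent with the cost argument in the text.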
2.2.3. Full-Document Prompting (FDP)
Full-Document Prompting (FDP) is designed to strengthen the decision-making capabilities of LLMs in complex environmental assessment tasks by embedding task-relevant scientific literature directly into the contextual prompt. Building upon the DLS framework, FDP augments the base scoring template with comprehensive domain-specific background materials, including peer-reviewed research articles, technical reports, and authoritative datasets, comprising a total of 11 research papers [30,31,32,33,34,35,36,37,38,39,40]. These materials are integrated in full rather than as isolated excerpts, ensuring that the LLM is exposed to the complete methodological context, definitions, and empirical findings relevant to wildfire risk assessment.
The core premise of FDP is that providing LLMs with rich, authentic, and scientifically validated contextual information can anchor their reasoning in established domain knowledge. By grounding the evaluation process in authoritative sources, FDP is intended to improve the scientific basis, transparency, and interpretability of model-generated judgments in wildfire risk assessment. However, because the literature is incorporated in largely unfiltered full-text form, the resulting context may also include information that is not equally relevant to the scoring task, thereby increasing the risk of interpretive distraction.
For the FDP strategy, the literature content was provided as a continuous input through the hosted web interfaces of the evaluated models, without manual summarization, segmentation, or truncation. Because the exact token length and platform-side context handling were not systematically archived, their effects cannot be fully quantified retrospectively and are acknowledged here as a limitation.
2.2.4. Indicator-Guided Prompting (IGP)
Indicator-Guided Prompting (IGP) adopts a structured knowledge interface inspired by retrieval-augmented workflows. The approach begins with a domain-specific indicator extraction module that identifies high-consensus core risk factors in the existing literature and assigns them initial weights; these structured indicators and approximate weight references are extracted offline from validated literature sources. The indicators are then embedded into the prompt context alongside the evaluation task as explicit, semantically anchored cues, which transforms the LLM’s implicit knowledge into a static, high-relevance context window and guides its reasoning toward domain-aligned interpretations. In our IGP method, this structured knowledge interface for the AHP-based scoring task is operationalized through indicator extraction, filtering, and prompt-level organization.
Compared with the FDP approach, which provides full-document content with limited prior structuring, the IGP method emphasizes an explicit knowledge-preparation pipeline, including indicator extraction and prompt structuring.
(1) Indicator extraction: Relevant criteria and sub-criteria with high domain consensus are first extracted from existing literature. Where available, expert-assigned weights or relative importance intervals are also included to form a set of candidate knowledge elements.
(2) Prompt structuring: Extracted indicators are organized and embedded into the input prompt alongside a task-specific instruction, such as: “Based on the retrieved expert knowledge, please score the importance of the following wildfire risk criteria using a 1–9 AHP scale.”
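The two steps above can be sketched as a small prompt-assembly routine. The indicator names and reference weights below are invented examples, and the instruction wording simply mirrors the sample instruction in the text; the actual IGP templates are given in the Supporting Information.

```python
# Minimal sketch of IGP prompt assembly from pre-extracted indicators.

# Hypothetical indicators extracted offline from the literature.
indicators = [
    {"criterion": "Climate", "sub_criterion": "Temperature", "ref_weight": 0.12},
    {"criterion": "Climate", "sub_criterion": "Wind speed", "ref_weight": 0.10},
    {"criterion": "Topography", "sub_criterion": "Slope", "ref_weight": 0.09},
]

def build_igp_prompt(indicators):
    """Render indicators as explicit cues, then append the scoring instruction."""
    lines = ["Retrieved expert knowledge (criterion | sub-criterion | reference weight):"]
    for ind in indicators:
        lines.append(f"- {ind['criterion']} | {ind['sub_criterion']} | {ind['ref_weight']:.2f}")
    lines.append(
        "Based on the retrieved expert knowledge, please score the importance "
        "of the following wildfire risk criteria using a 1-9 AHP scale."
    )
    return "\n".join(lines)

prompt = build_igp_prompt(indicators)
```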
We extracted key information from the AHP indicator–weight tables reported in the collected literature. Specifically, for each paper, we parsed the fields of main criterion, sub-criterion, main weight, and sub-weight, and standardized them into a six-column structure: paper_id, paper_title, main criterion, sub-criterion, main weight, and sub-weight. If a paper provided only sub-criterion weights, the main criterion and main weight fields were left blank. Among the 11 reviewed papers [30,31,32,33,34,35,36,37,38,39,40], seven (IDs: 02, 03, 04, 06, 07, 09, and 10) contained explicit indicator weights, yielding a total of 83 structured records. A reproducible data file is provided in the Supporting Information, serving as the standardized input and reference for the subsequent LLM–AHP comparative analysis. An example record is shown in Table 3.
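The six-column standardization can be illustrated as below. The paper titles and weight values are hypothetical placeholders, not records from the released data file; the sketch only shows the column structure and the blank-field convention for papers that report sub-criterion weights alone.

```python
# Sketch of the standardized six-column record structure.
import csv
import io

FIELDS = ["paper_id", "paper_title", "main_criterion", "sub_criterion",
          "main_weight", "sub_weight"]

raw_rows = [
    # A paper reporting weights at both hierarchy levels.
    ("02", "Example study A", "Climate", "Temperature", "0.35", "0.12"),
    # A paper reporting only sub-criterion weights: main fields left blank.
    ("07", "Example study B", "", "Slope", "", "0.09"),
]

# Write and re-read as CSV to mimic the reproducible data file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(FIELDS)
writer.writerows(raw_rows)
records = list(csv.DictReader(io.StringIO(buffer.getvalue())))
```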
2.3. Experimental Settings
Within each experimental strategy, identical prompt instructions were applied across models to standardize the input conditions as much as possible and improve the fairness of cross-model comparison, while different strategies adopted distinct prompt designs according to their methodological requirements. The detailed experimental design is illustrated in Table 4, and the prompt templates are provided in the Supporting Information. Because all models were accessed as hosted services rather than locally deployed systems, inference parameters such as temperature, decoding strategy, and sampling configuration followed the default settings of the corresponding platforms. Detailed hardware configurations were therefore not directly accessible in the present experimental setting. Further reproducibility details, including model versions, access mode, output usage, and the score processing pipeline, are summarized in Appendix A.
Under each experimental strategy, the models generated importance scores for the evaluation criteria according to the hierarchical indicator structure, from which the corresponding weights and ranking results were derived. Except for the MDS strategy, which involved iterative multi-round interaction before reaching a final consensus output, each model produced one final result under each strategy. For method-level comparison, the results from the five LLMs under the same strategy were summarized as mean values with corresponding standard deviations (mean ± SD). Here, the standard deviation reflects variability across models within the same strategy rather than repeated-run variability of an individual model.
For strategies involving contextual knowledge, including FDP and IGP, the relevant wildfire risk assessment literature was incorporated directly into the prompt context to provide structured domain information that supports the models in evaluating criterion importance.
The expert baseline used in this study was derived from a published wildfire risk assessment study that reports a complete hierarchical indicator system and corresponding expert-assigned weights. This study was selected as the benchmark because it provides explicit weights for all sub-criteria together with a clearly defined hierarchical structure, thereby enabling a transparent comparison between expert-derived rankings and LLM-generated results. The baseline weights were extracted directly from the reported results of the reference study and converted into ranking orders according to their relative importance. It should be noted that the expert baseline was derived from a single published source. Therefore, the reported results should be interpreted as agreement with that specific reference baseline rather than robustness across multiple plausible expert judgment settings. If a different but equally valid expert-derived baseline were adopted as the benchmark, the comparative outcomes might change accordingly.
A one-to-one mapping was established between the 16 sub-criteria used in the present study and those reported in the benchmark study [12]. Since the reference provides explicit weights for each sub-criterion, no secondary estimation or interpolation was required during the extraction process. The resulting rankings serve as the expert baseline against which the outputs of different LLM-assisted strategies are compared. For transparency and auditability, the final baseline weights and rankings for all sub-criteria are shown in Table 5 and Table 6.
2.4. Evaluation Metrics
All four methods were evaluated under identical decision scenarios with rigorously standardized input conditions to ensure fair comparisons. Performance assessment focused on two primary dimensions: accuracy and output performance. In this study, score refers to the direct numeric assessment assigned by an LLM to a criterion or sub-criterion, weight refers to the normalized importance derived from these scores, and rank denotes the ordering of criteria based on their weights. A criterion refers to a first-level evaluation factor, whereas a sub-criterion refers to a lower-level factor under a given criterion. Accuracy was assessed based on the agreement between LLM-generated results and the expert baseline, as measured by the adopted correlation metrics. Output performance was assessed at both the strategy level and the individual model level.
(1) Accuracy Metrics:
Pearson’s correlation coefficient (R), Spearman’s rank correlation coefficient (ρ), and Kendall’s tau (τ) were used to evaluate the agreement between LLM-generated results and the expert baseline derived from the reference study. Although the comparative interpretation in this study involves ranking structures, the underlying outputs are continuous normalized weights rather than purely ordinal labels; Pearson’s R is therefore treated as the primary summary metric. Under this formulation, R captures the overall linear association between model-assigned and expert-assigned weights, whereas Spearman’s ρ and Kendall’s τ are more directly informative for ordinal structure. Taken together, these metrics allow the evaluation framework to reflect both numerical agreement in weighting structure and ordinal agreement in ranking relations. In the present study, the main comparative conclusions are directionally consistent across all three metrics. The comparisons are based primarily on correlation metrics and descriptive statistics, consistent with the descriptive-comparative design of the evaluation framework. Since each model–strategy condition contributed a single final output rather than a distribution of repeated within-model trials, formal significance testing and interval estimation were not incorporated into the present analysis.
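The three metrics can be computed on a pair of weight vectors as sketched below; `scipy.stats` provides equivalent functions (`pearsonr`, `spearmanr`, `kendalltau`), but the sketch stays dependency-free. The expert and model weight vectors are hypothetical examples, and the tie-free ranking logic assumes all weights are distinct, as in the illustrative data.

```python
# Self-contained agreement metrics on weight vectors (no ties assumed).
from statistics import mean

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    """Rank weights in descending order (1 = most important)."""
    order = sorted(range(len(x)), key=lambda i: -x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    # Spearman's rho is Pearson's R applied to the rank vectors.
    return pearson_r(ranks(x), ranks(y))

def kendall_tau(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (rx[i] - rx[j]) * (ry[i] - ry[j])
            concordant += s > 0
            discordant += s < 0
    return (concordant - discordant) / (n * (n - 1) / 2)

expert = [0.30, 0.25, 0.20, 0.15, 0.10]  # hypothetical baseline weights
model = [0.28, 0.22, 0.24, 0.14, 0.12]   # hypothetical LLM-derived weights
```

On this example, the model swaps the ranks of the second and third criteria, so R remains high while τ drops below 1, illustrating why the numerical and ordinal views are reported together.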
(2) Output performance:
For method-level performance comparison, the results obtained from the tested LLMs under the same strategy were aggregated, with higher average values indicating stronger overall agreement with the expert baseline. Except for the MDS strategy, which involved iterative multi-round interaction, each model generated a single result under each strategy. Therefore, the reported mean values represent aggregation across models within the same strategy rather than averages over repeated runs of an individual model. To identify the best-performing evaluation model within each method, we applied a maximum-correlation criterion, selecting the LLM that achieved the highest correlation value among all tested models. This dual approach ensures a balanced evaluation of strategy-level comparative performance and individual model capability.
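The two aggregation views can be expressed in a few lines. The per-model correlation values below are invented for illustration and are not results from the study.

```python
# Sketch of strategy-level aggregation and the maximum-correlation criterion.

# Hypothetical correlations with the expert baseline for one strategy.
model_correlations = {
    "ChatGPT": 0.91,
    "Gemini": 0.87,
    "Baichuan": 0.84,
    "Kimi": 0.88,
    "ChatGLM": 0.86,
}

# Strategy-level view: mean across the five models.
strategy_mean = sum(model_correlations.values()) / len(model_correlations)

# Model-level view: the best-performing model under this strategy.
best_model = max(model_correlations, key=model_correlations.get)
```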