Exploring Cross-Debate Between LLMs to Improve the Forecasting of Financial Market Indicators

Chang, Shuchih Ernest; Chung, Kai-Chun

doi:10.3390/math14081393

by Shuchih Ernest Chang * and Kai-Chun Chung *

Graduate Institute of Technology Management, National Chung Hsing University, Taichung 40227, Taiwan

* Authors to whom correspondence should be addressed.

Mathematics 2026, 14(8), 1393; https://doi.org/10.3390/math14081393

Submission received: 20 March 2026 / Revised: 16 April 2026 / Accepted: 17 April 2026 / Published: 21 April 2026

(This article belongs to the Special Issue Artificial Intelligence Techniques in the Financial Services Industry)

Abstract

In the context of political and financial market turmoil, effectively forecasting financial market trends is crucial for investment decisions. Large language models (LLMs) have been applied in extant research to predict market trends, analyze investor sentiment, and interpret financial news, all with the aim of supporting investment decision making. However, LLMs face limitations due to training data heterogeneity, which restricts multidimensional perspectives and hinders comparative analysis for optimization. This study proposes a “Dual-Agent LLM Debate Mechanism” framework using a Proponent (LLM1: Gemini 3 Pro) and an Opponent (LLM2: ChatGPT 5.2) to address single-LLM forecasting gaps: the Proponent generates a baseline forecast (F1) from an Integrated Context, while the Opponent validates and resolves conflicts with the Proponent via up to three rounds of cross-debate to produce a consensus forecast (F2). A controlled experiment analyzing 75 financial market indicators (FMIs) across five asset categories revealed that F2 outperforms F1 in accuracy and directional stability, particularly in highly volatile assets such as Cryptocurrencies and 10-Year Government Bonds. Paired-sample t-tests confirmed statistical significance, validating the mechanism’s effectiveness. Our results demonstrate how cross-debate between LLMs enhances forecasting accuracy through structured optimization.

Keywords: large language models (LLMs); cross-debate between LLMs; LLM applications; financial market forecast; political and economic statements

MSC: 68T20; 68T37; 68T50; 91B82

1. Introduction

This study explores innovative methods for large language model (LLM) forecasting and is built on four critical elements. First, because the application of LLMs to financial forecasting is a current focal point, introducing novel elements into this field has become a vital research trend. Second, we investigate whether a single LLM can overcome its limitations by engaging in a debate with another LLM to reach a consensus, thereby combining the reasoning of two models. Third, to achieve a macroscopic and international perspective, this research analyzes 75 global financial market indicators (FMIs) across five major asset categories. Finally, by introducing High-Impact Political and Economic Statements (HIPESs) and examining the intersection of these four elements, this study draws conclusions relevant to financial market forecasting. Traditional econometric models, which rely on historical numerical data, are limited in representing non-linear market dynamics or causal associations within unstructured political narratives. Generative AI with LLMs demonstrates transformative capabilities through semantic analysis of unstructured textual data [1], thereby enabling direct assessment of political and economic sentiment [2]. Extant research primarily adopts an end-to-end direct forecasting approach (hereafter F1), which automatically produces quantitative results from unprocessed text without disclosing inference mechanisms. 
To fill this gap, this study proposes a key hypothesis: by leveraging the semantic understanding and logical reasoning capabilities of LLMs, a Dual-Agent LLM Debate Mechanism (DALDM) that simulates collisions of expert opinion can enhance forecast accuracy through the Triple-C debate mechanism: (1) Cross-Validation, (2) Cross-Debate, and (3) Consensus Building. This mechanism aims to enable mutual error correction and adjustment between the two LLMs, thereby improving forecast performance. This study departs from the traditional use of a single LLM as the sole benchmark and proposes a debate-optimized forecasting framework (i.e., the Debate Forecast model, F2). The design empirically explores whether a Consensus Value accepted by both LLMs after multi-round debate (the F2 Consensus Value) predicts the closing prices of FMIs on specific future dates more accurately and robustly than the single-benchmark model (F1). Extant studies often feed financially relevant raw transcripts (e.g., HIPESs) into LLMs to predict the values of critical FMIs. Despite their capacity for semantic comprehension, LLMs may yield unreliable results due to inconsistent logical pathways or ‘hallucination’ in the absence of structured guidance. Such research has not integrated a “Multi-Agent LLM Debate Mechanism” that incorporates “Optimization Variables” into LLM frameworks.

In today’s globalized economy, HIPESs frequently induce sudden market fluctuations, despite lacking direct ties to overt policy interventions [3]. However, forecasting accuracy with complex HIPESs remains underexplored, particularly regarding the influence of HIPESs on FMIs such as exchange rates and Cryptocurrencies. The opacity of existing end-to-end approaches limits their practical applicability and causal transparency. Our study establishes Proponent (LLM1) and Opponent (LLM2) roles to drive iterative debates on political/economic statements, replacing passive text reading with active forecasting. By enabling multi-perspective analysis akin to expert meetings, it aims to ground LLM forecasts in rigorous logic rather than single-view intuition [4]. Expanding on recent research [5], our current study upgrades the “Optimization Variables” from an “Academic Theories” approach to a “Dual-Agent Cross-Debate Mechanism”. By leveraging perspective collisions to test the limits of LLM reasoning, this study validates the correction utility within various asset categories, advancing the current research paradigm.

This study addresses the research gap by introducing the “Multi-Agent LLM Debate Mechanism” to enhance LLM performance in forecasting FMIs, particularly in complex financial market and political contexts. Unlike direct forecasting (F1) [5], which generates an F1 Forecast Value from a single LLM (LLM1) reading HIPES text, the “F2 Cross-debate Forecast” approach integrates Optimization Variables from dual-LLM debates to refine predictions. This process converges to an F2 Consensus Value, reducing divergence and improving reliability. In short, this research focuses on two key research questions (RQs):

RQ1: Can the “F2 Consensus Value” generated through the LLM Cross-Debate Mechanism—compared to the “F1 Forecast Value” derived from a single LLM reading of HIPES text—noticeably enhance accuracy and stability in forecasting FMIs?
RQ2: Do models F1 and F2 possess distinct advantages in forecasting FMIs across various asset categories (such as Stock Market Indices, Foreign Exchange Rates, Commodities, Cryptocurrencies, 10Y Government Bond Yields)?

Fundamentally positioned as an exploratory proof-of-concept (PoC), this research validates the novel conceptual framework of a “Dual-Agent Cross-Debate Mechanism”. Having showcased specific functional benefits, this mechanism lays a robust foundation, allowing future studies to explore broader applications or structural expansions.

2. Research Background

2.1. LLMs with Debate as Research Tools—LLM Consensus-Based Market Forecasting

Inspired by the research implications of textual information and LLM application use cases [6], this study adopts LLMs as market research tools that are capable of extracting insights from textual inputs. By identifying textual meaning, deriving the Optimization Variable, and incorporating the variable into each round of debate, this study aims to explore and enhance LLM-based forecasting of financial market trends across 75 FMIs. In so doing, this study further adopts cross-debate between LLMs to improve forecast performance (accuracy and stability) achieved through dual-LLM cross-debate consensus. Incorporating these extracted insights into the causality-driven “Optimization Variable” may help LLMs surpass state-of-the-art algorithms in generating precise causal inferences [7]. From a technical perspective, the “Chain-of-Thought” (CoT) approach (i.e., using intermediate reasoning steps to enable structured logical progression) [8], together with the zero-shot reasoning mechanism (e.g., using “Let’s think step by step” to facilitate a prompt-based strategy) [9], can extract structured causal relationships from raw textual data. Advanced LLMs may achieve emergent potential to help financial reasoning predict market price fluctuations [10].

The aforementioned potential advancements may facilitate the integration of neural networks with symbolic reasoning frameworks, enabling hybrid systems for complex problem-solving [11]. Our research fills this gap by establishing opposing roles between Proponent (LLM1) and Opponent (LLM2), with an iterative cross-debate process between LLMs. Each debate round generates and refines the Optimization Variable, which is then injected into the next round. This methodology reconfigures LLMs from intuitive, sentiment-based forecasting to consensus-driven logical reasoning. This study introduces “Cross-Debate Consensus Based Optimization Variables (CDCBOVs)”, which integrate HIPES texts and LLM-derived textual inferences into structured theoretical knowledge repositories. These repositories are later used by multiple LLMs to derive Optimization Variables to subsequently enhance forecasting accuracy through the DALDM framework, as compared to the use of raw textual inputs by only one LLM. Based on three theoretical foundations—including (1) the ‘Dialectical Inquiry’ theory for using reverse questioning to break through singular inference blind spots [12]; (2) the ‘Constructive Controversy’ framework for guiding LLM models from cross-argumentation convergence to synthesis [13]; and (3) the ‘Collaborative Argumentation’ theory for negotiating potential agreement between LLMs [14]—this study uses cross-debate between LLMs to continue post-forecast validation and achieve consensus for potential optimization. Moreover, integrating the multi-agent LLM debate mechanism into AI reasoning has become the latest trend in academic applications, and placing multiple LLMs in a zero-shot adversarial debate scenario can stimulate debugging and convergence capabilities far exceeding those of a single LLM model [15].

The absence of causal structures in data-driven models limits their capacity to enhance LLM forecasting efficacy. This study therefore adopts the Theory-Guided Data Science (TGDS) paradigm [4], which emphasizes the role of scientific consistency in developing generalizable models. It also treats textual information as a valuable complement to structured data [16] and leverages LLMs as AI-powered instruments for cognitive automation in financial market research [6].

2.2. High-Impact Political and Economic Statements, Their Market Impacts, and Data Structure for Executing LLM Experiments

HIPESs made by influential leaders (e.g., decision-makers in terms of major issues, important state figures) during public speeches are key contributors to global financial market volatility [3]. Public narratives propagate swiftly, influencing economic and market behaviors alongside market confidence [17]. Exemplified by the Economic Policy Uncertainty (EPU) index [18], quantitative evidence shows that political discourse and policy deliberations exert direct influence on investment and output. Central bank communication can be used as a feasible tool of monetary policy due to the essential pricing influence possessed by HIPESs [19]. Furthermore, tweets from the U.S. president regarding the Federal Reserve exert substantial influence on financial markets, demonstrating that an individual actor’s communication can disrupt monetary policy and trigger market volatility [20]. International geopolitical risks are often associated with “threat” narratives, intensifying risk aversion and influencing global financial market activities [21]. National risk perceptions propagate globally through multinational corporate communications [22]. Political risks directly impact enterprises, negatively impacting capital expenditures and hiring activities [23]. Regulatory narratives on climate policies are recognized as key pricing signals, with firms’ responsiveness evident in stock and option pricing [24].

To analyze the effects of verbal communication, a specialized vocabulary framework was constructed to associate linguistic sentiment with financial market valuations [25]. Machine learning models may exhibit superior predictive accuracy compared to conventional linear frameworks in asset price forecasting, offering a technical foundation for analyses grounded in LLMs [26]. By analyzing economically significant textual elements, domain-specific LLMs may help improve forecast accuracy [27,28]. This research highlights the significance of publicly released HIPESs as a key resource for examining their market impacts. While political and economic statements exert substantial market influence, particularly due to their textual content containing essential indicators for forecasting market trends, financial text analysis has evolved from traditional “Frequency Statistics” to contemporary “Semantic Reasoning” approaches. Early methodologies depended on financial sentiment lexicons [25], transforming unstructured textual data into a “Bag of Words” framework for sentiment aggregation. However, that approach overlooks the current shift toward operating within the “big data” paradigm [29]. Traditional econometric approaches face challenges in modeling nonlinear associations and extracting nuanced semantic insights from unstructured textual data (e.g., intricate policy nuances and equivocal rhetorical expressions), which constrains predictive precision. The emergence of generative AI has fundamentally transformed predictive analytics in financial markets. LLMs, represented by ChatGPT, have demonstrated the capacity to surpass traditional statistical methodologies, enabling cognitive and inferential capabilities comparable to those of human experts [30]. This capability allows LLMs to analyze causal relationships and sentiment within HIPES texts. 
Incorporating unstructured textual sentiment information has been shown to substantially improve the accuracy of stock price forecasts [1,2]. Despite the noise in social media comments, actionable insights can be derived. HIPESs are anticipated to influence asset prices substantially. This marks a transformative advancement in financial forecasting, moving beyond conventional frameworks. This study employs LLMs as the central experimental framework to leverage their potential for semantic reasoning, analytical capabilities, and predictive functions related to financial market trends.

The level of detail and novelty in the input data significantly influences LLM forecasting outcomes. This study focuses on HIPESs as primary source materials, excluding media-processed secondary information. The foundation rests on two core principles: addressing “Intermediary Bias” and preserving “Soft Information” integrity. Primarily, media frequently alter or selectively present information to align with readers’ preconceived notions, leading to biased reporting [31]. Employing curated news by journalists as input may lead to the analysis of filtered perspectives instead of original facts. Direct oversight of the “Data Generating Process” is recommended, necessitating direct access to decision-makers’ original statements to mitigate third-party interference [32]. Moreover, while objective data are easily transferable, subjective information (e.g., tone, hesitation, confidence levels) is inherently tied to the speaker’s identity and is susceptible to degradation during transmission [33]. Manager-specific vocal cues possess distinctive predictive value, offering insights beyond conventional financial reports [34]. Likewise, vocal inflections of national leaders implicitly convey the intensity of policy commitment. This study contends that only primary transcripts processed through LLMs can reliably capture these nuanced, market-impacting signals.

In narrative-driven forecasting within political and economic contexts, LLM-generated input data and timeline frameworks must align with dynamically evolving information contexts. Historical data’s predictive efficacy diminishes over time, thereby constraining its applicability to relatively old HIPESs. To capture short-term dynamics, this study omits extended temporal data, instead prioritizing speech transcripts spanning from “Experiment Day” to “30 days prior” as input data. A rolling window approach is implemented, based on structural break analysis. Historical data preceding the structural break contribute to estimation bias, thereby aligning our approach with the principle of “discard distant, prioritize recent” [35]. Adopting this approach can not only incorporate contemporary market dynamics into the model but also reduce data volume and lower Mean Squared Forecast Error (MSFE). In volatile markets, prioritizing real-time data precision over historical datasets is an optimal strategy. Rolling windows emphasize recent data, improving responsiveness to trend shifts in volatile time series [36]. This method is grounded in the weekly economic index (WEI), which possesses strong predictive power with high accuracy for tracking rapid economic developments associated with policy responses to serious economic events [37,38]. As such, our study employs the weekly time window to forecast 7 days ahead (T + 7) for simulating the post-digestion price adjustments. Moreover, we also adopt daily rolling updates to HIPESs to enable adaptive responsiveness to external shocks, thereby enhancing the reliability and validity of experimental results.
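The windowing described above can be illustrated with a minimal sketch. It assumes calendar days rather than trading days and treats both window endpoints as inclusive; the function names and the `(date, text)` transcript representation are hypothetical and are not taken from the study.

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 30  # transcripts from "30 days prior" up to Experiment Day (from the text)
HORIZON_DAYS = 7    # the forecast target is T + 7 (from the text)


def input_window(experiment_day: date) -> tuple[date, date]:
    """Return the (start, end) dates of the transcript window, both inclusive (assumed)."""
    return experiment_day - timedelta(days=LOOKBACK_DAYS), experiment_day


def target_day(experiment_day: date) -> date:
    """Return the calendar date whose closing price is being forecast."""
    return experiment_day + timedelta(days=HORIZON_DAYS)


def select_transcripts(transcripts: list[tuple[date, str]], experiment_day: date) -> list[str]:
    """Keep only the transcripts dated inside the rolling window."""
    start, end = input_window(experiment_day)
    return [text for day, text in transcripts if start <= day <= end]
```

Re-running `select_transcripts` with the next Experiment Day drops the oldest day and admits the newest one, which is one way to read the daily rolling update described above.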

Although deep learning models have shown remarkable effectiveness in financial forecasting, conventional machine learning approaches are frequently criticized as “black boxes,” which limits their interpretability and causal inference capabilities [26]. Data-centric methodologies may face challenges in addressing policy counterfactuals, as correlational associations do not imply causal relationships [39]. Given that structural shocks such as HIPESs reveal the limitations of models lacking theoretical frameworks, the TGDS paradigm was introduced [4], emphasizing the necessity of scientific rigor within machine learning architectures. Given that existing Natural Language Processing (NLP) applications in financial forecasting remain constrained to sentiment analysis, overlooking intricate logical dependencies, the enhanced causal reasoning capabilities of LLMs are underscored [7]. In this study, Optimization Variables are continuously generated through debate; nevertheless, a gap remains, as existing studies have yet to incorporate textual meaning identification and Optimization Variables into predictive models in a way that progressively enhances both generalization and causal interpretability.

2.3. Evaluation of Forecast Frameworks: Model F1 and Model F2

Leveraging CDCBOVs for assessing market trends through indicator-based analysis, this research establishes objective benchmarks and performance evaluation metrics for accuracy testing and comparative analysis against forecast models (F1 and F2). Closing-market prices of widely used market indicators can serve as the benchmark of “Realized Volatility” [40]. In our research, evaluating forecasting models necessitates assessing variations in loss functions, such as MAE, MSE, and QLIKE, to guarantee accurate forecast rankings while accounting for noisy volatility proxies [41]. To quantify CDCBOVs’ marginal contributions, our study uses out-of-sample evaluation metrics as a central benchmark [42]. These metrics quantify the reduction in predictive error compared to the benchmark, indicating performance improvements. This study adopts these metrics to validate that Cross-Debate optimized variables enhance predictive accuracy and convert statistical precision into trackable market outcomes [43]. Carefully designed experiments can demonstrate the substantial predictive value of CDCBOVs through numerical accuracy analyses, asset return rate statistics, and statistical evaluation of performance distribution.

Quantitative disparities among our forecasting models alone cannot establish statistical significance; comprehensive statistical verification is essential. In the evaluation of predictive accuracy, Diebold and Mariano developed the “DM test,” which is recognized as the benchmark for detecting statistically significant disparities in predictive error metrics [44]. Subsequently, Clark and West built on top of the “DM test” to introduce the “CW test” for nested models to target parameter volatility in large-scale systems [45]. Recent research on generative AI adhered to this framework, demonstrating LLMs’ predictive superiority over conventional models and verifying the link between AI ratings and financial returns through statistical correlation analyses [46,47]. This study employs the “Paired Sample t-test” to assess improvements in predictive accuracy resulting from the incorporation of CDCBOVs. By treating predictive error as paired observations, our study quantifies the discrepancy in the loss function between F1 and F2 across time points and assesses the statistical significance of the mean difference. This approach complies with the rigor of the DM test, stressing comprehensive verification of the Optimization Variable’s predictive efficacy.
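As a hedged illustration of the paired-sample comparison described above, the sketch below computes the t statistic for the per-observation loss differences d_i = loss_F1,i - loss_F2,i. The loss itself is left generic (for example, an absolute error per FMI), since the exact loss used for each pair is not specified in this section; the function name is hypothetical. Only the statistic and its degrees of freedom are returned; converting them to a p-value requires a Student t distribution, which the Python standard library does not provide.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(loss_f1: list[float], loss_f2: list[float]) -> tuple[float, int]:
    """Paired-sample t statistic for the mean of d_i = loss_f1[i] - loss_f2[i].

    Returns (t, degrees_of_freedom). A positive t means F2 has the lower average loss.
    """
    if len(loss_f1) != len(loss_f2) or len(loss_f1) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [a - b for a, b in zip(loss_f1, loss_f2)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

With 75 FMIs evaluated on one target day, the two lists would each hold 75 losses and the test would have 74 degrees of freedom.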

2.4. Recap of Research Method Elaboration

This study elaborates a forecasting initiative, the “LLM Cross-Debate Consensus Based Forecasting Initiative,” to transform unstructured text into influential factors with respect to FMIs. This initiative involves structuring the process across five distinct stages, outlined below:

Stage 1, purification of text input:
Stage 1 filters out secondary media sources, prioritizing primary source transcripts of HIPESs spanning the 30-day period preceding the forecasting process.
Stage 2, application of HIPESs:
This stage feeds HIPES transcripts directly into LLMs, enabling the generation of FMI forecasting values (F1).
Stage 3, debate-driven variable injection:
This stage incorporates Optimization Variables generated through cross-debate processes into LLMs. By embedding the HIPES extraction and the CDCBOV framework, consensus-based forecasting values are produced (F2).
Stage 4, dynamic forecasting and (T + 7) testing:
Stage 4 employs a rolling window mechanism to capture short-term market dynamics, and F1 and F2 are compared under the (T + 7) time window.
Stage 5, analysis of forecasting results:
This stage quantitatively assesses forecasting results against accuracy metrics, followed by statistical significance analyses to validate meaningful enhancements derived from the “LLM Cross-Debate Consensus Based Forecasting” initiative.

3. Research Method

3.1. Research Framework

3.1.1. Five Categories of Assets, with Each Category Comprising 15 FMIs

To evaluate cross-asset forecasting, this study utilizes 75 financial market indicators (FMIs), organized into five categories: Commodities, Stock Indices (Indices), Foreign Exchange Rates (Exchange Rates), Cryptocurrencies (Crypto), and 10-Year Government Bond Yields (10Y Yield), with each category consisting of 15 FMIs (see Table 1). The Realized Value (RV), based on the Close Market Prices/Indices retrieved from tradingeconomics.com, is utilized for subsequent analyses. Using the retrieved data, the framework evaluates accuracy with the Forecast Estimate Index (FEI) series, performance with Diff. values, credibility with paired t-tests, and volatility across the 75 FMIs, which together serve as the foundation for the empirical analysis.

3.1.2. The Dual-Path Comparative Experiment

The proposed experimental framework evaluates whether a Dual-Agent LLM Debate Mechanism (DALDM), which dynamically generates the Optimization Variable and ultimately yields a consensus forecast, enhances LLMs’ forecasting accuracy and performance. A controlled experiment compares the initial forecast (F1) generated by a single model, as the control group, against the consensus forecast (F2) derived from iterative cross-debates, as the experimental group. As shown in Figure 1, the procedure consists of 5 steps, detailed as follows.

Input Data Step:
For each forecasting cycle, the system synthesizes a 30-day news corpus (representing HIPESs) to generate an Integrated Context, which is then fed into the LLM. This ensures that both models (F1 and F2) operate under identical input conditions.
Market (i.e., Financial Market) Forecast Step:
F1 is used to forecast the closing prices of 75 FMIs. F1 is dependent exclusively on the Integrated Context for forecasting.
Debate Forecast Step:
By leveraging a Dual-Agent LLM Debate process that progresses sequentially through Triple-C (Cross-Validation, Cross-Debate, and Consensus Building) execution stages, F2 dynamically generates the Optimization Variable to ensure logical robustness and ultimately produce a Consensus Value.
Realized Value (RV) Step:
On the target day, the realized closing prices of the 75 FMIs are collected and designated as the RV.
Compare Result Step:
Forecast accuracy and performance are assessed via FEI series, Diff. metrics, paired t-tests, and volatility measures, creating a benchmarking framework rooted in 75 FMIs. This evaluation assesses whether F2 demonstrates statistically significant superiority over F1 and identifies broader implications to enhance the research’s practical utility.

3.1.3. Independent Inference Processes Guided by 9 Prompts

As depicted in Figure 2, we propose a ‘Multi-Stage Debate Pipeline’ consisting of 9 isolated chat boxes and 9 dedicated prompts. This design utilizes ‘Segmented Inference’ to prevent the attention drift common in prolonged single-thread conversations, thereby ensuring the (F1) and (F2) generation processes are rigorous and mutually exclusive. In practice, the system employs strict ‘One-to-One Mapping’ and ‘Model Alternating Assignment’: Prompt (n) strictly corresponds to Chat Box (n). Furthermore, odd-numbered sequences (1, 3, 5, 7, 9) are driven by LLM(1), while even-numbered sequences (2, 4, 6, 8) are driven by LLM(2). This alternating design ensures that each stage of the debate is executed with high precision in an independent, controlled environment.
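The one-to-one mapping and the alternating assignment can be written down in a few lines. This is a minimal sketch: the labels `"LLM1"` and `"LLM2"` and the names `model_for_prompt` and `PIPELINE` are illustrative and do not come from the study.

```python
def model_for_prompt(n: int) -> str:
    """Map prompt/chat-box index n (1..9) to the model that drives it."""
    if not 1 <= n <= 9:
        raise ValueError("prompt index must be between 1 and 9")
    # Odd-numbered steps go to LLM(1), even-numbered steps to LLM(2).
    return "LLM1" if n % 2 == 1 else "LLM2"


# One row per step: (prompt index, chat box label, driving model).
PIPELINE = [(n, f"Chat Box {n}", model_for_prompt(n)) for n in range(1, 10)]
```

Under this rule LLM(1) drives five of the nine steps, including the first (F1 generation) and the last (consensus building).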

3.1.4. Market Forecast Framework

As shown in Figure 3, the system performs a critical ‘Semantic-to-Theory Mapping’ led by the Proponent (LLM1) using Prompt (1) in Chat Box 1. Here, LLM1 digests massive unstructured market data (Integrated Context) to generate numerical forecasts. The results are strictly defined as (F1) and formatted as [FMIi]:[F1i]. Displayed at the bottom of the figure, this matrix covers 75 key FMIs across five asset categories, serving as the baseline anchor for the subsequent debate rounds.
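To show how output in the [FMIi]:[F1i] format might be consumed downstream, the sketch below parses one `name: value` pair per line into a mapping. It assumes that the square brackets are placeholders rather than literal characters and that values may contain thousands separators; both points, the function name, and the sample values used in the usage note are assumptions, not details from the study.

```python
def parse_forecasts(raw: str) -> dict[str, float]:
    """Parse lines of the form 'NAME: value' into a {name: value} mapping."""
    forecasts: dict[str, float] = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        # Split on the last colon so that the numeric value is always the right-hand part.
        name, sep, value = line.rpartition(":")
        if not sep or not name.strip():
            raise ValueError(f"malformed forecast line: {line!r}")
        forecasts[name.strip()] = float(value.replace(",", ""))
    return forecasts
```

For example, `parse_forecasts("BTC: 67,250.5\nEUR/USD: 1.08")` yields a two-entry mapping keyed by indicator name; the numbers here are made up for illustration only.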

3.1.5. Debate Process Framework

This subsection integrates the workflow framework of Figure 4 and the operational principles of Figure 5 to elaborate on the core mechanism of the multi-agent debate system. This study establishes a rigorous automated decision-making process through standardized “role definition,” “path branching,” and a “three-stage closed-loop” mechanism, aiming to transform initial Forecast Values (F1) into highly reliable final decisions (Consensus Values (F2)) via iterative logical debates. Figure 5 further demonstrates the decision criteria and conditional logic of the mechanism (using the financial market indicator BTC as an example).

As shown in Figure 4 and Figure 5, this experiment adopts a “heterogeneous model collaboration” strategy to simulate real-world decision-making debates. In the debate, 2 LLMs are assigned opposite roles (Proponent and Opponent).

Proponent (LLM1/Gemini 3 Pro):
After F1 has been generated, the Proponent is responsible for “defending and correcting.” Its decision process begins by reviewing the Opponent’s status tag [ST]. If the tag is “Disagree,” the Proponent analyzes the confidence score [SC] and rebuttal [RE], then reads the revised value [RN] to anchor a reasonable numerical range. Based on causal analysis, it adjusts the value if the Opponent’s objection is valid or reinforces its original forecast if the critique is deemed invalid.
Opponent (LLM2/ChatGPT 5.2):
Tasked with “auditing and challenging,” it conducts a strict Logic Audit of the Proponent’s input (initial [F1] or revised [RN]). Using the Integrated Context and internal theoretical cognition, it calculates [SC] to define [ST]: if [SC] < 0.7, it marks the status as “Disagree,” providing [RE] and generating [RN] to challenge the Proponent’s argument; if [SC] meets the threshold, it marks the status as “Consensus” to terminate the debate.

As shown in Figure 4 and Figure 5, the debate between the 2 LLMs follows one of two paths, the Disagree and Agree mechanisms, described as follows.

Disagree Mechanism (Red Path):
If any of the 75 FMIs has a score below 0.7 ([SC] < 0.7), the system triggers the red path, forcing a new round of debate for unmet indicators to generate [RN] correction values, or forcibly entering the convergence phase if the round limit is reached.
Agree Mechanism (Green Path):
Only when all 75 indicators meet [SC] ≧ 0.7 and are marked as [ST] = Agreed will the system trigger the green path, initiating Global Early Termination to skip remaining debate rounds and directly route the batch of Forecast Values to the Consensus-Building stage.
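The two paths can be summarized as a small routing function. This is a sketch under stated assumptions: `round_no` is taken to be the number of completed debate rounds (0 right after Cross-Validation), the passing tag is written as `"Agreed"`, and the step labels `"next_debate_round"` and `"consensus_building"` are hypothetical names. Only the 0.7 threshold and the three-round limit come from the text.

```python
THRESHOLD = 0.7  # confidence cut-off stated in the text


def status_tag(sc: float) -> str:
    """Per-indicator status: 'Disagree' below the threshold, otherwise 'Agreed'."""
    return "Disagree" if sc < THRESHOLD else "Agreed"


def route(scores: dict[str, float], round_no: int, max_rounds: int = 3) -> tuple[str, list[str]]:
    """Return (next_step, unmet), where unmet lists the indicators with SC < 0.7."""
    unmet = [name for name, sc in scores.items() if sc < THRESHOLD]
    if not unmet:
        return "consensus_building", []     # green path: Global Early Termination
    if round_no >= max_rounds:
        return "consensus_building", unmet  # red path, round limit reached: forced convergence
    return "next_debate_round", unmet       # red path: debate the unmet indicators again
```

Note that the green path is a batch-level condition: a single indicator below 0.7 is enough to keep the debate open for that indicator.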

As shown in Figure 4, the debate process is sequentially divided into three stages, forming a complete closed-loop mechanism from questioning to contention to convergence:

The Cross-Validation Stage:
Initiated by Chat Box 2, LLM(2) receives the initial Forecast F1 and Integrated Context, executing a strict Logic Consistency Check. The system reviews each of the 75 FMIs individually. If logical inconsistencies or insufficient confidence (SC < 0.7) are detected, it marks the case as Disagree, generates the first correction value RN1, and forces the debate to commence. Only when all 75 FMIs meet [SC] ≧ 0.7 will the system trigger Global Early Termination, skipping remaining rounds.
The Cross-Debate Stage:
If the previous stage triggers Disagree, the system initiates a maximum of three rounds of iterative debate. LLM(1) (Proponent) and LLM(2) (Opponent) engage in alternating attacks and defenses (Chat Box 3~Chat Box 8). The Dual-Input Mechanism ensures each round employs iterative input, compelling models to simultaneously read the Integrated Context and the Opponent’s previous output (Input Reference), thereby ensuring the debate focuses on numerical discrepancies. The Dual-Termination Logic follows the Global Indicator Status: if the Opponent persists in doubt, the debate is forced to conclude at the third round; only when all 75 FMIs meet [SC] ≧ 0.7 will the system trigger Global Early Termination, skipping remaining rounds.
The Consensus-Building Stage:
This stage marks the conclusion of the debate process, with LLM(1) in Chat Box 9 executing the final decision via the Conditional Input Strategy. The system first dynamically locks the input source (either Proponent Context or Opponent Context) based on the debate’s termination status. Subsequently, this study adopts the Dynamic Convergence Mechanism: for cases reaching Agreed consensus, the system directly adopts the validated numerical value; for Disagree cases with unresolved disputes, it calculates the arithmetic average of RNs to reconcile the final positions of both sides. Finally, the system removes intermediate parameters like RE and SC, packaging the converged value as F2 with a consensus status (State 8), completing the full logical loop from prediction to decision. The finalized Consensus Values (F2) are output after optimization.
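The Dynamic Convergence Mechanism can be sketched as follows. This is a minimal illustration under one assumption that the text does not spell out: for a Disagree case, the "arithmetic average of RNs" is taken over the final revised number of each side. The function and parameter names are hypothetical.

```python
def converge(status: str, validated_value: float,
             proponent_rn: float, opponent_rn: float) -> float:
    """Return the F2 value for one indicator.

    Agreed   -> adopt the validated value directly.
    Disagree -> average the two sides' final revised numbers
                (assumption: one final RN per side).
    """
    if status == "Agreed":
        return validated_value
    return (proponent_rn + opponent_rn) / 2.0
```

For example, with a Proponent value of 120,000 and an Opponent value of 118,000 that remain unresolved, the sketch returns their midpoint, 119,000.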

3.1.6. Specifications of Prompts (Prompt 1~Prompt 9)

This research specifies nine core prompts (see Table 2) to secure consistent input parameters and replicable outcomes. Prompt (1) serves as the “75 FMIs value Generator,” converting structured “Integrated Context” text into formatted “[FMIi]:[F1i]” pairs for F1. The prompt specifications are divided into two functional modules according to the nature of the forecast.

Prompt (1) for Direct Forecast (F1):
Panel A’s Prompt (1) specializes in executing Semantic-to-Forecast Mapping, corresponding to the Market Forecast Stage in Figure 3, which converts unstructured market information into an initial forecast (F1).
Prompts (2)~(9) for Debate Forecast (F2):
Panel B’s Prompts (2)~(9) constitute the Adversarial Debate Calibrator, aligning with the three-stage process in Figure 4: Prompt (2) initiates the Cross-Validation stage for risk detection, Prompts (3)~(8) drive the Cross-Debate stage through iterative counterarguments, and Prompt (9) executes the Consensus-Building stage for final convergence. This module refines the initial forecast (F1) into a Robust Consensus Value (F2) via confidence score (SC) and iterative mechanisms.

More supplemental information for Table 2, Figure 4 and Figure 5 is available in Appendix A. By clearly distinguishing the boundaries between Initialization and Calibration, this study establishes the functional independence and reproducibility of prompts across stages.

3.1.7. Input–Output Data Flow Matrix

Building upon the prompt function architecture described in the previous subsection, this subsection further elaborates the internal data flow mechanism of the system (see Table 3). To ensure reproducibility and traceability of the experimental process and decision path, Table 3 explicitly defines the input configuration and output variables for the agent of each stage. The system establishes a rigorous Variable Inheritance Chain, where the output variable of each round in Panel B’s debate loop (e.g., RN1) is forcibly set as the input data for the subsequent round, forming a continuous logical relay. Additionally, the matrix lists the corresponding Output File Name, confirming that the data evolution from the initial forecast (F1) to the final consensus (F2) is fully traceable.

3.1.8. Output Formats and Input References

Building on the previous subsection, this subsection presents detailed structured output specifications in Table 4 to ensure the accuracy and logical consistency of the multi-agent system during iterative debates. Beyond describing the data flow, this study strictly regulates the “Read-Write Logic” for agents in each round: agents must read the output of the previous round as the Input Reference and encapsulate their responses according to a mandatory Format Schema. This approach ensures that every correction from the initial forecast (F1) to the final consensus (F2) follows a standardized syntax and remains fully traceable. Important symbols used in Table 4 are described as follows.

[i]: Financial Market Indicator (FMI) ID (1~75).
[F1]: Direct Forecast Value = Forecast Values (F1).
[SC]: Confidence score (0.0~1.0), quantifying the level of agreement with the Input Reference.
[ST]: Status (i.e., “Disagree” or “Consensus”).
[RE]: Reasoning. The qualitative text explaining the causal logic or theoretical constraints behind the revision and supporting the debate/rebuff.
[RN]: Revised Number. The new Forecast Value proposed in the current round ([RN(t)]).
[F2]: Debate Forecast Value = Consensus Values (F2).
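To make the schema concrete, the sketch below parses one record laid out as [i]:[F1]/[SC]/[ST]/[RE]/[RN], the bracketed layout this paper uses for Opponent Context (1). The exact serialization in the authors' output files is not given, so the regular expression, the class name `OpponentRecord`, and the indicator id in the usage note are assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class OpponentRecord:
    fmi_id: int      # [i]
    f1: float        # [F1]
    sc: float        # [SC]
    st: str          # [ST]
    reasoning: str   # [RE]
    rn: float        # [RN]


# Six bracketed fields: the id, a colon, then five slash-separated fields.
_PATTERN = re.compile(
    r"^\[(\d+)\]:\[([^\]]*)\]/\[([^\]]*)\]/\[([^\]]*)\]/\[([^\]]*)\]/\[([^\]]*)\]$"
)


def _num(text: str) -> float:
    """Convert '125,000' or '0.60' to a float."""
    return float(text.replace(",", ""))


def parse_record(line: str) -> OpponentRecord:
    """Parse one record; raise ValueError if the layout does not match."""
    m = _PATTERN.match(line.strip())
    if m is None:
        raise ValueError(f"malformed record: {line!r}")
    i, f1, sc, st, reasoning, rn = m.groups()
    return OpponentRecord(int(i), _num(f1), _num(sc), st, reasoning, _num(rn))
```

Calling `parse_record("[61]:[125,000]/[0.60]/[Disagree]/[Reason: xxxx]/[118,000]")` (the id 61 is made up) yields a record with `sc` of 0.6 and `rn` of 118000.0. The sketch would fail on reasoning text that itself contains a closing bracket, which a production parser would need to handle.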

3.1.9. Forecast Horizons and Rolling Validation Framework

This subsection compares the performance between F1 and F2 across varied forecasting timeframes within Test (T + 7) to assess forecasting robustness (see Figure 6). Test (T + 7) adopts a 1-day rolling validation mechanism, where a 30-day historical lookback period serves as the Integrated Context of input for predicting varying time distances.

Test (T + 7) (Forecast horizon spanning 7 days):
This test simulates short-term forecasting scenarios by operating on a baseline date (T) to forecast market values at T + 7 (e.g., 11/25 from 11/18). This test also evaluates LLM sensitivity to new market inputs and compares the forecasting accuracy between F1 and F2 under short-term conditions.
Method for Rolling Validation:
A seven-day replication framework, which generates 14 forecast datasets, mitigates date-specific biases and supports effective quantitative evaluation.

3.2. Data Collection Instruments and Techniques

This section details the data coverage, encompassing the news sources, the 75 FMIs, and the approach for obtaining the FMI closed-market values (i.e., Realized Value, RV). Data aggregation, optimization variable transformation and generation, market forecasting, and LLM configuration details are outlined in the following subsections.

3.2.1. Rolling Window-Based Data Aggregation

The transformation process from ‘Single News Unit’ to ‘Integrated Context’ is illustrated in Figure 7.

The Process of Single News Unit:
Core textual data are sourced from HIPESs made by international leaders, systematically extracted from Rev.com’s verbatim transcripts, and then standardized into .txt files containing (News Date), (News Title), and (News Transcripts). These serve as foundational units for aggregation.
The Process of Integrated Context:
This process aggregates 30 days of “Single News Unit texts” (T-30 to T-1) to capture cumulative semantic context, which is then merged into a single long-text file, forming the complete input basis for LLM inference.
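The window arithmetic behind the Integrated Context can be sketched as below. The 30-day lookback (T-30 to T-1) and the 7-day horizon come from the text; the function names, the dictionary layout of the news units, and the separator used when merging are assumptions.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

LOOKBACK_DAYS = 30   # T-30 .. T-1, as stated in the text
HORIZON_DAYS = 7     # Test (T + 7)


def window(baseline: date) -> Tuple[date, date, date]:
    """Return (first context day, last context day, forecast target day)."""
    start = baseline - timedelta(days=LOOKBACK_DAYS)
    end = baseline - timedelta(days=1)
    target = baseline + timedelta(days=HORIZON_DAYS)
    return start, end, target


def integrated_context(units: Dict[date, List[str]], baseline: date) -> str:
    """Merge all Single News Unit texts dated T-30..T-1 into one long text.

    units maps a news date to the texts published that day (hypothetical layout).
    """
    start, end, _ = window(baseline)
    parts: List[str] = []
    day = start
    while day <= end:
        parts.extend(units.get(day, []))
        day += timedelta(days=1)
    return "\n\n".join(parts)
```

Taking the 11/18 baseline from the text and assuming the year 2025 purely for illustration, `window(date(2025, 11, 18))` gives a context span of 19 October to 17 November and a target date of 25 November. News dated on the baseline day itself is excluded.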

3.2.2. Mechanism of Market Forecast (F1)

Figure 8 illustrates the mechanism for generating the “Market Forecast Values” of F1 in the Market Forecast step of the experimental process. This study utilizes an LLM as the semantic processing engine. “Integrated Context” and “Prompt (1)” are input into Chat Box 1 of LLM(1), allowing LLM(1) to automatically extract meaning from news articles and generate Forecast Values for the closing prices of the 75 FMIs on a specific day.

3.2.3. Mechanism of Debate Process Forecast (F2)

This subsection elaborates on the actual operational mechanism of the multi-agent debate system. Based on the data flow matrix in Table 3 and the output format specifications in Table 4, this study divides the prediction optimization process into Triple-C consecutive execution stages: Cross-Validation, Cross-Debate, and Consensus Building. As shown in Figure 9, Figure 10, Figure 11 and Figure 12, the system’s core lies in the strict “Logic Gate” mechanism controlling stage transitions. Specifically, the system uses the confidence score (SC) as a switch to initiate debate. Once the initial prediction (F1) score is found to be below standard (SC < 0.7), the system initiates the debate loop, requiring both sides to reference each other’s viewpoints for multiple rounds of forecast adjustments until the score meets the standard or consensus is reached (SC ≥ 0.7), ultimately converging to a robust decision value (F2). More information on Triple-C stages is detailed as follows.

The Cross-Validation Stage:
This stage, as illustrated in Figure 9 and guided by Table 3, initiates a Dual-Input Mechanism led by LLM(2). It injects the original Integrated Context and initial F1 Forecast Value from Market Forecast (1) to simulate an auditor’s review of existing predictions. LLM(2) validates 75 FMIs without re-predicting, generating a structured Opponent Context (1): [i]:[F1]/[SC1]/[ST1]/[RE1]/[RN1]. For example, BTC’s [125,000]/[0.60]/[Disagree]/[Reason: xxxx]/[118,000] shows that when SC1 (0.60) falls below the 0.7 threshold, the system flags [Disagree], proposes a revised value [RN1] (118,000), and triggers the red debate path (shown in Figure 4 and Figure 5) via the Logic Gate.
The Cross-Debate Stage:
LLM1: Proponent (Google Gemini 3 Pro)
This stage operates as a closed “Iterative Dialectic Loop” (see Figure 10 and Figure 11 and Table 3), where LLM(1) acts as the Proponent and LLM(2) acts as the Opponent, enforcing a Dual-Input Mechanism to reference static Integrated Context and dynamic Input Reference (created from the counterparty’s prior arguments). This ensures consistent factual grounding, driving predictive values toward consensus via logical iteration. LLM(1) conducts causal analysis to refine [RN] and [RE], encapsulating outputs as Proponent Context (1~3) for subsequent review.
LLM2: Opponent (OpenAI ChatGPT 5.2 Thinking)
The Opponent’s scenario (Figure 11) represents the review process in the even-numbered prompts (Table 3: P2, P4, P6, P8). Its input combines Integrated Context and Proponent Context (1~3), transforming LLM(2) into a rigorous reviewer. Prompts (P4, P6, and P8) enforce “Theoretical Constraints” to validate defense reasoning. The core logic uses the updated SC as a Logic Gate: if SC < 0.7, a new [RN] is generated and archived as Opponent Context (2~4) to trigger the next debate round; if SC ≥ 0.7, consensus is achieved, leading to the final Consensus-Building phase.
The Consensus-Building Stage:
This stage terminates iterative debate and establishes consensus. When confidence scores (SC ≥ 0.7) are met, the system activates convergence via a Conditional Input Strategy (Table 3), selecting the latest revised file (Latest RN) from the Opponent/Proponent Context (2~4/1~3). LLM(1) executes Threshold Verification and Convergence (Prompt 9) to finalize decision values, outputting [i]:[F2]/[ST8] (Final Consensus Status) as the Market Forecast (F2) result and completing the prediction-to-decision closed-loop.
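The Triple-C flow can be summarized as a small control loop. In the sketch below the two LLM chat sessions are replaced by caller-supplied functions, so `opponent`, `proponent`, and `run_debate` are hypothetical names rather than the study's implementation. Two simplifications are assumed: every round re-submits all indicators rather than only the unmet ones, and unresolved indicators are settled by averaging the two sides' latest values.

```python
from typing import Callable, Dict, Tuple

Values = Dict[int, float]   # FMI id -> forecast value
Scores = Dict[int, float]   # FMI id -> confidence score [SC]

# Hypothetical stand-ins for the two chat sessions.
Opponent = Callable[[Values], Tuple[Scores, Values]]   # returns ([SC], [RN])
Proponent = Callable[[Values, Values], Values]         # returns refined [RN]


def run_debate(f1: Values, opponent: Opponent, proponent: Proponent,
               threshold: float = 0.7, max_rounds: int = 3) -> Tuple[Values, int]:
    """Return (F2 values, number of debate rounds used)."""
    pro_rn = dict(f1)
    scores, opp_rn = opponent(pro_rn)          # Cross-Validation (Prompt 2)
    rounds = 0
    while rounds < max_rounds and any(s < threshold for s in scores.values()):
        pro_rn = proponent(pro_rn, opp_rn)     # Proponent turn (Prompts 3, 5, 7)
        scores, opp_rn = opponent(pro_rn)      # Opponent turn (Prompts 4, 6, 8)
        rounds += 1
    f2: Values = {}
    for i in f1:                               # Consensus-Building (Prompt 9)
        if scores[i] >= threshold:
            f2[i] = pro_rn[i]                  # Agreed: keep the validated value
        else:
            f2[i] = (pro_rn[i] + opp_rn[i]) / 2.0   # unresolved: average
    return f2, rounds
```

Passing each side's previous output into the next call reflects the Dual-Input Mechanism: the Proponent always sees the Opponent's latest revised numbers, and vice versa.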

In terms of LLM model configuration and experimental environment, this study employs a Heterogeneous Model Collaboration Architecture to ensure debate diversity and objectivity. By incorporating Google Gemini 3 Pro (LLM1) and OpenAI ChatGPT 5.2 Thinking (LLM2) via their official web-based GUIs, this study prioritizes generalizability and user-friendliness. All LLM hyperparameters were kept at system defaults with the same LLM versions remaining intact during the experimental timeframe. Both LLM1 and LLM2 utilize paid subscription tiers, specifically the Google AI Pro and ChatGPT Plus plans. The system achieves Proponent Construction and Opponent Review. LLM1 leverages its long-context processing capability to handle unstructured market data (Integrated Context) and debate history, while LLM2 uses Chain-of-Thought Reasoning to enforce theoretical constraints and conduct rigorous Critical Review. The experimental setup combines Gemini’s breadth with respect to information integration with ChatGPT’s depth in logical reasoning, creating an automated forecast system with both breadth and depth.

3.3. Analysis Methods

This study utilizes a multidimensional validation framework to assess the accuracy and resilience of F1 and F2. The framework encompasses the FEI metric to evaluate accuracy, Diff. values to compare performance, paired t-tests to validate credibility, and volatility analysis of the 75 FMIs, serving as the foundation for empirical investigation.

3.3.1. Mathematical Framework of the Forecast Estimate Index (FEI)

The FEI measures forecasting accuracy using a standardized Min-Max Ratio algorithm to ensure the FEI score falls within a range from 0 to 1. A score closer to 1 reflects greater accuracy, whereas 0 denotes substantial deviation. The formula assigns the smaller value as the numerator and the larger as the denominator, mitigating directional bias in cases of overestimation or underestimation.

Definition 1.

Forecast Estimate Index (FEI). Let P represent the forecasted value and RV the realized closing value of any one of the 75 FMIs on a specific day. FEI is then defined as:

\mathrm{FEI} = \frac{\min(P, \mathrm{RV})}{\max(P, \mathrm{RV})}

(1)

Score = 1: A perfect forecast accuracy (i.e., P = RV).

Score close to 1: A high forecast accuracy (i.e., P close to RV).

Score close to 0: A low forecast accuracy (P and RV differ significantly).
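Definition 1 translates directly into code. The sketch below adds one guard that is an assumption rather than part of the definition: it requires both values to be strictly positive, because the min/max ratio only stays within 0 to 1 when P and RV share the same sign and neither is zero.

```python
def fei(p: float, rv: float) -> float:
    """Forecast Estimate Index: min(P, RV) / max(P, RV).

    Assumption of this sketch: P and RV are strictly positive.
    For zero or negative inputs the ratio can leave the [0, 1] range.
    """
    if p <= 0 or rv <= 0:
        raise ValueError("this sketch assumes strictly positive P and RV")
    return min(p, rv) / max(p, rv)
```

Because the smaller value is always the numerator, a forecast of 90 against a realized 100 and a forecast of 100 against a realized 90 receive the same score, which is the symmetry the definition is designed to provide.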

As shown in Figure 13, the FEI calculation (1) is structured by this study into 3 hierarchical levels for macro-level analysis, including (Overall Average FEI, FEI_i), (Daily Average FEI, FEI_i,t), and (Daily Item-Level FEI, FEI_i,j,t). Important symbols used in subsequent FEI definitions and descriptions are presented as follows.

Model index i (i = 1 for F1, i = 2 for F2);

Asset index j (j = 1 to 75, for all 75 assets of the 75 FMIs);

Time index t (t = 1 to 7, for the 1st~7th day of the experiment).

Figure 13. Mathematical framework of FEIs.

1. Daily Item-Level FEI:

Definition 2.

(Daily Item-Level FEI, FEI_i,j,t). To measure the forecast accuracy of model i for an individual asset j (e.g., gold or crude oil), i.e., one of the 75 FMIs, on a specific date t, we define FEI_i,j,t as follows.

\mathrm{FEI}_{i,j,t} = \frac{\min(P_{i,j,t}, \mathrm{RV}_{j,t})}{\max(P_{i,j,t}, \mathrm{RV}_{j,t})}

(2)

2. Daily Average FEI:

Definition 3.

(Daily Average FEI, FEI_i,t). To aggregate the “Daily Item-Level FEI” values of all 75 FMIs on a given day, we define Daily Average FEI to provide a macro-level assessment of forecast accuracy across five global asset categories. Specifically, Daily Average FEI is defined as follows.

\mathrm{FEI}_{i,t} = \frac{1}{75} \sum_{j=1}^{75} \mathrm{FEI}_{i,j,t}

(3)

3. Overall Average FEI:

Definition 4.

(Overall Average FEI, FEI_i). This study defines Overall Average FEI as a core metric to validate whether F2 significantly outperforms F1 in the whole experimental period (e.g., Test (T + 7)’s 7-day rolling window). Overall Average FEI (FEI_i) is derived by averaging the “Daily Average FEI” over 7 consecutive days to represent the experiment’s long-term stability and overall accuracy. This study formally defines Overall Average FEI (FEI_i) as follows.

\overline{\mathrm{FEI}}_{i} = \frac{1}{7} \sum_{t=1}^{7} \mathrm{FEI}_{i,t}

(4)
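The two aggregation levels in Equations (3) and (4) are plain averages, which the following sketch makes explicit. It is written for any number of assets and days so that it can be checked on a small example; in the study the inner lists would hold 75 item-level scores and the outer list 7 days. The function names are illustrative.

```python
from statistics import mean
from typing import List


def daily_average(item_fei: List[float]) -> float:
    """Equation (3): mean of the item-level FEI scores for one day."""
    return mean(item_fei)


def overall_average(fei_by_day: List[List[float]]) -> float:
    """Equation (4): mean of the daily averages across all days.

    fei_by_day[t][j] holds the item-level FEI of asset j on day t
    for one model (F1 or F2).
    """
    return mean(daily_average(day) for day in fei_by_day)
```

Since every day covers the same number of assets, the overall average equals the mean of all item-level scores taken together.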

3.3.2. Item-Level Difference (Diff.) Framework

1. The Design of the Item-Level Difference Framework

Definition 5.

(Item-Level Difference, Diff.). The “Daily Item-Level FEI” values over seven days are averaged to evaluate the performance gap between F1 and F2 across the 75 FMIs. This study derives the difference (Diff.) by subtracting F2’s averaged FEI from F1’s, with 0 as the baseline. A greater absolute value of Diff. (|Diff.|) signifies a larger forecasting performance gap. A positive Diff. indicates superior performance for F1, and a negative Diff. indicates superior performance for F2.

\mathrm{Diff}_{j,t} = \overline{\mathrm{FEI}}_{F1,j,t} - \overline{\mathrm{FEI}}_{F2,j,t}

(5)

Diff. > 0: F1 outperforms F2; the farther from 0, the greater the advantage.

Diff. = 0: Equal performance; no clear superiority.

Diff. < 0: F2 outperforms F1; the farther from 0, the greater the advantage.

2. Quartile Analysis for Strength Distribution

Definition 6.

(Quartile Analysis). This study employs a multidimensional hierarchical mechanism to validate robustness. By using quartile thresholds (0%, 25%, 50%, 75%), this study categorizes the absolute values of Diff. (i.e., |Diff.|) into four strength intervals. This quartile distribution can help quantify and compare the forecast performance between F1 and F2 across 75 FMIs, emphasizing their relative advantages in forecasting 75 FMIs. As such, four quartile intervals are defined as follows.

Q_{k} = \mathrm{Quantile}_{k}(|\mathrm{Diff}_{j,t}|), \quad k \in \{0.25, 0.50, 0.75\}

(6)
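The Diff. calculation and the quartile bucketing of |Diff.| can be sketched together as follows. Several details are assumptions: the quantile interpolation rule (here the "inclusive" method of Python's `statistics.quantiles`), the assignment of values that fall exactly on a cut point to the higher tier, and the function names. The tier labels A (top quartile) to D (bottom quartile) follow the panel naming used later for Table 13.

```python
from statistics import mean, quantiles
from typing import Dict, List


def item_diff(fei_f1: List[float], fei_f2: List[float]) -> float:
    """Diff. for one FMI: mean item-level FEI of F1 minus that of F2."""
    return mean(fei_f1) - mean(fei_f2)


def intensity_panels(diffs: Dict[str, float]) -> Dict[str, str]:
    """Assign each FMI to a tier by the quartile of its |Diff.|."""
    abs_d = {name: abs(d) for name, d in diffs.items()}
    q1, q2, q3 = quantiles(list(abs_d.values()), n=4, method="inclusive")

    def tier(x: float) -> str:
        if x >= q3:
            return "A"   # top 75~100%: largest performance gaps
        if x >= q2:
            return "B"
        if x >= q1:
            return "C"
        return "D"       # bottom 0~25%: negligible gaps

    return {name: tier(x) for name, x in abs_d.items()}
```

The sign of Diff. identifies the winner (negative favors F2), while the tier depends only on the magnitude, so the two pieces of information can be cross-tabulated per asset category.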

3.3.3. Paired Sample Correlation and Paired Sample t-Test

SPSS 20 is used to calculate paired sample correlation coefficients, which assess the synchronization between F1 and F2 as a check of statistical robustness. A significant, high correlation confirms that the two models respond consistently to the market and that a paired comparison is reasonable. We subsequently conduct a paired sample t-test, in which a significant outcome (p < 0.05) indicates that the performance difference between F1 and F2 is unlikely to be due to random fluctuation.
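For readers without SPSS, the paired t statistic can be reproduced with the Python standard library. The sketch below computes only the statistic and its degrees of freedom; obtaining the p-value requires the t distribution, for which a statistics package such as SciPy (`scipy.stats.ttest_rel`) could be used instead. The function name and the sample numbers in the usage note are illustrative, not data from the study.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t(f1: Sequence[float], f2: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples.

    The difference is taken as F2 minus F1, so a positive t means
    F2 scored higher on average. Raises ZeroDivisionError if all
    paired differences are identical.
    """
    if len(f1) != len(f2) or len(f1) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    d = [b - a for a, b in zip(f1, f2)]
    n = len(d)
    t = mean(d) / (stdev(d) / sqrt(n))   # stdev uses the n - 1 denominator
    return t, n - 1
```

With four made-up FEI pairs, `paired_t([0.70, 0.75, 0.80, 0.78], [0.74, 0.78, 0.85, 0.80])` gives a t statistic of roughly 5.42 on 3 degrees of freedom.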

3.3.4. Market Characteristics and Volatility Stratification

This study uses Asset Volatility as a key metric to quantify “Forecast Difficulty,” with the standard deviations of daily returns for the 75 FMIs calculated and averaged. Subsequently, we apply the Quartile Analysis of Equation (6) to divide the results into four volatility quantiles (0%, 25%, 50%, 75%) to highlight performance disparities. This stratification helps clarify whether F2 demonstrates greater potential in managing high-volatility assets than in low-volatility environments.

Definition 7.

(Seven-day volatility, σ_{j}^{(7)}). For the purpose of assessing forecast difficulty, we calculate the 7-day volatility for each asset j as the standard deviation of the asset’s daily returns over a one-week period.

σ_{j}^{(7)} = \sqrt{\frac{1}{7} \sum_{t = 1}^{7} {(r_{j, t} - {\bar{r}}_{j})}^{2}}

(7)

σ_{j}^{(7)}: The 7-day volatility of asset j.

r_{j, t}: The daily return of asset j on day t.

{\bar{r}}_{j}: The average daily return of asset j over the 7-day period.
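Equation (7) uses the 1/7 (population) denominator, which the sketch below reproduces term by term. How daily returns are derived from closing prices is not specified in this section, so the helper `simple_returns` (percentage change between consecutive closes) is an assumption; log returns would be an equally plausible choice.

```python
from math import sqrt
from typing import List, Sequence


def simple_returns(closes: Sequence[float]) -> List[float]:
    """Assumed return definition: (close_t - close_{t-1}) / close_{t-1}."""
    return [(b - a) / a for a, b in zip(closes, closes[1:])]


def seven_day_volatility(returns: Sequence[float]) -> float:
    """Equation (7): population standard deviation of 7 daily returns."""
    n = len(returns)
    if n != 7:
        raise ValueError("Equation (7) is defined over exactly 7 daily returns")
    r_bar = sum(returns) / n
    return sqrt(sum((r - r_bar) ** 2 for r in returns) / n)
```

Under the simple-return assumption, eight consecutive closing prices are needed to obtain the seven returns that enter the formula.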

4. Experimental Results

4.1. Sample Description and Research Rigor

4.1.1. Time Series Used for Experiments

This research establishes the data architecture for “Integrated Context” and “Optimization Variable,” with Table 5 providing an overview of the experimental timeline, data collection periods, HIPESs and forecast horizons, forecast execution dates, verification closure dates (i.e., RV Date), as well as the daily news count, establishing a transparent, replicable experimental benchmark.

This study evaluates F1 and F2 through Short-Term Immediate Forecasting, named as Test (T + 7), focusing on short-term price volatility through a rolling window approach. The predictive model employs a 7-day forecasting window, utilizing closing prices from the RV Date for evaluation.

4.1.2. Input Data for Experiments

The input data for each experimental day consist of HIPESs (extracted from Rev.com’s verbatim transcripts) dated within the preceding 30 days. As shown in Table 5, the total number of news items used in each forecast experiment (i.e., the daily news count) ranges from 20 to 23.

4.1.3. Rules and Sources for Final Realized Value Acquisition

Indicators and Corresponding Values for Benchmarking:
The 75 FMIs are spread evenly across five asset categories, with 15 FMIs per category. The values of the 75 FMIs were obtained from the Trading Economics platform (tradingeconomics.com).
Rules of Closing Price:
Daily closing prices are used as the baseline. Non-trading days (weekends, holidays) apply a carry-forward rule, retaining the previous valid trading day’s closing price to preserve time series integrity.
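The carry-forward rule for non-trading days can be sketched as a backward search for the most recent available close. The dictionary layout, the function name, and the `max_back` safety limit are assumptions added for illustration.

```python
from datetime import date, timedelta
from typing import Dict


def realized_value(closes: Dict[date, float], day: date, max_back: int = 10) -> float:
    """Return the close for `day`, or the latest earlier close if `day`
    is a non-trading day (carry-forward rule).

    max_back is a hypothetical guard against searching indefinitely
    when the price history has a gap.
    """
    for k in range(max_back + 1):
        candidate = day - timedelta(days=k)
        if candidate in closes:
            return closes[candidate]
    raise KeyError(f"no closing price within {max_back} days before {day}")
```

For a target date that falls on a Sunday, the lookup returns the preceding Friday's close, provided no later trading day exists in the data.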

4.1.4. Seven-Day Dynamic of (F2) Debate Disagreements

The experiment (F2) optimizes 75-metric forecast accuracy via iterative debate, tracking “Disagree” counts to measure consensus convergence (see Table 6). The count drops sharply from 514 to 144, then to 108, and finally to 0 (i.e., full consensus), validating the “Convergence Funnel Effect” and demonstrating the dual-LLM debate mechanism’s efficient “Cognitive Filtering” capability. This non-linear decay trajectory shows the system’s ability to rapidly eliminate forecast uncertainty within limited iterations, avoiding divergent cycles.

4.1.5. Compositional Structure of (F2) Forecasts

Table 7 decomposes the final output of the F2 Debate Forecast model, revealing that daily 75-metric forecasts are hybrid vectors derived from three distinct decision paths. The table details the composition of daily prediction vectors, where the total (N_total = 75) equals the sum of three components. In addition, Table 7 presents three panels/groups (Panels A, B, and C) of forecast results, which are listed together with corresponding descriptions, as follows.

Agreed, LLM(1): F1’s initial forecast adopted without debate due to high confidence.

Debate, LLM(1): F1’s value retained after successful defense during debate.

Debate, LLM(2): F1’s initial judgment rejected, replaced by F2-debated revised values.

F1’s Stable Dominance (Agreed, LLM(1)):
LLM(1) dominates for certain Crypto indicators (e.g., on Days 1~7), indicating its potential applicability to Crypto. The system may avoid unnecessary debates on low-complexity indicators to save computational resources.
Dual-LLM Cognitive Complementarity:
Dual-LLM architecture mitigates risks of overfitting/hallucination during volatility (in all 7 days). LLM(2) dynamically corrects F1’s biases, enhancing decision robustness in extreme scenarios. This underscores the critical value and benefit provided by the proposed “F2 Debate Forecast” approach.
Dual-LLM Cognitive Specialization:
LLM1 (F1) excels in certain mean-reverting assets (e.g., 10Y Yield and Commodities) via intuitive reasoning. LLM2 (F2) corrects nonlinear market biases across all five categories of assets, balancing trend capture and volatility resilience.

4.2. Measurement Results: Analytical Assessment

This section comprehensively examines F2’s enhanced capabilities over F1 in forecasting precision and robustness via a comprehensive quantitative assessment of overall metrics, asset allocation patterns, and performance discrepancies (see Table 8, Table 9, Table 10, Table 11, Table 12 and Table 13).

4.2.1. FEI Comparison (Daily Average and Overall Average) Between F1 and F2

Table 8 presents the FEI comparison results, showing that F2 exhibits superiority in forecasting accuracy over F1 within every experimental scenario of the Test (T + 7) framework. More detailed analyses are outlined as follows.

Comprehensive Superiority:
In Test (T + 7), (Panel A) shows (F2) achieving higher (Daily Average FEI) than (F1) on all 7 days (Day 1~Day 7). (Panel B) reveals (F2)’s (Overall Average FEI) reaching 0.808 in Test (T + 7) (vs. 0.777 for (F1)), demonstrating that the Optimization Variable generated by the debate mechanism noticeably enhances the model’s ability to capture market dynamics. Analysis of the 75 FMIs’ (Overall Average FEI) further shows (F2)’s values align closer to actual observations (with FEI approaching 1), highlighting the substantive contribution of the LLM Cross-Debate Mechanism to quantitative forecasting.
Robustness and Stability:
Across the time windows listed in Table 8, F2 maintains an FEI above 0.791 on all 7 days, peaking at 0.816 on Day 2 in Test (T + 7). The consistently superior performance of F2 confirms the LLM Cross-Debate Mechanism’s robustness in enhancing forecasting accuracy across various time windows.
Alternative Forecasting Accuracy Metrics—MAE and MSE:
The widely used accuracy metrics MAE and MSE do not weight all errors equally: larger errors have a greater effect on MAE than smaller ones, and an even greater effect on MSE. Although FEI mitigates this uneven weighting, for comparison purposes this study also calculates MAE and MSE (from the same data underlying Table 8) to derive Table 9 and Table 10. The results in Table 8, Table 9 and Table 10 consistently demonstrate that F2 outperforms F1 in overall forecasting accuracy.
Supplementary Weekly Replications in Overall Average FEI Validation:
This research is grounded in the central empirical findings from Table 8 (Week 1), augmented by three supplementary weekly assessments to strengthen reliability and counteract the effects of time-dependent variability. The findings reveal that the F2 framework consistently outperforms F1 in Overall Average FEI in all four weekly experiments, with the comparison results as follows: Week 1 (F2’s 0.808 > F1’s 0.777), Week 2 (F2’s 0.807 > F1’s 0.776), Week 3 (F2’s 0.801 > F1’s 0.772), and Week 4 (F2’s 0.796 > F1’s 0.771). These weekly experimental outcomes offer robust evidence showing that the F2 framework provides consistent and substantial predictive superiority, coupled with systematic robustness.
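For reference, the MAE and MSE metrics mentioned above follow their standard definitions, sketched below. Whether the study rescales the values before computing Table 9 and Table 10 is not stated here, so the sketch assumes raw forecast and realized values; the function names are illustrative.

```python
from typing import Sequence


def mae(forecast: Sequence[float], realized: Sequence[float]) -> float:
    """Mean absolute error over paired forecast/realized values."""
    if len(forecast) != len(realized) or not forecast:
        raise ValueError("need two non-empty, equally long sequences")
    return sum(abs(p - rv) for p, rv in zip(forecast, realized)) / len(forecast)


def mse(forecast: Sequence[float], realized: Sequence[float]) -> float:
    """Mean squared error over paired forecast/realized values."""
    if len(forecast) != len(realized) or not forecast:
        raise ValueError("need two non-empty, equally long sequences")
    return sum((p - rv) ** 2 for p, rv in zip(forecast, realized)) / len(forecast)
```

Unlike FEI, both metrics depend on the price scale of the asset: an error of 10 counts the same whether the realized value is 100 or 100,000, whereas the min/max ratio in FEI treats those two cases very differently.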

4.2.2. Win Statistics (Daily Average) by Asset Category and FEI Intensity

This research assesses the 7-day average FEI values of 75 FMIs, and in Table 11, the higher FEI scores (F1 vs. F2) are designated as “wins” and the performance is analyzed across five asset categories. The analysis begins with overall win frequency (Panel A), followed by FEI intensity distributions (Panels B~E).

Frequency of Overall Wins (Panel A):
In Test (T + 7), F2 achieved 46 wins against F1’s 28 wins. This underscores F2’s superior forecasting performance, notably in Commodities, Exchange Rates, Cryptocurrencies, and 10Y Yield, where it exhibits dominance. F2 demonstrates relative advantages in these four asset categories, while F1 maintains competitive strengths in Indices.
High-Accuracy Range (Panel B):
F2 secured 31 wins, outperforming F1’s 26. However, a deeper analysis reveals F2’s wins are heavily concentrated in Exchange Rates and 10Y Yield. This confirms that F2’s superior performance in high-accuracy scenarios stems from the Optimization Variable enabled by the Debate Forecast mechanism, which enhances its effectiveness in specific financial assets.
Mid-to-High-Accuracy Range (Panel C):
F2 secures eight indicators, outperforming F1’s two. While the number of indicators in Panel C has noticeably decreased (from Panel B), they are distributed across all five asset categories. Within this range, F1’s performance is already sufficiently robust, while F2’s Debate Forecast mechanism provides marginal gains but does not establish a dominant advantage.
Mid-to-Low Range (Panel D) and Low-Accuracy Range (Panel E):
F1’s effective win count is zero, while F2 records five and two wins in Panels D and E, respectively. This indicates F2’s resilience in extreme FMI forecasts, compensating for F1’s lack of forecast resilience and accuracy.
Summary:
Table 11’s comprehensive analysis reveals F2’s structural breakthrough in “operational resilience” and “domain-specific advantages.” Through the Optimization Variable’s Debate Mechanism correction, forecast accuracy for over half of financial assets is noticeably enhanced. This confirms F2’s ability to maintain full interval stability while possessing targeted forecast advantages.

4.2.3. Correction Analysis: How (F2) Debate Enhances (F1) Forecasts

To validate the Dual-Agent LLM Debate Mechanism’s impact on forecast accuracy, Table 12 analyzes 514 cross-debated samples (from a total of 525) to isolate debate contributions. Samples are categorized as “Debate, LLM(1)” (reverting to initial forecasts) or “Debate, LLM(2)” (adopting revised forecasts). By cross-referencing with Daily Item-Level FEI, the study evaluates accuracy improvements post-debate. A “Debate Validity” framework assesses alignment between final decisions (F2) and optimal FEI models (F1/F2), classifying debates as Valid (effective) or Invalid (ineffective). More information from Table 12 is presented after the explanations of its relevant notations, which are detailed as follows.

Valid Debate:
- Debate, LLM(2), Valid: Adopted LLM(2) revisions; FEI validation confirms “accuracy > LLM(1)”.
- Debate, LLM(1), Valid: Retained LLM(1); FEI shows its performance ≥ LLM(2), blocking suboptimal proposals.
Invalid Debate:
- Debate, LLM(2), Invalid: Mistakenly adopted LLM(2); LLM(1) FEI was better than LLM(2)’s (over-correction).
- Debate, LLM(1), Invalid: Retained LLM(1); LLM(2) had higher accuracy but was mistakenly ignored.
Metrics:
- Total Valid: Absolute count of successful corrections/defenses via debate.
- Total Valid Ratio: Proportion of effective debates relative to total debates.
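The Debate Validity rules above can be written as a small classifier. One case is not covered by the notation: a tie in FEI when LLM(2)'s revision was adopted. The sketch treats that case as Invalid, which is an assumption; the function names are also illustrative.

```python
def classify_debate(adopted: str, fei_llm1: float, fei_llm2: float) -> str:
    """Label one debated sample as 'Valid' or 'Invalid'.

    adopted is 'LLM(1)' (initial forecast retained) or
    'LLM(2)' (revised forecast adopted).
    """
    if adopted == "LLM(2)":
        # Valid only if the adopted revision was strictly more accurate.
        # Assumption: a tie counts as Invalid.
        return "Valid" if fei_llm2 > fei_llm1 else "Invalid"
    if adopted == "LLM(1)":
        # Valid if the retained forecast was at least as accurate.
        return "Valid" if fei_llm1 >= fei_llm2 else "Invalid"
    raise ValueError(f"unknown decision path: {adopted!r}")


def total_valid_ratio(valid: int, invalid: int) -> float:
    """Share of effective debates among all debated samples."""
    return valid / (valid + invalid)
```

As a check against the Crypto figures reported in this subsection, 89 valid and 9 invalid debates give `total_valid_ratio(89, 9)`, which rounds to 0.908, matching the stated 90.8%.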

High Total Valid and High Total Valid Ratio:
Crypto assets (Total Valid: 89; Ratio: 90.8%) show “high-frequency high-accuracy” performance, with (F2) correcting (F1)’s systematic biases via the cross-debate mechanism. The dual-agent debate framework demonstrates superior adaptability in volatile, nonlinear markets (e.g., Crypto), achieving 89 valid corrections vs. nine invalid cases.
Moderate Total Valid and Moderate Total Valid Ratio:
Commodities (Total Valid: 71; Ratio: 69.6%), Exch. Rates (Total Valid: 68; Ratio: 64.8%), and 10Y Yield (Total Valid: 68; Ratio: 64.8%) exhibit moderate validity with high debate frequency. (F2) acts as a “precision sniper,” focusing on trend reversals while trusting (F1)’s baseline judgments for Commodities, Exch. Rates, and 10Y Yield.
Low Total Valid and Low Total Valid Ratio:
Indices (Total Valid: 29; Ratio: 27.9%) highlight over-correction risks. The combination of high debate frequency and low validity (Total Valid: 29; Total Invalid: 75) suggests that (F2)’s interventions introduce noise in efficient markets. Model optimization requires reducing debate thresholds to avoid overfitting.

4.2.4. Forecast Performance Differential Analysis

Based on the afore-described win frequency information, Table 13 expands the analysis to quantify forecast performance disparities. Daily Item-Level FEI scores are computed as seven-day arithmetic averages, where differences (Diff. = F1 − F2) represent performance gaps. Quartile Analysis categorizes the performance disparities into four intensity tiers (Panels A~D), reflecting varying degrees of model superiority. Panel A (Top 75~100%) underscores F2’s most pronounced performance advantages, while Panel D (Bottom 0~25%) reflects negligible performance discrepancies.

High-Intensity Interval (Panel A):
In the top 25% of samples showing the most significant performance gaps, F2 exhibits superior dominance: During Test (T + 7), F2 achieves 16 victories compared to F1’s three. Notably, Crypto (11 wins) and 10Y Yield (three wins) are key cases. This highlights F2’s predictive correction capability in Crypto and 10Y Yield markets where F1 struggles with only one marginal win.
Moderate-Intensity Interval (Panel B):
As the |Diff.| value decreases, F1’s competitiveness increases. F1 leads F2 by 10 to nine wins, with a marginal gap. Upon closer examination of asset distribution, F2 demonstrates strong domain advantages in Crypto (2) and 10Y Yield (5), and F1 in Indices (4).
Moderate-Low-Intensity and Low-Intensity Intervals (Panels C, D):
As the |Diff.| value decreases further, F1’s competitiveness declines slightly. As shown in Panels C and D, F2 leads F1 narrowly, by 11 to 7 and 10 to 8 wins, respectively. This highlights F2’s core strength in making substantial adjustments in Exch. Rates and Commodities, rather than Indices.

As shown in Table 13, this study validates that F2 exhibits a higher overall win rate and shows substantial enhancements in forecast accuracy across diverse scenarios, particularly in Cryptocurrency (Crypto) markets and 10-Year Government Bond Yield (10Y Yield) assets, showcasing its debate mechanism advantages. Figure 14 visualizes the performance differentials (Diff.) between F1 and F2 across five asset categories, with the zero-baseline distinguishing their relative strengths. Specifically, hollow bars represent F1’s superiority, and solid black bars emphasize F2’s dominance. In sum, from this analysis we can derive the following key observations.

Systematic Dominance in Cryptocurrencies:
F2 demonstrates its superiority in Crypto and parts of 10Y Yield blocks, with dense black bar visuals indicating significant error reduction (up to ~20%). This validates its debate mechanism as a volatility stabilizer, outperforming F1 in both win count and forecast error reduction for high-volatility, nonlinear, and policy-sensitive markets.
Traditional Assets Showing Competitive Parity:
For the traditional asset categories such as Exchange Rates and Commodities, F1 and F2 demonstrate alternating dominance with minimal error reduction fluctuations (mostly within ±10%). This reflects “competitive parity” between the two models in traditional assets. Notably, F1 retains a slight edge in highly efficient, mean-reverting markets like US100 and US500 (hollow bars in Indices). This underscores the resilience of benchmark forecasts/random walk hypothesis in efficient markets, where excessive debate interventions may introduce noise.

4.3. Analysis and Testing: Structural Models and Hypotheses

The credibility of experiments conducted by this study is validated through comprehensive statistical analysis by implementing two test phases. Firstly, this study examines two datasets of Daily Item-Level FEI metrics from Test (T + 7) (on F1 and F2) using SPSS (see Table 14 and Table 15). Secondly, this study uses paired sample t-tests to evaluate model correlations and the statistical significance of forecast discrepancies, excluding random error and validating robust outcomes. In addition, the non-parametric Wilcoxon signed-rank test was conducted to double-check the findings of the paired sample t-test. Subsequently, this study quantifies the volatility of 75 FMIs’ asset prices to highlight F2’s adaptability across diverse closing price environments (see Table 16).

4.3.1. Analysis of Paired Sample Correlation

The paired sample correlation coefficients between F1 and F2 in Test (T + 7) are presented in Table 14. The results indicate strong positive correlations (r = 0.843~0.911) across all forecast periods (Day 1~Day 7), with statistically significant p-values (<0.001) across all outcomes. This underscores F2’s ability to maintain F1’s foundational forecasting framework while enhancing accuracy without modifying the core structural assumptions.
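For readers who want to reproduce this kind of check outside SPSS, the following is a minimal sketch of the Pearson correlation between paired F1 and F2 scores. The function name and the toy inputs are illustrative assumptions and are not taken from the study's data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# A uniform upward shift leaves the correlation at 1.0: correlation measures
# co-movement, not the level difference between the two forecasts.
f1_scores = [0.70, 0.75, 0.80, 0.85]
f2_scores = [s + 0.03 for s in f1_scores]
r = pearson_r(f1_scores, f2_scores)
```

Because a high r says nothing about whether one series sits above the other, the correlation analysis has to be complemented by a test on the paired differences.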

4.3.2. Paired Sample t-Test

This study assesses the statistically significant differences in forecast performance between F1 and F2 through a paired sample t-test analysis. The results are shown in Table 15.

Mean and Stability:
In all forecast days of Test (T + 7), F2’s FEI mean consistently exceeds F1’s, with F2’s FEI SD noticeably lower than F1’s. This indicates F2 not only has higher forecasting accuracy but also exhibits greater stability through reduced dispersion in results.
Statistical Significance:
T-test results confirm significant performance differences (see Table 15), with p < 0.05 on Day 1 (p = 0.013), Day 3 (p = 0.015), Day 4 (p = 0.022), Day 6 (p = 0.019), and Day 7 (p = 0.008). In Test (T + 7), the negative mean paired differences (F1 minus F2) in Table 15 support F2’s higher FEI, highlighting the impact of the debate mechanism on forecast accuracy.
Additional Wilcoxon signed-rank test:
Non-parametric validation via the Wilcoxon signed-rank test corroborates F2’s superiority. The daily p-values (Day 1 to Day 7: 0.047, 0.094, 0.025, 0.008, 0.022, 0.004, 0.002) show significant improvements (p < 0.05) on 6 out of 7 days, including high significance (p < 0.01) on Days 4, 6, and 7. Day 2 (p = 0.094) does not reach the 0.05 level, which may reflect ordinary real-world market noise; taken together, the results provide compelling evidence that the debate mechanism noticeably enhances forecast accuracy.
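The two test statistics used above can be sketched with the Python standard library. This is an illustrative assumption about how such a check could be scripted, not the SPSS procedure used in the study. The sketch computes only the statistics: the paired t statistic with its degrees of freedom, and the Wilcoxon positive and negative rank sums. Obtaining p-values would additionally require the reference distributions, for example via `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon`.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(f1, f2):
    """Return (t statistic, degrees of freedom) for the paired differences F1 - F2."""
    d = [a - b for a, b in zip(f1, f2)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n)), n - 1


def wilcoxon_rank_sums(f1, f2):
    """Return (W_plus, W_minus), the rank sums of positive and negative differences.

    Zero differences are dropped and tied |d| values receive their average rank.
    """
    d = [a - b for a, b in zip(f1, f2) if a != b]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    w_minus = sum(r for r, x in zip(ranks, d) if x < 0)
    return w_plus, w_minus
```

With the article's sign convention (F1 minus F2), a negative t statistic and a W_minus much larger than W_plus both point toward F2 having the higher FEI.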

4.3.3. Distribution of Closing Price Volatility Intensity

As shown in Table 16, this study utilizes Quartile Analysis to classify assets by volatility metrics (i.e., STDEV.S) based on 75 FMIs. Assets are grouped into four volatility tiers/quartiles: Top Tier (100~75%, i.e., the highest volatility), Second Tier (75~50%), Third Tier (50~25%), and Bottom Tier (25~0%, i.e., the lowest volatility), demonstrating F2’s adaptability in varying market conditions.

High-Volatility Concentration:
Cryptocurrencies dominate the Top Tier (STDEV.S ≥ 1.33%), with 14 of the 15 Crypto assets falling in this tier, which positions them as the representative “high-risk, high nonlinearity” market.
Medium-to-Low Volatility Distribution (the Second, Third, and Bottom Tiers):
Commodities (9/15) dominate the Second Tier (1.33~0.68%), Indices (9/15) the Third Tier (0.68~0.38%), and Exch. Rates (12/15) the Bottom Tier (0.38~0%).
Performance Alignment with Environmental Considerations:
These findings are consistent with the earlier results (Table 11, Table 13, and Figure 14). Table 16 underscores Crypto’s elevated volatility (≥1.33%), which aligns with F2’s dominance: 11/15 wins in Table 13 (Panel A) and a minimum forecast performance gain of 8.60%. Figure 14 also confirms this trend, demonstrating that F2’s debate-guided Optimization Variables surpass F1 in volatile markets, highlighting both practical and academic significance.
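The tier assignment can be sketched as below. The cut-offs (1.33%, 0.68%, 0.38%) are those reported for Table 16; everything else is an assumption made for illustration. In particular, this section does not state which series STDEV.S is applied to, so the sketch assumes the sample standard deviation of daily percentage changes in closing price, and it assumes that values exactly on a boundary fall into the higher tier.

```python
from statistics import stdev

# Lower bounds (in percent) of the upper three tiers, as reported for Table 16.
TIER_BOUNDS = [(1.33, "Top"), (0.68, "Second"), (0.38, "Third")]


def daily_pct_changes(closes):
    """Day-over-day percentage changes of a closing-price series (assumed input)."""
    return [100.0 * (b - a) / a for a, b in zip(closes, closes[1:])]


def volatility_tier(sd_pct):
    """Map a sample standard deviation (in percent) to its volatility tier."""
    for lower, name in TIER_BOUNDS:
        if sd_pct >= lower:
            return name
    return "Bottom"


# Hypothetical closing prices, not taken from the study's data.
closes = [100.0, 101.0, 100.0]
tier = volatility_tier(stdev(daily_pct_changes(closes)))
```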

5. Concluding Remarks

Statistical validation of F2 (debate process), utilizing the Dual-Agent LLM Debate Mechanism, reveals its capability to predict FMIs under the HIPES framework. Analysis of 75 FMIs combined with paired sample t-tests via SPSS substantiated the mechanism’s substantial positive influence on forecast accuracy (in terms of Overall Average FEIs, Daily Average FEIs, and Daily Item-Level FEIs) as well as the Breadth of Forecast Benefit (75 FMIs). Findings show that F2, enhanced through the LLM debate mechanism, noticeably reduces overall forecast error and outperforms F1, especially in high-volatility asset classes.

Our findings indicate that F2 surpasses F1 in overall forecast accuracy. F2’s Daily Average FEIs and Overall Average FEIs are closer to 1, indicating superior accuracy. By employing the Dual-Agent LLM Debate Mechanism, the F2 model captures market dynamics more effectively, yielding forecasts more closely aligned with actual trends. F2’s incorporation of cross-debate as an Optimization Variable improves accuracy and stability in FMI forecasting. F2’s Overall Average FEI reaches 0.808 in Test (T + 7), while F1’s is 0.777 (see Table 8), an absolute improvement of 0.031 (3.1 percentage points). This validates RQ1: the ‘F2 Consensus Value’ produced via the Dual-Agent LLM Debate Mechanism, relative to the ‘F1 Forecast Value’ based on a single LLM1 interpretation of HIPES text, noticeably enhances forecast accuracy and stability.
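As a worked check of the reported figures, the improvement corresponds to the absolute difference between the two Overall Average FEIs from Table 8. The relative gain is not reported in the article and is computed here only for comparison.

```latex
\Delta_{\text{abs}} = 0.808 - 0.777 = 0.031,
\qquad
\Delta_{\text{rel}} = \frac{0.808 - 0.777}{0.777} \approx 0.0399 \;(\approx 4.0\%)
```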

Experimental findings reveal substantial disparities in forecast accuracy between F1 and F2 across asset categories. When performance gaps surpass 8.60%, F2 wins noticeably more items (16 vs. 3), with pronounced strengths in high-volatility assets such as Cryptocurrencies (Crypto) and 10-Year Government Bond Yields (10Y Yield), as detailed in Table 11, Table 13, and Figure 14. This highlights not only a higher frequency of Forecast Value Proximity but also larger Forecast Performance Gains. Indeed, recent literature corroborates our finding that Cryptocurrencies experience extreme volatility, heavily driven by external news and public opinion [48]. To overcome the forecasting difficulties caused by these abrupt swings, our HIPES mechanism injects a shared ‘Integrated Context’ into both LLM(1) and LLM(2) to establish a stable and standardized analytical baseline. Conversely, F1 demonstrates robust explanatory power in core assets such as Indices, with verified strengths in forecast accuracy. This echoes the Adaptive Market Hypothesis [49], highlighting the dynamic characteristics of market efficiency. The disparity in Crypto may arise from cryptocurrency markets’ pronounced sensitivity to political sentiment and speculative dynamics, where F2’s Optimization Variable efficiently captures nonlinear volatility patterns. Indices, however, are governed by traditional supply–demand dynamics, allowing F1’s intuitive forecast to surpass F2 due to reduced noise. This affirms RQ2: Empirical results demonstrate unique predictive strengths for F1 and F2 within various asset categories (e.g., Commodities, Stock Market Indices, Foreign Exchange Rates, Cryptocurrencies, 10Y Government Bond Yields), confirming their tailored performance in diverse financial market contexts.

This study shows that F2 exhibits the advantages of the debate mechanism (see Table 12). In Crypto assets, F2 demonstrated notable “high-frequency high-accuracy” effectiveness (89 effective vs. nine ineffective instances). This confirms that in highly volatile markets, more frequent debate intervention enhances overall correction effectiveness, compensating for the forecast blind spots of the single model (F1). Conversely, Indices (29 effective vs. 75 ineffective instances) highlight the “more debate, more errors” risk of over-correction in the LLM Debate Forecast (F2). This counterexample demonstrates that debate mechanisms are not universally applicable; they must be applied selectively to assets with high sensitivity and debate effectiveness to avoid introducing noise and to achieve meaningful benefits.

The divergent forecast strengths of F1 and F2 suggest that no single model fits all 75 FMIs. Together, F1 and F2 constitute a complementary strategy: F1 is optimal for stable assets (e.g., Indices), while F2 is particularly effective for High-Volatility Assets (e.g., Cryptocurrencies and 10-Year Yield). While F2 demonstrates overall statistical significance, F1’s stable performance on stable assets warrants attention. These findings underscore the shortcomings of single-model frameworks and offer vital empirical support for advancing an Optimized Portfolio Strategy in future research.

6. Implications

This study demonstrates that moving beyond traditional machine learning’s dependence on historical price data and integrating unstructured data (e.g., real-time HIPES data, news text) can improve forecast accuracy. By leveraging environmental data, such as HIPESs, as market forecast inputs and innovating through the Dual-Agent Cross-Debate Mechanism, organizations can achieve enhanced forecast accuracy in financial markets. This finding encourages managers to strategically integrate cross-debate between LLMs and unstructured data as a novel approach to uncover “more advanced forecasting methodologies,” rather than remaining constrained by conventional forecast frameworks.

This study advocates adopting a customized portfolio strategy based on empirical evidence of complementary model strengths. For example, stable assets such as Indices are suitable for using F1, whereas High-Volatility Assets like Cryptocurrencies and 10-Year Yield align better with F2. Assets such as Exchange Rates and Commodities can benefit from the combined framework of F1 and F2. Integrating both models may enhance predictive accuracy and overall forecasting effectiveness.
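The category-based selection described above could be sketched as a simple routing rule. This is a hypothetical illustration only: the article does not specify how the “combined framework” for Exchange Rates and Commodities should be computed, so the equal-weight average used here is an assumption, as are the function and constant names.

```python
# Category labels follow the article's asset categories.
F1_CATEGORIES = {"Indices"}
F2_CATEGORIES = {"Crypto", "10Y Yield"}


def select_forecast(category, f1, f2):
    """Pick the forecast to use for an asset, given its category.

    Stable assets use F1, high-volatility assets use F2, and every other
    category uses an equal-weight blend (assumption, not from the article).
    """
    if category in F1_CATEGORIES:
        return f1
    if category in F2_CATEGORIES:
        return f2
    return (f1 + f2) / 2
```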

This research introduces F2, a high-reward, low-entry barrier implementation model capable of capturing extreme market scenarios to deliver timely Debate Forecast, while maintaining low deployment expenses. This study integrates general-purpose LLM platforms (e.g., Gemini and ChatGPT) with foundational data handling systems (e.g., Google Sheets). This approach enables financial forecasters to implement the forecast framework with low computational overhead and reduced development costs.

This study also shows that integrating Cross-Debate Consensus Based Optimization Variables (CDCBOVs) into LLMs provides governments, enterprises, and individual investors with an AI-driven analytical framework for financial market forecasting. This approach can be adopted by financial service practitioners to enable rapid, strategic decision-making that transcends conventional market intuition, particularly in response to abrupt, high-impact political and economic events (e.g., HIPESs).

Existing studies have largely overlooked methods to strengthen the economic reasoning abilities of LLMs. This research highlights that combining Integrated Context (comprising HIPES transcripts) with CDCBOVs (facilitated by the dual-LLM cross-debate mechanism) can generate Optimization Variables and then attain consensus Forecast Values (F2), thereby substantially improving the forecast accuracy of FMIs. By addressing existing academic research gaps, this study demonstrates the concept, design, and implementation, supported by empirical evidence, with respect to using LLMs to achieve the optimization of financial market forecasting.

This study addresses the complex forecasting challenge of High-Volatility Assets—specifically Cryptocurrencies—by demonstrating that CDCBOVs serve as critical mechanisms for mitigating model biases. Our research findings underscore the distinct academic significance of the LLM Cross-Debate Consensus-Based Forecasting Initiative in filtering extreme market noise and address a significant theoretical gap regarding how LLMs equipped with CDCBOVs can mitigate volatile market conditions.

7. Limitations and Future Studies

This research centers on 75 FMIs obtained from tradingeconomics.com, utilizing their closing prices as the Realized Value (RV), with a sample evenly distributed across five primary asset categories. To strengthen the model’s adaptability, future work may eliminate this sampling limitation, enabling users to customize and expand the forecasted targets beyond the 75 FMIs based on specific requirements. This less limited approach can not only validate the model’s flexibility but also enable the evaluation of its forecasting performance across various FMI categories.

This study currently examines U.S. political and economic statements within a defined timeframe, constrained by limited information sources. Future work could broaden the experimental framework by integrating heterogeneous and real-time Integrated Context. Researchers may also extend the data scope to encompass other countries’ political and economic statements, science and technical news, corporate earnings, and social sentiment. Additionally, testing modifications to data collection frequency and time granularity could explore whether varying temporal windows enhance forecast performance.

Subsequent research could investigate multimodal and behaviorally driven Optimization Variables, leveraging CDCBOVs’ demonstrated accuracy and performance enhancements [3,33]. Integrating voice tone, micro-expression analysis, and behavioral finance indicators in subsequent studies could enable the transformation or progression of nonverbal market sentiment into structured datasets or knowledge frameworks, thereby enhancing LLMs’ capabilities in terms of analyzing and coping with radical market fluctuations.

Due to limited research resources, this study conducts the Triple-C (Cross-Validation, Cross-Debate, and Consensus-Building) debate experiments only in the Gemini and ChatGPT model environments. The implementation and findings entail several key limitations and corresponding directions for future research. First, regarding the operational environment, both LLM(1) and LLM(2) were executed via premium web-based GUIs; consequently, all hyperparameters (e.g., Temperature) were constrained to system defaults, precluding code-level fine-tuning. Second, due to this reliance on a web-based environment and the inherent non-determinism of commercial models, exact verbatim reproducibility remains restricted, despite the robustness of our methodological framework. Third, regarding the scope of baselines, this experiment was designed as an internal comparative study aimed at confirming the superiority of the debate mechanism (F2) over the single-pass baseline (F1), and it has not yet been compared against traditional econometric models or other algorithmic baselines. However, having established F2 > F1 in this exploratory proof-of-concept, future research can extend this architecture into more innovative workflows to further maximize the advantages of F2. Finally, because commercial models are subject to continuous and opaque updates, longitudinally tracking whether the absolute forecasting accuracy of F1 and F2 improves consistently over time remains a critical task for future studies.

Future research may replicate our experiments on various platforms under diverse LLM frameworks, such as Grok, Claude, and other LLMs. Cross-model validation can substantiate the generalizability of the “Dual-Agent LLM Debate Mechanism”, while facilitating a comprehensive examination of the potential advantages of specific LLMs for forecasting the dynamic trends of financial market indicators in various financial asset categories. Additionally, for future comprehensive performance validations, researchers can explore a variety of practical baselines (including, but not limited to, self-consistency sampling, majority voting across multiple independent runs, and a simple ‘multi-run best-of’ selection baseline) to better isolate the underlying mechanics of performance improvement.

Author Contributions

Conceptualization, S.E.C. and K.-C.C.; methodology, S.E.C. and K.-C.C.; software, K.-C.C.; validation, S.E.C. and K.-C.C.; formal analysis, S.E.C. and K.-C.C.; investigation, S.E.C. and K.-C.C.; resources, S.E.C. and K.-C.C.; data curation, S.E.C. and K.-C.C.; writing—original draft, S.E.C. and K.-C.C.; writing—review and editing, S.E.C. and K.-C.C.; visualization, K.-C.C.; supervision, S.E.C.; project administration, S.E.C.; funding acquisition, S.E.C. and K.-C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Chung Hsing University, Taiwan.

Data Availability Statement

Due to legal and business contract constraints, the data presented in this study are available only upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A supplements Figure 4 and Figure 5, as well as Table 2 in this article, by providing a concrete, step-by-step case study illustrating how F1 is systematically mapped to ‘Optimization Variables’ within our ‘Triple-C framework’ for generating F2.

First, Figure A1 specifically demonstrates the workflow of Stage 1, Cross-Validation. It shows how (F1), generated via the LLM(1) Direct Forecast, is fed into LLM(2) to execute the Cross-Validation process, systematically transforming the initial data structure from [i]:[F1] into a comprehensively evaluated format of [i]:[F1]/[SC1]/[ST1]/[RE1]/[RN1].

Figure A1. Cross-Validation workflow and confidence score (SC) trigger mechanism.

Secondly, Figure A2 shows the Stage 2 (Cross-Debate) Loop, illustrating the interactive execution between LLM(1) and LLM(2). It explicitly demonstrates how the data structure is systematically transformed from [i]:[F1]/[SC1]/[ST1]/[RE1]/[RN1] into the updated array [i]:[RN1]/[SC2]/[ST2]/[RE2]/[RN2]. Capped at three rounds, this iterative Cross-Debate process dynamically injects the Opponent’s conflicting reasoning back into the prompt as active Optimization Variables to facilitate logical correction.

Finally, Figure A3 demonstrates Stage 3 (Consensus Building), outlining how iterative cognitive conflicts are resolved, culminating in the stable convergence of these variables into the final forecast (F2).

Figure A2. Cross-Debate Loop and dynamic parameter injection process.

Figure A3. Consensus Building and final forecast (F2) convergence.

The primary goal of providing Figure A1, Figure A2 and Figure A3 and the Triple-C framework, together with their corresponding operational workflows, is to enhance transparency and show that the proposed dual-agent debate mechanism does not require obscure, proprietary code. Instead, by strictly following a set of standardized procedural steps, an independent researcher can execute and reproduce this logic-collision workflow using standard LLM interfaces. The Triple-C framework is divided into three distinct stages, described below:

1.: The Cross-Validation stage (see Figure A1):
This stage conducts the systematic evaluation of the initial baseline forecast (F1) by the Opponent (LLM2). Its workflow operates under a strict five-step protocol:
(1). Read F1 Value: The Opponent inputs and processes the Proponent’s direct baseline forecast (F1) for a specific FMI.
(2). Give Confidence Score: The Opponent evaluates F1 and assigns a quantitative confidence score ([SC], bounded 0~1).
(3). Detection Status: An algorithmic check determines the stance ([ST]): assigned as “Agreed” if SC ≥ 0.7, or “Disagree” if SC < 0.7.
(4). Give Reasoning: If ST = Disagree, the Opponent generates a textual rationale ([RE]) articulating why it rejects the Proponent’s logic.
(5). Give Revised Number: If ST = Disagree, the Opponent calculates and outputs a mathematically reasonable revised forecast ([RN]). Conversely, if ST = Agreed, the system defaults to retaining the F1 value as the final [RN].

In the process, the entire set of newly generated parameters dynamically injected into the LLM’s prompt is collectively defined as ‘Optimization Variables’.
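The five-step Cross-Validation protocol can be sketched as follows. The 0.7 threshold, the stance labels, and the fallback to F1 on agreement come from the steps above; the `Evaluation` record, the function names, and the idea of passing the Opponent's raw outputs in as arguments are hypothetical conveniences, since the study itself ran the steps through web-based LLM interfaces rather than code.

```python
from dataclasses import dataclass

SC_THRESHOLD = 0.7  # stance cut-off stated in Step 3


@dataclass
class Evaluation:
    sc: float  # confidence score, bounded 0~1
    st: str    # "Agreed" or "Disagree"
    re: str    # textual reasoning (empty when agreed)
    rn: float  # revised number


def stance(sc):
    """Step 3: derive the stance from the confidence score."""
    if not 0.0 <= sc <= 1.0:
        raise ValueError("confidence score must lie in [0, 1]")
    return "Agreed" if sc >= SC_THRESHOLD else "Disagree"


def cross_validate(f1, sc, reasoning="", revised=None):
    """Steps 2-5: turn the Opponent's raw outputs into an [SC]/[ST]/[RE]/[RN] record."""
    st = stance(sc)
    if st == "Agreed":
        # Step 5, agree branch: retain the F1 value as the final RN.
        return Evaluation(sc, st, "", f1)
    if revised is None:
        raise ValueError("a Disagree stance requires a revised number")
    return Evaluation(sc, st, reasoning, revised)
```

For example, `cross_validate(100.0, 0.5, "too high", 95.0)` yields a "Disagree" record carrying the revised number 95.0, whereas `cross_validate(100.0, 0.8)` keeps the original 100.0.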

2.: The Cross-Debate Loop stage (see Figure A2):
This stage illustrates the core variance-control mechanism and the dynamic parameter injection process. It is important to note that throughout this loop, both LLMs operate under a symmetrical protocol, applying the exact same methodology for each exchange. The iterative process follows a strict eight-step algorithmic sequence:
(1). Detection SC ≥ 0.7: The receiving model (e.g., LLM 1) reads the confidence score ([SC]) generated by the opposite model in the previous stage. If SC ≥ 0.7, the debate terminates; otherwise, it proceeds.
(2). Detection Status: The model verifies the opposite model’s status ([ST]). If ST = Agreed, the loop halts; if ST = Disagree, the active cross-debate initiates.
(3). Read Reasoning: The model ingests the opposite model’s refutation reasoning ([RE]), utilizing this explicitly structured conflicting rationale as the new contextual basis for logical correction.
(4). Read RN Value: The model reads the revised numerical forecast ([RN]) proposed by the opposite model.
(5). Give Confidence Score: The model critically evaluates the opposite model’s [RN] and assigns a new confidence score ([SC], bounded 0~1).
(6). Detection Status: Based on the newly generated score, the model determines its own stance ([ST]): assigned as “Agreed” if SC ≥ 0.7, or “Disagree” if SC < 0.7.
(7). Give Reasoning: If ST = Disagree, the model generates explicit text-based reasoning ([RE]) articulating its counter-argument and why it rejects the opposite model’s stance.
(8). Give Revised Number: If ST = Disagree, the model computes and outputs a newly revised numerical forecast ([RN]) justified by its reasoning. Conversely, if initially ST = Agreed, the system automatically retains the previous round’s numerical value as the final [RN].

In the process, the entire set of newly generated parameters, which are dynamically injected into the LLM’s prompt, is collectively defined as ‘Optimization Variables’. Throughout the Cross-Debate Loop (Rounds 1~3), both LLM(1) and LLM(2) operate under a symmetrical protocol, applying the exact same standardized methodology to iteratively evaluate and refine their outputs.
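The loop logic of this stage can be sketched as below. The threshold, the three-round cap, and the step order follow the protocol above. The two callables stand in for the actual LLM calls and are hypothetical, as is the record layout. The sketch also assumes that one round is a single turn by one model and that the Proponent responds first to the Opponent's Stage 1 critique; the article's own round accounting may differ.

```python
SC_THRESHOLD = 0.7
MAX_ROUNDS = 3


def debate(first_eval, proponent, opponent):
    """Run the Cross-Debate Loop and return the full history of records.

    first_eval: dict with keys "sc", "st", "re", "rn" from Stage 1.
    proponent, opponent: callables (reasoning, rn) -> (sc, reasoning, rn)
    that stand in for the two LLMs (hypothetical interface).
    """
    history = [first_eval]
    speakers = [proponent, opponent]  # assumed speaking order
    for rnd in range(MAX_ROUNDS):
        last = history[-1]
        # Steps 1-2: stop as soon as the previous speaker agreed.
        if last["sc"] >= SC_THRESHOLD or last["st"] == "Agreed":
            break
        # Steps 3-5: read the other side's RE and RN, then score that RN.
        sc, reasoning, rn = speakers[rnd % 2](last["re"], last["rn"])
        if sc >= SC_THRESHOLD:
            # Steps 6 and 8, agree branch: keep the previous round's number.
            history.append({"sc": sc, "st": "Agreed", "re": "", "rn": last["rn"]})
        else:
            # Steps 6-8, disagree branch: record the counter-argument and new number.
            history.append({"sc": sc, "st": "Disagree", "re": reasoning, "rn": rn})
    return history
```

Each appended record is what the next speaker reads, which mirrors the injection of the previous output into the following prompt as Optimization Variables.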

3.: The Consensus Building stage (see Figure A3):
This stage executes the final convergence process where the iterative debate resolves cognitive conflict to produce the stable, relatively accurate final forecast (F2). It ensures a deterministic output through a strict two-step resolution protocol:
(1). Give Consensus Value (F2): The system generates the final quantitative forecast (F2) based on the debate’s termination condition. If logical consensus (SC ≥ 0.7 or ST = Agreed) is achieved prior to or upon the conclusion of the three-round limit, F2 directly adopts the final revised numerical forecast ([RN]) from that concluding round of debate. Conversely, if the maximum three-round limit is exhausted without reaching mutual agreement, an algorithmic fallback is triggered, and F2 is calculated as the arithmetic mean of the final [RN] values proposed by both LLMs.
(2). Give Status: The system logs an explicit convergence status tag to indicate the derivation method of F2, ensuring complete procedural traceability. If F2 resulted from a mutual logical agreement, the status is tagged as “Consensus.” If F2 was generated via the mathematical averaging fallback due to unresolved conflict after three rounds of debate, the status is tagged as “Mean.”
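The two-step resolution rule can be sketched as a single function. The consensus condition, the arithmetic-mean fallback, and the two status tags are taken from the steps above; the tuple layout and the assumption that debate turns alternate between the two LLMs (so that the last two entries hold each model's final RN) are illustrative.

```python
from statistics import mean

SC_THRESHOLD = 0.7


def consensus(turns):
    """Return (F2, status) from the debate history.

    turns: list of (sc, st, rn) tuples in debate order, with the two LLMs
    assumed to alternate, so the last two entries hold each model's final RN.
    """
    sc, st, rn = turns[-1]
    if sc >= SC_THRESHOLD or st == "Agreed":
        # Agreement reached: adopt the RN of the concluding round.
        return rn, "Consensus"
    # No agreement after the final round: average both models' final RN values.
    return mean([r for _, _, r in turns[-2:]]), "Mean"
```

Logging the status tag next to F2 keeps it visible whether a given forecast came from genuine agreement or from the averaging fallback.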

The sequential steps illustrated in Figure A1 (Steps 1–5), Figure A2 (Steps 1–8), and Figure A3 (Steps 1–2) encapsulate the core conceptual logic of the prompts designed for this study. By cross-referencing these visual workflows (see Figure A1, Figure A2 and Figure A3) with the prompt structural framework provided in Table 2, researchers can readily adapt and customize the prompts to suit their specific linguistic preferences and operational requirements.

It is crucial to emphasize that while this specific experiment utilized Google Gemini 3 Pro and OpenAI ChatGPT 5.2 Thinking, the fundamental Triple-C architecture is inherently model-agnostic. The consistency lies in the standardized experimental workflow, not in the specific models employed. We strongly encourage researchers and financial practitioners to embrace this exploratory spirit. By adopting this transparent framework, future studies can seamlessly substitute different LLMs, adjust the Optimization Variables, and develop their own robust forecasting methodologies tailored to diverse financial or analytical domains. Our overarching goal is to prove that the technological exploration of LLMs—when guided by a scientifically designed procedural framework—transcends mere prompt engineering, providing universally reproducible methodologies for future research.

Appendix B

Given the comprehensive architecture and intricate micro-level details of the proposed framework, it is easy for researchers to become bogged down in granular specifics. Therefore, Appendix B is explicitly designed to provide a top-level, macro perspective. Figure A4 serves as a master navigation blueprint, summarizing the overall logical flow and execution stages for both the Market Forecast (F1) and the Triple-C debate mechanism (F2).

By presenting this overarching workflow, Appendix B effectively reinforces the detailed mechanisms outlined in Appendix A. It allows researchers to bypass overly rigid specifics and instantly grasp the primary objective of each phase through simplified descriptions. This empowers researchers to construct their own prompts using their preferred languages, syntaxes, or phrasing to guide the LLMs effectively. Ultimately, it provides a highly user-friendly reference to easily locate the appropriate diagrams (Figure 8, Figure A1, Figure A2 and Figure A3) while maintaining the framework’s adaptability across diverse global environments.

To facilitate this, the prompt engineering logic for each stage is distilled into two guiding principles: (1) Core Directive (the essential mindset for the prompt) and (2) Execution Rule (the structural constraint).

Figure A4. Structural blueprint and guiding principles for prompt engineering.

(F1) Market Forecast Stage:
As mapped in Figure 8, this involves instructing LLM(1) to analyze the HIPES data and generate initial predictions for the target FMIs.
(1). Core Directive: The researcher’s prompt logic is to command LLM(1) to produce the baseline forecast.
(2). Execution Rule: It strictly follows a single-turn execution (one round) to obtain the initial F1 value.
(F2) Triple-C Stage 1/3 (Cross-Validation):
As detailed in Figure A1, this step directs LLM(2) to act as an independent critic, evaluating and validating the initial forecast generated in F1.
(1). Core Directive: The prompt logic is to compel LLM(2) to critically cross-validate the contents and rationales of LLM(1)’s F1 output.
(2). Execution Rule: It strictly follows a single-turn execution (one round) to validate the F1 value derived by LLM(1).
(F2) Triple-C Stage 2/3 (Cross-Debate Loop):
As defined in Figure A2, this stage establishes the iterative interaction logic, where LLM(1) and LLM(2) actively debate, challenge, and refine each other’s rationales.
(1). Core Directive: The prompt logic facilitates a cross-debate. Crucially, the researcher must inject the output generated from each interaction into the subsequent turn’s prompt as historical context.
(2). Execution Rule: It follows an iterative loop requiring a minimum of one round and capped at a maximum of three rounds of debate.
(F2) Triple-C Stage 3/3 (Consensus Building):
As outlined in Figure A3, this final step instructs LLM(1) to synthesize the entire debate history and construct the ultimate consensus forecast.
(1). Core Directive: The prompt logic restricts LLM(1) to a purely mechanical extraction task rather than subjective evaluation. It strictly directs the LLM to review the final state of the debate and output a binary-like judgment: either extracting the reached “Consensus” or outputting the calculated “Mean”, thereby finalizing the result.
(2). Execution Rule: It strictly follows a single-turn execution (one round) to extract the ultimate F2 value.

References

Fu, K.; Zhang, Y. Incorporating multi-source market sentiment and price data for stock price prediction. Mathematics 2024, 12, 1572.
Yang, K.; Deng, R.; Wei, Y.; Wang, S. The power of ChatGPT in processing text: Evidence from analysis and prediction in the exchange rate markets. Financ. Innov. 2025, 11, 118.
Jain, P.M.K.; Aggarwal, S. News and stock market volatility: A global systematic literature review. TPM Test. Psychom. Methodol. Appl. Psychol. 2025, 32, 1633–1645.
Karpatne, A.; Atluri, G.; Faghmous, J.H.; Steinbach, M.; Banerjee, A.; Ganguly, A.; Shekhar, S.; Samatova, N.; Kumar, V. Theory-guided data science: A new paradigm for scientific discovery from data. IEEE Trans. Knowl. Data Eng. 2017, 29, 2318–2331.
Chang, S.E.; Chung, K.-C. Exploring the use of high-impact political and economic statements in LLM for judging financial market trend—A technical indicator-based approach. Mathematics 2026, 14, 869.
Korinek, A. Generative AI for economic research: Use cases and implications for economists. J. Econ. Lit. 2023, 61, 1281–1317.
Kıcıman, E.; Ness, R.; Sharma, A.; Tan, C. Causal reasoning and large language models: Opening a new frontier for causality. arXiv 2024, arXiv:2305.00050.
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-thought prompting elicits reasoning in Large Language Models. arXiv 2023, arXiv:2201.11903.
Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are zero-shot reasoners. arXiv 2023, arXiv:2205.11916.
Lopez-Lira, A.; Tang, Y. Can ChatGPT forecast stock price movements? Return predictability and Large Language Models. arXiv 2024, arXiv:2304.07619.
Sarker, M.K.; Zhou, L.; Eberhart, A.; Hitzler, P. Neuro-symbolic artificial intelligence: Current trends. AI Commun. 2022, 34, 197–209.
Mitroff, I.I.; Emshoff, J.R. On strategic assumption-making: A dialectical approach to policy and planning. Acad. Manag. Rev. 1979, 4, 1–12.
Johnson, D.W.; Johnson, R.T. Energizing learning: The instructional power of conflict. Educ. Res. 2009, 38, 37–51.
Michael Nussbaum, E. Collaborative discourse, argumentation, and learning: Preface and literature review. Contemp. Educ. Psychol. 2008, 33, 345–359. [Google Scholar] [CrossRef]
Ma, J.; Wang, C.; Rong, L.; Wang, B.; Xu, Y. Exploring multi-agent debate for zero-shot stance detection: A novel approach. Appl. Sci. 2025, 15, 4612. [Google Scholar] [CrossRef]
Gentzkow, M.; Kelly, B.; Taddy, M. Text as data. J. Econ. Lit. 2019, 57, 535–574. [Google Scholar] [CrossRef]
Shiller, R.J. Narrative economics. Am. Econ. Rev. 2017, 107, 967–1004. [Google Scholar] [CrossRef]
Baker, S.R.; Bloom, N.; Davis, S.J. Measuring economic policy uncertainty. Q. J. Econ. 2016, 131, 1593–1636. [Google Scholar] [CrossRef]
Blinder, A.S.; Ehrmann, M.; Fratzscher, M.; de Haan, J.; Jansen, D.-J. Central bank communication and monetary policy: A survey of theory and evidence. J. Econ. Lit. 2008, 46, 910–945. [Google Scholar] [CrossRef]
Bianchi, F.; Gómez-Cram, R.; Kind, T.; Kung, H. Threats to central bank independence: High-frequency identification with twitter. J. Monet. Econ. 2023, 135, 37–54. [Google Scholar] [CrossRef]
Caldara, D.; Iacoviello, M. Measuring geopolitical risk. Am. Econ. Rev. 2022, 112, 1194–1225. [Google Scholar] [CrossRef]
Hassan, T.A.; Schreger, J.; Schwedeler, M.; Tahoun, A. Sources and transmission of country risk. Rev. Econ. Stud. 2024, 91, 2307–2346. [Google Scholar] [CrossRef]
Hassan, T.A.; Hollander, S.; van Lent, L.; Tahoun, A. Firm-level political risk: Measurement and effects. Q. J. Econ. 2019, 134, 2135–2202. [Google Scholar] [CrossRef]
Sautner, Z.; Van Lent, L.; Vilkov, G.; Zhang, R. Firm-level climate change exposure. J. Financ. 2023, 78, 1449–1498. [Google Scholar] [CrossRef]
Loughran, T.I.M.; McDonald, B. When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. J. Financ. 2011, 66, 35–65. [Google Scholar] [CrossRef]
Gu, S.; Kelly, B.; Xiu, D. Empirical asset pricing via machine learning. Rev. Financ. Stud. 2020, 33, 2223–2273. [Google Scholar] [CrossRef]
Wang, Z.; Jiang, J.; Zhan, Y.; Zhou, B.; Li, Y.; Zhang, C.; Yu, B.; Ding, L.; Jin, H.; Peng, J.; et al. Hypnos: A domain-specific large language model for anesthesiology. Neurocomputing 2025, 624, 129389. [Google Scholar] [CrossRef]
Ruan, L.; Jiang, H. Stock price prediction using FinBERT-enhanced sentiment with SHAP explainability and differential privacy. Mathematics 2025, 13, 2747. [Google Scholar] [CrossRef]
Goldstein, I.; Spatt, C.S.; Ye, M. Big data in finance. Rev. Financ. Stud. 2021, 34, 3213–3225. [Google Scholar] [CrossRef]
Hui, X.; Reshef, O.; Zhou, L. The short-term effects of generative artificial intelligence on employment: Evidence from an online labor market. Organ. Sci. 2024, 35, 1977–1989. [Google Scholar] [CrossRef]
Mullainathan, S.; Shleifer, A. The market for news. Am. Econ. Rev. 2005, 95, 1031–1053. [Google Scholar] [CrossRef]
Grimmer, J.; Stewart, B.M. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Anal. 2013, 21, 267–297. [Google Scholar] [CrossRef]
Liberti, J.M.; Petersen, M.A. Information: Hard and soft. Rev. Corp. Financ. Stud. 2019, 8, 1–41. [Google Scholar] [CrossRef]
Davis, A.K.; Ge, W.; Matsumoto, D.; Zhang, J.L. The effect of manager-specific optimism on the tone of earnings conference calls. Rev. Account. Stud. 2015, 20, 639–673. [Google Scholar] [CrossRef]
Pesaran, M.H.; Timmermann, A. Selection of estimation window in the presence of breaks. J. Econom. 2007, 137, 134–161. [Google Scholar] [CrossRef]
Clark, T.E.; McCracken, M.W. Improving forecast accuracy by combining recursive and rolling forecasts. Int. Econ. Rev. 2009, 50, 363–395. [Google Scholar] [CrossRef]
Yan, J.; Huang, Y. MambaLLM: Integrating macro-index and micro-stock data for enhanced stock price prediction. Mathematics 2025, 13, 1599. [Google Scholar] [CrossRef]
Lewis, D.J.; Mertens, K.; Stock, J.H.; Trivedi, M. Measuring real activity using a weekly economic index. J. Appl. Econom. 2022, 37, 667–687. [Google Scholar] [CrossRef]
Athey, S. Beyond prediction: Using big data for policy problems. Science 2017, 355, 483–485. [Google Scholar] [CrossRef]
Andersen, T.G.; Bollerslev, T.; Diebold, F.X.; Labys, P. Modeling and forecasting realized volatility. Econometrica 2003, 71, 579–625. [Google Scholar] [CrossRef]
Patton, A.J. Volatility forecast comparison using imperfect volatility proxies. J. Econom. 2011, 160, 246–256. [Google Scholar] [CrossRef]
Welch, I.; Goyal, A. A comprehensive look at the empirical performance of equity premium prediction. Rev. Financ. Stud. 2008, 21, 1455–1508. [Google Scholar] [CrossRef]
Rapach, D.E.; Strauss, J.K.; Zhou, G. Out-of-sample equity premium prediction: Combination forecasts and links to the real economy. Rev. Financ. Stud. 2010, 23, 821–862. [Google Scholar] [CrossRef]
Diebold, F.X.; Mariano, R.S. Comparing predictive accuracy. J. Bus. Econ. Stat. 1995, 13, 253–263. [Google Scholar] [CrossRef]
Clark, T.E.; West, K.D. Approximately normal tests for equal predictive accuracy in nested models. J. Econom. 2007, 138, 291–311. [Google Scholar] [CrossRef]
Ma, F.; Lyu, Z.; Li, H. Can ChatGPT predict Chinese equity premiums? Financ. Res. Lett. 2024, 65, 105631. [Google Scholar] [CrossRef]
Pelster, M.; Val, J. Can ChatGPT assist in picking stocks? Financ. Res. Lett. 2024, 59, 104786. [Google Scholar] [CrossRef]
Pellicani, A.; Pio, G.; Ceci, M. CARROT: Simultaneous prediction of anomalies from groups of correlated cryptocurrency trends. Expert Syst. Appl. 2025, 260, 125457. [Google Scholar] [CrossRef]
Lo, A. The adaptive markets hypothesis: Market efficiency from an evolutionary perspective. J. Portf. Manag. 2004, 30, 15–29. [Google Scholar] [CrossRef]

Figure 1. Research framework of the dual-path comparative experiment.

Figure 2. Prompt-guided independent inference sessions with 9 isolated chat boxes and 9 dedicated prompts.

Figure 3. Market forecast framework.

Figure 4. Debate process framework.

Figure 5. Debate process principles.

Figure 6. Forecast horizons and rolling validation framework.

Figure 7. Data aggregation (via rolling window).

Figure 8. Mechanism of Market Forecast (F1).

Figure 9. Mechanism of Cross-Validation stage.

Figure 10. Mechanism of Cross-Debate stage (1/2).

Figure 11. Mechanism of Cross-Debate stage (2/2).

Figure 12. Mechanism of Consensus-Building stage.

Figure 14. Performance differentials (Diff.) at the Daily Item Level: by asset category and FEI intensity (7–day mean).

Table 1. Five categories of assets, with each category consisting of 15 FMIs.

Item	Commodities	Indices	Exchange Rates	Crypto	10Y Yield
1	Brent	ASX200	AUDUSD	ADA	Australia
2	Coal	DE40	DXY	ALGO	Brazil
3	Copper	ES35	EURUSD	ATOM	Canada
4	Crude Oil	FR40	GBPUSD	AVAX	Chile
5	Gasoline	GB100	NZDUSD	BCH	China
6	Gold	IBOVESPA	USDBRL	BNB	France
7	Heating Oil	IT40	USDCAD	BTC	Germany
8	Iron Ore CNY	JP225	USDCHF	DAI	India
9	Lumber	MOEX	USDCNY	DOT	Italy
10	Natural Gas	SENSEX	USDINR	ETH	Japan
11	Silver	SHANGHAI	USDJPY	LTC	Russia
12	Soybeans	TSX	USDKRW	MATIC	South Africa
13	Steel	US100	USDMXN	SOL	Switzerland
14	TTF Gas	US30	USDRUB	UNI	United Kingdom
15	Wheat	US500	USDTRY	XRP	United States

Table 2. Specifications and functions of prompts (Prompt 1~Prompt 9).

Process Stage	Agent (Prompt)	Role	Operational Objective (Technical Detail)
Panel A: F1 (Direct Forecast)
Market Forecast	LLM(1) (P1, i.e., Prompt 1)	Proponent	Initialization: Transforms unstructured data into structured theoretical variables via Semantic-to-Theory Mapping to generate the baseline forecast (F1).
Panel B: F2 (Debate Forecast)
Cross-Validation	LLM(2) (P2, i.e., Prompt 2)	Opponent	Scoring and Risk Detection: Challenges F1 logic and assigns an initial confidence score (SC) to trigger the debate path.
Cross-Debate Loop	LLM(1) (P3,5,7, i.e., Prompts 3, 5, and 7)	Proponent	Score Recognition and Defense: Identifies low scores (SC < 0.7), performs Causal Relationship Analysis, and refines the numerical forecast attributes.
Cross-Debate Loop	LLM(2) (P4,6,8, i.e., Prompts 4, 6, and 8)	Opponent	Re-Scoring and Critical Review: Evaluates the revised logic against theoretical constraints and updates the Agreement Score to determine consensus.
Consensus Building	LLM(1) (P9, i.e., Prompt 9)	Proponent	Threshold Verification and Convergence: Verifies the passing score (SC ≧ 0.7) and synthesizes the debate history into the final Robust Consensus (F2).

Table 3. Input–output data flow matrix.

Process Stage	Agent (Prompt)	Input Data	Output Variable	Output File Name
Panel A: F1 (Direct Forecast)
Round 1	LLM1 (P1)	Context (C)	F1	Forecast Values (F1)
Panel B: F2 (Debate Forecast)
1. Cross-Validation
Round 1	LLM2 (P2)	C + F1	RN1	Opponent Context (1)
2. Cross-Debate Loop
Round 1 (1/2)	LLM1 (P3)	C + RN1	RN2	Proponent Context (1)
Round 1 (2/2)	LLM2 (P4)	C + RN2	RN3	Opponent Context (2)
Round 2 (1/2)	LLM1 (P5)	C + RN3	RN4	Proponent Context (2)
Round 2 (2/2)	LLM2 (P6)	C + RN4	RN5	Opponent Context (3)
Round 3 (1/2)	LLM1 (P7)	C + RN5	RN6	Proponent Context (3)
Round 3 (2/2)	LLM2 (P8)	C + RN6	RN7	Opponent Context (4)
3. Consensus Building
Round 1	LLM1 (P9)	Latest RN (RN1~7)	F2	Consensus Values (F2)

Note: See Figures 3, 4 and 5 for further information; “Context” refers to the “Integrated Context”.
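
The prompt sequence in Tables 2 and 3 can be read as a small control loop. The Python sketch below is one possible reading, not the authors' implementation. The names `run_debate`, `Turn`, `Outcome` and the three toy agent functions are hypothetical stand-ins for LLM calls; only the 0.7 passing score, the three-round cap and the ordering of Prompts 1 to 9 are taken from the tables. How F2 is chosen when no consensus is reached after three rounds is an assumption here (the latest value is kept).

```python
from dataclasses import dataclass
from typing import Callable

THRESHOLD = 0.7   # passing score in Table 2 (SC >= 0.7)
MAX_ROUNDS = 3    # Table 3 lists three cross-debate rounds


@dataclass
class Turn:
    score: float  # agreement score (SC) given by the Opponent
    value: float  # number the Opponent endorses after this turn (RN)


@dataclass
class Outcome:
    f1: float
    f2: float
    rounds: int


def run_debate(context: str,
               forecast: Callable[[str], float],
               defend: Callable[[str, float, Turn], float],
               review: Callable[[str, float], Turn]) -> Outcome:
    f1 = forecast(context)                   # Prompt 1: baseline forecast F1
    current = f1
    turn = review(context, current)          # Prompt 2: cross-validation
    rounds = 0
    while turn.score < THRESHOLD and rounds < MAX_ROUNDS:
        rounds += 1
        current = defend(context, current, turn)  # Prompts 3, 5, 7
        turn = review(context, current)           # Prompts 4, 6, 8
    # Prompt 9: keep F1 if it was accepted outright, else take the latest value
    # (assumption: the latest value is also used if round 3 ends without consensus).
    f2 = f1 if rounds == 0 else turn.value
    return Outcome(f1, f2, rounds)


# Toy stand-ins (hypothetical; real agents would be LLM calls).
def toy_forecast(context: str) -> float:
    return 100.0


def toy_review(context: str, value: float) -> Turn:
    target = 110.0
    score = max(0.0, 1.0 - abs(value - target) / 10.0)
    return Turn(score, value if score >= THRESHOLD else target)


def toy_defend(context: str, current: float, turn: Turn) -> float:
    return (current + turn.value) / 2.0  # concede half of the gap


outcome = run_debate("integrated context", toy_forecast, toy_defend, toy_review)
print(outcome)
```

With these toy agents the Opponent rejects F1, rejects the first revision, and accepts the second, so the loop stops after two rounds.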

Table 4. Output formats and input references.

Process Stage	Output Variable	Input Reference (Read)	Format Schema (Write)
Panel A: F1 (Direct Forecast)
Round 1	F1		[i]:[F1]
Panel B: F2 (Debate Forecast)
1. Cross-Validation
Round 1	RN1	[i]:[F1]	[i]:[F1]/[SC1]/[ST1]/[RE1]/[RN1]
2. Cross-Debate Loop
Round 1 (1/2)	RN2	[i]:[F1]/[SC1]/[ST1]/[RE1]/[RN1]	[i]:[RN1]/[SC2]/[ST2]/[RE2]/[RN2]
Round 1 (2/2)	RN3	[i]:[RN1]/[SC2]/[ST2]/[RE2]/[RN2]	[i]:[RN2]/[SC3]/[ST3]/[RE3]/[RN3]
Round 2 (1/2)	RN4	[i]:[RN2]/[SC3]/[ST3]/[RE3]/[RN3]	[i]:[RN3]/[SC4]/[ST4]/[RE4]/[RN4]
Round 2 (2/2)	RN5	[i]:[RN3]/[SC4]/[ST4]/[RE4]/[RN4]	[i]:[RN4]/[SC5]/[ST5]/[RE5]/[RN5]
Round 3 (1/2)	RN6	[i]:[RN4]/[SC5]/[ST5]/[RE5]/[RN5]	[i]:[RN5]/[SC6]/[ST6]/[RE6]/[RN6]
Round 3 (2/2)	RN7	[i]:[RN5]/[SC6]/[ST6]/[RE6]/[RN6]	[i]:[RN6]/[SC7]/[ST7]/[RE7]/[RN7]
3. Consensus Building
Round 1	F2	[i]:[RNn-1]/[SCn]/[STn]/[REn]/[RNn]	[i]:[F2]/[ST8]

Note: See Figures 3 and 5 for further information.

Table 5. Relevant experiment dates with news count.

Date	Data Start Date	HIPES Horizon	Data End and Experiment Start Date	Forecast Horizon	RV Date and Experiment End Date	News Count
Test (T + 7)
1	19 October 2025	30 (days)	18 November 2025	7 (days)	25 November 2025	23
2	20 October 2025		19 November 2025		26 November 2025	21
3	21 October 2025		20 November 2025		27 November 2025	22
4	22 October 2025		21 November 2025		28 November 2025	22
5	23 October 2025		22 November 2025		29 November 2025	20
6	24 October 2025		23 November 2025		30 November 2025	20
7	25 October 2025		24 November 2025		1 December 2025	20
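
The date columns in Table 5 follow from two fixed offsets: a 30-day HIPES look-back window and a 7-day forecast horizon. As a minimal sketch (the function name `window` and the constants are illustrative, not from the paper), the seven rolling windows can be regenerated from the first data start date:

```python
from datetime import date, timedelta

HIPES_HORIZON = timedelta(days=30)    # look-back window length in Table 5
FORECAST_HORIZON = timedelta(days=7)  # T + 7 forecast horizon in Table 5


def window(data_start: date) -> tuple[date, date]:
    """Return (experiment start date, realized-value date) for one window."""
    experiment_start = data_start + HIPES_HORIZON
    rv_date = experiment_start + FORECAST_HORIZON
    return experiment_start, rv_date


first_start = date(2025, 10, 19)
schedule = [window(first_start + timedelta(days=i)) for i in range(7)]
for i, (start, end) in enumerate(schedule, 1):
    print(i, start.isoformat(), end.isoformat())
```

Under these offsets the first window runs from 18 November 2025 to 25 November 2025, and the seventh from 24 November 2025 to 1 December 2025.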

Table 6. Seven-day dynamic of (F2) debate disagreements.

		(F1)	(F2)
Step		Direct Forecast	Cross-Validation		Cross-Debate						Consensus Building
	Round	Round 1	Round 1		Round 1		Round 2		Round 3		Round 1
Day		Round 1	#A	#D	#A	#D	#A	#D	#A	#D	Round 1
Day 1		75	1	74	67	7	3	4	4	0	74
Day 2		75	1	74	66	8	5	3	3	0	74
Day 3		75	1	74	67	7	1	6	6	0	74
Day 4		75	1	74	59	15	9	6	6	0	74
Day 5		75	1	74	1	73	0	73	73	0	74
Day 6		75	2	73	54	19	3	16	16	0	73
Day 7		75	4	71	56	15	15	0	0	0	71
Total		525	11	514	514		144		108		514

Note: #A = agree count; #D = disagree count.
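
Table 6 reads as a funnel: items the Opponent rejects at cross-validation enter debate round 1, and the disagreements left after each round are the entrants of the next. The short Python check below, a sketch that simply transcribes the table (the variable names are illustrative), confirms that the rows are internally consistent and reproduces the Total row.

```python
ITEMS_PER_DAY = 75

# Per day: (CV agree, CV disagree, [(agree, disagree) for rounds 1..3], consensus),
# transcribed from Table 6.
days = [
    (1, 74, [(67, 7), (3, 4), (4, 0)], 74),
    (1, 74, [(66, 8), (5, 3), (3, 0)], 74),
    (1, 74, [(67, 7), (1, 6), (6, 0)], 74),
    (1, 74, [(59, 15), (9, 6), (6, 0)], 74),
    (1, 74, [(1, 73), (0, 73), (73, 0)], 74),
    (2, 73, [(54, 19), (3, 16), (16, 0)], 73),
    (4, 71, [(56, 15), (15, 0), (0, 0)], 71),
]

entrants_total = [0, 0, 0]  # items entering debate rounds 1, 2 and 3
for cv_agree, cv_disagree, rounds, consensus in days:
    assert cv_agree + cv_disagree == ITEMS_PER_DAY
    entering = cv_disagree
    for k, (agree, disagree) in enumerate(rounds):
        assert agree + disagree == entering  # every entrant is either settled or carried over
        entrants_total[k] += entering
        entering = disagree
    assert entering == 0             # nothing is left unresolved after round 3
    assert consensus == cv_disagree  # every debated item reaches consensus building

print(entrants_total)  # [514, 144, 108]
```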

Table 7. Compositional structure of (F2) forecasts.

Asset Category	Day 1	Day 2	Day 3	Day 4	Day 5	Day 6	Day 7	Total
Panel A: Agreed, LLM(1)
Commodities	0	0	0	0	0	1	2	3
Indices	0	0	0	0	0	0	1	1
Exchange Rates	0	0	0	0	0	0	0	0
Crypto	1	1	1	1	1	1	1	7
10Y Yield	0	0	0	0	0	0	0	0
Total:	1	1	1	1	1	2	4	11
Panel B: Debate, LLM(1)
Commodities	0	1	0	0	0	0	0	1
Indices	0	0	0	0	0	0	0	0
Exchange Rates	0	0	0	0	0	0	0	0
Crypto	0	0	0	0	0	0	0	0
10Y Yield	0	0	0	0	0	1	0	1
Total:	0	1	0	0	0	1	0	2
Panel C: Debate, LLM(2)
Commodities	15	14	15	15	15	14	13	101
Indices	15	15	15	15	15	15	14	104
Exchange Rates	15	15	15	15	15	15	15	105
Crypto	14	14	14	14	14	14	14	98
10Y Yield	15	15	15	15	15	14	15	104
Total:	74	73	74	74	74	72	71	512

Table 8. FEI comparison (Daily Average and Overall Average) of F1 and F2.

	Test (T + 7)
Forecast Horizon	F1’s FEI Value	Comp. Result	F2’s FEI Value
Panel A: Comparing Daily Average FEIs for F1 and F2
Day 1	0.783	<	0.814 *
Day 2	0.795	<	0.816 *
Day 3	0.780	<	0.813 *
Day 4	0.776	<	0.810 *
Day 5	0.774	<	0.791 *
Day 6	0.771	<	0.809 *
Day 7	0.763	<	0.803 *
Panel B: Comparing Overall Average FEIs for F1 and F2
Overall Average	0.777	<	0.808 *

Note: 1. Each FEI value spans a range from 0 to 1, with a higher FEI value reflecting greater forecasting accuracy; 2. * denotes the winner with a higher FEI value.

Table 9. Comparison of Daily Average and Overall Average MAE across F1 and F2.

	Test (T + 7)
Forecast Horizon	F1’s MAE Value	Comp. Result	F2’s MAE Value
Panel A: Comparing Daily Average MAEs for F1 and F2
Day 1	1393	>	1301 *
Day 2	1056 *	<	1189
Day 3	1440	>	1296 *
Day 4	1545	>	1303 *
Day 5	1587	>	1458 *
Day 6	1622	>	1359 *
Day 7	1703	>	1478 *
Panel B: Comparing Overall Average MAEs for F1 and F2
Overall Average	1478	>	1341 *

Note: MAE = Mean Absolute Error; * denotes the lower value.

Table 10. Comparison of Daily Average and Overall Average MSE across F1 and F2.

	Test (T + 7)
Forecast Horizon	F1’s MSE Value	Comp. Result	F2’s MSE Value
Panel A: Comparing Daily Average MSEs for F1 and F2
Day 1	32.9 M	>	18.9 M *
Day 2	11.9 M *	<	14.4 M
Day 3	35.8 M	>	18.0 M *
Day 4	45.6 M	>	18.8 M *
Day 5	49.9 M	>	27.2 M *
Day 6	53.6 M	>	22.1 M *
Day 7	64.9 M	>	31.2 M *
Panel B: Comparing Overall Average MSEs for F1 and F2
Overall Average	42.1 M	>	21.5 M *

Note: MSE = Mean Squared Error; * denotes the lower value.
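
Because every day covers the same 75 indicators, the Overall Average rows in Tables 9 and 10 are the plain means of the seven daily values. The sketch below recomputes them from the tabulated figures and, as an added illustration not reported in the tables, derives the relative error reduction of F2 over F1 from the rounded overall values.

```python
from statistics import mean

# Daily averages transcribed from Tables 9 and 10 (MSE in millions).
mae_f1 = [1393, 1056, 1440, 1545, 1587, 1622, 1703]
mae_f2 = [1301, 1189, 1296, 1303, 1458, 1359, 1478]
mse_f1 = [32.9, 11.9, 35.8, 45.6, 49.9, 53.6, 64.9]
mse_f2 = [18.9, 14.4, 18.0, 18.8, 27.2, 22.1, 31.2]

overall_mae = (round(mean(mae_f1)), round(mean(mae_f2)))
overall_mse = (round(mean(mse_f1), 1), round(mean(mse_f2), 1))
print(overall_mae)  # (1478, 1341)
print(overall_mse)  # (42.1, 21.5)


def reduction(f1: float, f2: float) -> float:
    """Relative error reduction of F2 with respect to F1."""
    return (f1 - f2) / f1


mae_cut = reduction(*overall_mae)
mse_cut = reduction(*overall_mse)
print(f"MAE reduction: {mae_cut:.1%}, MSE reduction: {mse_cut:.1%}")
```

The MSE falls far more than the MAE, which is what one would expect if the debate mainly trims a small number of very large errors.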

Table 11. Win statistics comparison (on Overall Average FEIs) of F1 and F2 across four quartiles.

	Test (T + 7)
Asset Category	F1 Win Count	Comparison Result	F2 Win Count
Panel A: Overall Win Distribution across All 4 Quartiles (0 < FEI ≦ 1)
Commodities	5/15	<	10/15 *
Indices	11/15 *	>	4/15
Exchange Rates	5/15	<	10/15 *
Crypto	1/15	<	13/15 *
10Y Yield	6/15	<	9/15 *
Panel A Total	28	<	46 *
Panel B: Top Quartile (75~100%, i.e., 0.75 ≦ FEI < 1.00)
Commodities	5/15	<	7/15 *
Indices	11/15 *	>	3/15
Exchange Rates	5/15	<	9/15 *
Crypto	1/15	<	3/15 *
10Y Yield	4/15	<	9/15 *
Panel B Total	26	<	31 *
Panel C: Second Quartile (50~75%, i.e., 0.50 ≦ FEI < 0.75)
Commodities	0/15	<	3/15 *
Indices	0/15	<	1/15 *
Exchange Rates	0/15	<	1/15 *
Crypto	0/15	<	3/15 *
10Y Yield	2/15 *	>	0/15
Panel C Total	2	<	8 *
Panel D: Third Quartile (25~50%, i.e., 0.25 ≦ FEI < 0.50)
Commodities	0/15	=	0/15
Indices	0/15	=	0/15
Exchange Rates	0/15	=	0/15
Crypto	0/15	<	5/15 *
10Y Yield	0/15	=	0/15
Panel D Total	0	<	5 *
Panel E: Bottom Quartile (0~25%, i.e., 0.00 ≦ FEI < 0.25)
Commodities	0/15	=	0/15
Indices	0/15	=	0/15
Exchange Rates	0/15	=	0/15
Crypto	0/15	<	2/15 *
10Y Yield	0/15	=	0/15
Panel E Total	0	<	2 *

Note: 1. Forecast Estimate Index (FEI) scores range from 0 to 1, with higher FEI scores corresponding to greater forecasting accuracy; 2. * marks the model (F1 or F2) with the higher number of wins in that row.

Table 12. Correction analysis: showing how (F2) debate enhances (F1) forecasts.

	(F1)	(F2)					-	-	-
Agreed or Debate	-	Agreed	Debate				-	-	-
Value Adopted From	LLM(1)	LLM(1)	LLM(1)		LLM(2)		-	-	-
Debate Validity	-	-	Valid	Invalid	Valid	Invalid	Total Valid	Total Invalid	Total Valid Ratio
Commodities	105	3	1	0	70	31	71	31	69.6%
Indices	105	1	0	0	29	75	29	75	27.9%
Exch. Rates	105	0	0	0	68	37	68	37	64.8%
Crypto	105	7	0	0	89	9	89	9	90.8%
10Y Yield	105	0	1	0	67	37	68	37	64.8%
Total:	525	11	2	0	323	189	325	189
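
The Total Valid Ratio column in Table 12 is the share of debated items whose adopted value counts as valid, i.e., valid / (valid + invalid). The sketch below recomputes it per asset category from the tabulated counts. The pooled ratio over all 514 debated items is not reported in the table; it is derived here from the Total row purely as an illustration.

```python
# (total valid, total invalid) per asset category, transcribed from Table 12.
rows = {
    "Commodities": (71, 31),
    "Indices": (29, 75),
    "Exch. Rates": (68, 37),
    "Crypto": (89, 9),
    "10Y Yield": (68, 37),
}


def valid_ratio(valid: int, invalid: int) -> float:
    return valid / (valid + invalid)


for name, (valid, invalid) in rows.items():
    print(f"{name}: {valid_ratio(valid, invalid):.1%}")

total_valid = sum(v for v, _ in rows.values())
total_invalid = sum(i for _, i in rows.values())
pooled = valid_ratio(total_valid, total_invalid)
print(f"Pooled: {total_valid}/{total_valid + total_invalid} = {pooled:.1%}")
```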

Table 13. Daily item-level performance differential (Diff.) by asset category and FEI intensity.

	Test (T + 7)
Asset Category	F1 Win Count	Comparison Result	F2 Win Count
Panel A: Top Quartile (75~100%, i.e., |Diff.| ≧ 8.60%)
Commodities	0/15	<	2/15 *
Indices	1/15 *	>	0/15
Exchange Rates	0/15	=	0/15
Crypto	1/15	<	11/15 *
10Y Yield	1/15	<	3/15 *
Panel A Total	3	<	16 *
Panel B: Second Quartile (50~75%, i.e., 8.60% > |Diff.| ≧ 3.90%)
Commodities	2/15	=	2/15
Indices	4/15 *	>	0/15
Exchange Rates	1/15 *	>	0/15
Crypto	0/15	<	2/15 *
10Y Yield	3/15	<	5/15 *
Panel B Total	10 *	>	9
Panel C: Third Quartile (25~50%, i.e., 3.90% > |Diff.| ≧ 1.86%)
Commodities	1/15	<	5/15 *
Indices	3/15 *	>	1/15
Exchange Rates	2/15	<	4/15 *
Crypto	0/15	=	0/15
10Y Yield	1/15	=	1/15
Panel C Total	7	<	11 *
Panel D: Bottom Quartile (0~25%, i.e., 1.86% > |Diff.| > 0%)
Commodities	2/15 *	>	1/15
Indices	3/15	=	3/15
Exchange Rates	2/15	<	6/15 *
Crypto	0/15	=	0/15
10Y Yield	1/15 *	>	0/15
Panel D Total	8	<	10 *

Note: 1. Forecast Estimate Index (FEI) scores range from 0 to 1, with higher FEI scores corresponding to greater forecasting accuracy. 2. * denotes the higher win count in the specific asset category.

Table 14. Paired sample correlation coefficients for F1 and F2.

		Test (T + 7)
Forecast Horizon	N	Correlation (r)	p
Day 1 (T + 1)	75	0.908	0.000 ***
Day 2 (T + 2)	75	0.911	0.000 ***
Day 3 (T + 3)	75	0.888	0.000 ***
Day 4 (T + 4)	75	0.863	0.000 ***
Day 5 (T + 5)	75	0.890	0.000 ***
Day 6 (T + 6)	75	0.843	0.000 ***
Day 7 (T + 7)	75	0.864	0.000 ***

Note: F1 = Direct Forecast; F2 = Debate Forecast; *** p < 0.001.

Table 15. Paired sample t-test of FEI: results from models F1 and F2.

	F1		F2		Paired Differences
Forecast Horizon	Mean	SD	Mean	SD	t-Value	p
Test (T + 7)
Day 1 (T + 1)	0.783	0.245	0.814	0.204	−2.543 *	0.013 *
Day 2 (T + 2)	0.795	0.230	0.816	0.197	−1.879	0.064
Day 3 (T + 3)	0.780	0.246	0.813	0.202	−2.502 *	0.015 *
Day 4 (T + 4)	0.776	0.249	0.810	0.205	−2.344 *	0.022 *
Day 5 (T + 5)	0.774	0.252	0.791	0.234	−1.274	0.207
Day 6 (T + 6)	0.771	0.254	0.809	0.209	−2.405 *	0.019 *
Day 7 (T + 7)	0.763	0.260	0.803	0.216	−2.706 *	0.008 *

Note: SD = standard deviation; the t-value is derived from paired sample differences (F1 − F2); * denotes statistical significance at p < 0.05.
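
The paired t-values in Table 15 can be approximately recovered from the summary statistics alone, because the standard deviation of the paired differences follows from the two SDs and the correlation in Table 14: sd_d = sqrt(s1² + s2² − 2·r·s1·s2), and t = (mean1 − mean2) / (sd_d / sqrt(N)). The sketch below applies this identity to the rounded table values; the function name `paired_t` is illustrative, and small deviations from the reported t-values are expected because the inputs are rounded to three decimals.

```python
from math import sqrt

N = 75  # number of FMIs per day (Table 14)

# (mean F1, SD F1, mean F2, SD F2, correlation r, reported t) per day,
# transcribed from Tables 14 and 15.
days = [
    (0.783, 0.245, 0.814, 0.204, 0.908, -2.543),
    (0.795, 0.230, 0.816, 0.197, 0.911, -1.879),
    (0.780, 0.246, 0.813, 0.202, 0.888, -2.502),
    (0.776, 0.249, 0.810, 0.205, 0.863, -2.344),
    (0.774, 0.252, 0.791, 0.234, 0.890, -1.274),
    (0.771, 0.254, 0.809, 0.209, 0.843, -2.405),
    (0.763, 0.260, 0.803, 0.216, 0.864, -2.706),
]


def paired_t(m1: float, s1: float, m2: float, s2: float, r: float, n: int) -> float:
    """Paired t statistic for (sample 1 - sample 2) from summary statistics."""
    sd_diff = sqrt(s1 ** 2 + s2 ** 2 - 2.0 * r * s1 * s2)
    return (m1 - m2) / (sd_diff / sqrt(n))


recomputed = [paired_t(m1, s1, m2, s2, r, N) for m1, s1, m2, s2, r, _ in days]
for day, (t, row) in enumerate(zip(recomputed, days), 1):
    print(f"Day {day}: recomputed t = {t:.3f}, reported t = {row[5]:.3f}")
```

The high correlation between F1 and F2 is what makes the test sensitive: it shrinks the SD of the differences, so a mean FEI gap of only 0.02 to 0.04 can still reach significance.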

Table 16. Closing price volatility intensity distribution across asset categories.

STDEV.S	≧1.33%	1.33~0.68%	0.68~0.38%	0.38~0%
Asset Category	Top (100~75%)	Second (75~50%)	Third (50~25%)	Bottom (25~0%)
Commodities	2/15	9/15 *	3/15	1/15
Indices	0/15	4/15	9/15 *	2/15
Exchange Rates	0/15	0/15	3/15	12/15 *
Crypto	14/15 *	0/15	0/15	1/15
10Y Yield	3/15	6/15	3/15	3/15
Total	19	19	18	19

Note: * Denotes the asset category with the highest occurrence count within each volatility intensity tier.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

