Exploring Cross-Debate Between LLMs to Improve the Forecasting of Financial Market Indicators
Abstract
1. Introduction
- RQ1: Can the “F2 Consensus Value” generated through the LLM Cross-Debate Mechanism—compared to the “F1 Forecast Value” derived from a single LLM reading of HIPES text—noticeably enhance accuracy and stability in forecasting FMIs?
- RQ2: Do models F1 and F2 possess distinct advantages in forecasting FMIs across various asset categories (such as Stock Market Indices, Foreign Exchange Rates, Commodities, Cryptocurrencies, 10Y Government Bond Yields)?
2. Research Background
2.1. LLMs with Debate as Research Tools—LLM Consensus-Based Market Forecasting
2.2. High-Impact Political and Economic Statements, Their Market Impacts, and Data Structure for Executing LLM Experiments
2.3. Evaluation of Forecast Frameworks: Model F1 and Model F2
2.4. Recap of Research Method Elaboration
- Stage 1, purification of text input: Stage 1 filters out secondary media sources and prioritizes primary-source transcripts of HIPESs from the 30-day period preceding the forecast.
- Stage 2, application of HIPESs: This stage feeds HIPES transcripts directly into the LLM to generate FMI forecast values (F1).
- Stage 3, debate-driven variable injection: This stage incorporates Optimization Variables generated through cross-debate processes into LLMs. By embedding the HIPES extraction and the CDCBOV framework, consensus-based forecasting values are produced (F2).
- Stage 4, dynamic forecasting and (T + 7) testing: Stage 4 employs a rolling window mechanism to capture medium-term market dynamics, which is used to compare F1 and F2 under the (T + 7) time window.
- Stage 5, analysis of forecasting results: This stage quantitatively assesses forecasting results against accuracy metrics, followed by statistical significance analyses to validate meaningful enhancements derived from the “LLM Cross-Debate Consensus Based Forecasting” initiative.
3. Research Method
3.1. Research Framework
3.1.1. Five Categories of Assets, with Each Category Comprising 15 FMIs
3.1.2. The Dual-Path Comparative Experiment
- Input Data Step: For each forecasting cycle, the system synthesizes a 30-day news corpus (representing HIPESs) into an Integrated Context, which is then fed into the LLM. This ensures that both models (F1 and F2) operate under identical input conditions.
- Market (i.e., Financial Market) Forecast Step: F1 forecasts the closing prices of the 75 FMIs and depends exclusively on the Integrated Context.
- Debate Forecast Step: By leveraging a Dual-Agent LLM Debate process that progresses sequentially through Triple-C (Cross-Validation, Cross-Debate, and Consensus Building) execution stages, F2 dynamically generates the Optimization Variable to ensure logical robustness and ultimately produce a Consensus Value.
- Realized Value (RV) Step: On the target day, the realized closing prices of the 75 FMIs are collected and designated as the RV.
- Compare Result Step: Forecast accuracy and performance are assessed via FEI series, Diff. metrics, paired t-tests, and volatility measures, creating a benchmarking framework rooted in 75 FMIs. This evaluation assesses whether F2 demonstrates statistically significant superiority over F1 and identifies broader implications to enhance the research’s practical utility.
3.1.3. Independent Inference Processes Guided by 9 Prompts
3.1.4. Market Forecast Framework
3.1.5. Debate Process Framework
- Proponent (LLM1/Gemini 3 Pro): After F1 has been executed, the Proponent becomes responsible for “defending and correcting.” Its decision process begins by reviewing the Opponent’s status tag [ST]. If marked “Disagree,” the Proponent analyzes the confidence score [SC] and rebuttal [RE], then reads the revised value [RN] to anchor the reasonable numerical range. Based on causal analysis, it adjusts the value if the Opponent’s objection is valid or reinforces its original forecast if the critique is deemed invalid.
- Opponent (LLM2/ChatGPT 5.2): Tasked with “auditing and challenging,” it conducts a strict Logic Audit on the Proponent’s input (initial [F1] or revised [RN]). Using the Integrated Context and internal theoretical cognition, it calculates [SC] to define [ST]: if [SC] < 0.7, it marks the status as “Disagree,” providing [RE] and generating [RN] to challenge the Proponent’s argument; if [SC] meets the threshold, it marks it as “Consensus” to terminate the debate.
- Disagree Mechanism (Red Path): If any of the 75 FMIs has a score below 0.7 ([SC] < 0.7), the system triggers the red path, forcing a new round of debate for unmet indicators to generate [RN] correction values, or forcibly entering the convergence phase if the round limit is reached.
- Agree Mechanism (Green Path): Only when all 75 indicators meet [SC] ≥ 0.7 and are marked as [ST] = Agreed will the system trigger the green path, initiating Global Early Termination to skip remaining debate rounds and directly route the batch of Forecast Values to the Consensus-Building stage.
- The Cross-Validation Stage: Initiated by Chat Box 2, LLM(2) receives the initial Forecast F1 and Integrated Context, executing a strict Logic Consistency Check. The system reviews each of the 75 FMIs individually. If logical inconsistencies or insufficient confidence ([SC] < 0.7) are detected, it marks the case as Disagree, generates the first correction value RN1, and forces the debate to commence. Only when all 75 FMIs meet [SC] ≥ 0.7 will the system trigger Global Early Termination, skipping remaining rounds.
- The Cross-Debate Stage: If the previous stage triggers Disagree, the system initiates a maximum of three rounds of iterative debate. LLM(1) (Proponent) and LLM(2) (Opponent) engage in alternating attacks and defenses (Chat Box 3~Chat Box 8). The Dual-Input Mechanism ensures each round employs iterative input, compelling models to simultaneously read the Integrated Context and the Opponent’s previous output (Input Reference), thereby ensuring the debate focuses on numerical discrepancies. The Dual-Termination Logic follows the Global Indicator Status: if the Opponent persists in doubt, the debate is forced to conclude at the third round; only when all 75 FMIs meet [SC] ≥ 0.7 will the system trigger Global Early Termination, skipping remaining rounds.
- The Consensus-Building Stage: This stage marks the conclusion of the debate process, with LLM(1) in Chat Box 9 executing the final decision via the Conditional Input Strategy. The system first dynamically locks the input source (either Proponent Context or Opponent Context) based on the debate’s termination status. Subsequently, this study adopts the Dynamic Convergence Mechanism: for cases reaching Agreed consensus, the system directly adopts the validated numerical value; for Disagree cases with unresolved disputes, it calculates the arithmetic average of RNs to reconcile the final positions of both sides. Finally, the system removes intermediate parameters like RE and SC, packaging the converged value as F2 with a consensus status (State 8), completing the full logical loop from prediction to decision. The finalized Consensus Values (F2) are output after optimization.
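The gating and termination rules described in this subsection can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the two LLM calls are replaced by hypothetical callables (`opponent_review`, `proponent_reply`), and the toy agents and their numbers are invented for the demonstration. Only the 0.7 threshold, the three-round cap, and the rule that every indicator must pass before early termination come from the text.

```python
from typing import Callable, Dict, Tuple

THRESHOLD = 0.7   # confidence gate stated in the text
MAX_ROUNDS = 3    # debate round cap stated in the text

Values = Dict[int, float]                 # indicator id -> forecast value
Review = Dict[int, Tuple[float, float]]   # indicator id -> (SC, RN)


def all_agreed(review: Review) -> bool:
    """Green path: every indicator must reach SC >= 0.7."""
    return all(sc >= THRESHOLD for sc, _ in review.values())


def run_debate(
    f1: Values,
    opponent_review: Callable[[Values], Review],
    proponent_reply: Callable[[Values, Review], Values],
) -> Tuple[Values, Review, int]:
    """Return (last Proponent values, last Opponent review, rounds used)."""
    values = dict(f1)
    review = opponent_review(values)              # Cross-Validation
    rounds = 0
    while not all_agreed(review) and rounds < MAX_ROUNDS:
        values = proponent_reply(values, review)  # Proponent defends or corrects
        review = opponent_review(values)          # Opponent audits again
        rounds += 1
    return values, review, rounds


# Hypothetical stand-ins for the two LLMs, used only to exercise the loop.
OPPONENT_VIEW = {1: 100.0, 2: 50.0}


def toy_opponent(values: Values) -> Review:
    review = {}
    for i, v in values.items():
        gap = abs(v - OPPONENT_VIEW[i]) / OPPONENT_VIEW[i]
        sc = 0.9 if gap <= 0.06 else 0.6
        rn = v if sc >= THRESHOLD else OPPONENT_VIEW[i]
        review[i] = (sc, rn)
    return review


def toy_proponent(values: Values, review: Review) -> Values:
    # Moves halfway toward the Opponent's revised number each round.
    return {i: (values[i] + review[i][1]) / 2 for i in values}


values, review, rounds = run_debate({1: 120.0, 2: 50.0}, toy_opponent, toy_proponent)
print(values, rounds)
```

With these toy agents the loop stops early once both indicators pass the gate; an Opponent that never agrees would instead exhaust the three-round cap and hand the unresolved values to the Consensus-Building stage.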
3.1.6. Specifications of Prompts (Prompt 1~Prompt 9)
- Prompt (1) for Direct Forecast (F1): Panel A’s Prompt (1) specializes in executing Semantic-to-Forecast Mapping, corresponding to the Market Forecast Stage in Figure 3, which converts unstructured market information into an initial forecast (F1).
- Prompts (2)~(9) for Debate Forecast (F2): Panel B’s Prompts (2)~(9) constitute the Adversarial Debate Calibrator, aligning with the three-stage process in Figure 4: Prompt (2) initiates the Cross-Validation stage for risk detection, Prompts (3)~(8) drive the Cross-Debate stage through iterative counterarguments, and Prompt (9) executes the Consensus-Building stage for final convergence. This module refines the initial forecast (F1) into a Robust Consensus Value (F2) via confidence score (SC) and iterative mechanisms.
3.1.7. Input–Output Data Flow Matrix
3.1.8. Output Formats and Input References
- [i]: Financial Market Indicator (FMI) ID (1~75).
- [F1]: Direct Forecast Value = Forecast Values (F1).
- [SC]: Confidence score (0.0~1.0), quantifying the level of agreement with the Input Reference.
- [ST]: Status (i.e., “Disagree” or “Consensus”).
- [RE]: Reasoning. The qualitative text explaining the causal logic or theoretical constraints behind the revision and supporting the debate/rebuff.
- [RN]: Revised Number. The new Forecast Value proposed in the current round ([RN(t)]).
- [F2]: Debate Forecast Value = Consensus Values (F2).
3.1.9. Forecast Horizons and Rolling Validation Framework
- Test (T + 7) (Forecast horizon spanning 7 days): This test simulates short-term forecasting scenarios by operating on a baseline date (T) to forecast market values at T + 7 (e.g., 11/25 from 11/18). This test also evaluates LLM sensitivity to new market inputs and compares the forecasting accuracy between F1 and F2 under short-term conditions.
- Method for Rolling Validation: A seven-day replication framework mitigates date-specific biases and generates 14 forecast datasets to support effective quantitative evaluation.
3.2. Data Collection Instruments and Techniques
3.2.1. Rolling Window-Based Data Aggregation
- The Process of Single News Unit: Core textual data are sourced from HIPESs made by international leaders, systematically extracted from Rev.com’s verbatim transcripts, and then standardized into .txt files containing (News Date), (News Title), and (News Transcripts). These serve as foundational units for aggregation.
- The Process of Integrated Context: This process aggregates 30 days of “Single News Unit texts” (T-30 to T-1) to capture cumulative semantic context, which is then merged into a single long-text file, forming the complete input basis for LLM inference.
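The aggregation step above can be illustrated with a short sketch. It assumes each Single News Unit is held in memory as a record with a date, title, and transcript; the `NewsUnit` class, the `build_integrated_context` function, and the sample dates are hypothetical names and values introduced for illustration, while the T-30 to T-1 window is taken from the text.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable


@dataclass
class NewsUnit:
    news_date: date
    title: str
    transcript: str


def build_integrated_context(units: Iterable[NewsUnit], baseline: date,
                             window_days: int = 30) -> str:
    """Merge all units dated T-30 .. T-1 into one long text, oldest first."""
    start = baseline - timedelta(days=window_days)   # T-30
    end = baseline - timedelta(days=1)               # T-1
    selected = sorted(
        (u for u in units if start <= u.news_date <= end),
        key=lambda u: u.news_date,
    )
    return "\n\n".join(
        f"{u.news_date.isoformat()}\n{u.title}\n{u.transcript}" for u in selected
    )


# Illustrative data only; the year and titles are placeholders.
T = date(2025, 11, 18)
units = [
    NewsUnit(T - timedelta(days=31), "Too old", "..."),
    NewsUnit(T - timedelta(days=30), "Window start", "..."),
    NewsUnit(T - timedelta(days=1), "Window end", "..."),
    NewsUnit(T, "Baseline day", "..."),
]
context = build_integrated_context(units, T)
print(context)
```

Units dated before T-30 or on the baseline date itself fall outside the window and are left out of the merged text.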
3.2.2. Mechanism of Market Forecast (F1)
3.2.3. Mechanism of Debate Process Forecast (F2)
- The Cross-Validation Stage: This stage, as illustrated in Figure 9 and guided by Table 3, initiates a Dual-Input Mechanism led by LLM(2). It injects the original Integrated Context and initial F1 Forecast Value from Market Forecast (1) to simulate an auditor’s review of existing predictions. LLM(2) validates 75 FMIs without re-predicting, generating a structured Opponent Context (1): [i]:[F1]/[SC1]/[ST1]/[RE1]/[RN1]. For example, BTC’s [125,000]/[0.60]/[Disagree]/[Reason: xxxx]/[118,000] shows that when SC1 (0.60) falls below the 0.7 threshold, the system flags [Disagree], proposes a revised value [RN1] (118,000), and triggers the red debate path (shown in Figure 4 and Figure 5) via the Logic Gate.
- The Cross-Debate Stage:
  - LLM1: Proponent (Google Gemini 3 Pro). This stage operates as a closed “Iterative Dialectic Loop” (see Figure 10 and Figure 11 and Table 3), where LLM(1) acts as the Proponent and LLM(2) acts as the Opponent, enforcing a Dual-Input Mechanism to reference static Integrated Context and dynamic Input Reference (created from the counterparty’s prior arguments). This ensures consistent factual grounding, driving predictive values toward consensus via logical iteration. LLM(1) conducts causal analysis to refine [RN] and [RE], encapsulating outputs as Proponent Context (1~3) for subsequent review.
  - LLM2: Opponent (OpenAI ChatGPT 5.2 Thinking). The Opponent’s scenario (Figure 11) represents the review process in even-numbered prompts (Table 3: P2, P4, P6, P8). Its input combines Integrated Context and Proponent Context (1~3), transforming LLM(2) into a rigorous reviewer. Prompts (P4, P6, and P8) enforce “Theoretical Constraints” to validate defense reasoning. Core logic updates SC as a Logic Gate: if SC < 0.7, a new [RN] is generated and archived as Opponent Context (2~4) to trigger the next debate round; if SC ≥ 0.7, consensus is achieved, leading to the final Consensus-Building phase.
- The Consensus-Building Stage: This stage terminates iterative debate and establishes consensus. When confidence scores (SC ≥ 0.7) are met, the system activates convergence via a Conditional Input Strategy (Table 3), selecting the latest revised file (Latest RN) from the Opponent/Proponent Context (2~4/1~3). LLM(1) executes Threshold Verification and Convergence (Prompt 9) to finalize decision values, outputting [i]:[F2]/[ST8] (Final Consensus Status) as the Market Forecast (F2) result and completing the prediction-to-decision closed-loop.
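To make the structured Opponent Context format concrete, the following sketch parses one record of the form [i]:[F1]/[SC1]/[ST1]/[RE1]/[RN1] and applies the 0.7 gate. It is an illustrative assumption about how such a line could be handled, not the authors' tooling: the `OpponentRecord` class and `parse_record` function are hypothetical, and the sample line is the BTC example from the text with its placeholder reasoning left as-is.

```python
import re
from dataclasses import dataclass

THRESHOLD = 0.7  # confidence gate stated in the text


@dataclass
class OpponentRecord:
    indicator: str   # [i]
    value: float     # [F1] or the value under review
    sc: float        # [SC]
    st: str          # [ST]
    reasoning: str   # [RE]
    rn: float        # [RN]


def _to_number(text: str) -> float:
    """Convert '125,000' style strings to a float."""
    return float(text.replace(",", ""))


def parse_record(line: str) -> OpponentRecord:
    """Extract the six bracketed fields from one Opponent Context line."""
    fields = re.findall(r"\[([^\]]*)\]", line)
    if len(fields) != 6:
        raise ValueError(f"expected 6 bracketed fields, got {len(fields)}")
    indicator, value, sc, st, reasoning, rn = fields
    return OpponentRecord(indicator, _to_number(value), float(sc), st,
                          reasoning, _to_number(rn))


record = parse_record("[BTC]:[125,000]/[0.60]/[Disagree]/[Reason: xxxx]/[118,000]")
needs_debate = record.sc < THRESHOLD   # red path when SC falls below 0.7
print(record, needs_debate)
```

Keeping the indicator as a string accommodates both the numeric IDs (1~75) and ticker-style labels such as the BTC example.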
3.3. Analysis Methods
3.3.1. Mathematical Framework of the Forecast Estimate Index (FEI)

- Daily Item-Level FEI:
- Daily Average FEI:
- Overall Average FEI:
3.3.2. Item-Level Difference (Diff.) Framework
- The Design of the Item-Level Difference Framework
- Quartile Analysis for Strength Distribution
3.3.3. Paired Sample Correlation and Paired Sample t-Test
3.3.4. Market Characteristics and Volatility Stratification
4. Experimental Results
4.1. Sample Description and Research Rigor
4.1.1. Time Series Used for Experiments
4.1.2. Input Data for Experiments
4.1.3. Rules and Sources for Final Realized Value Acquisition
- Indicators and Corresponding Values for Benchmarking: The 75 FMIs are spread evenly across five asset categories, with 15 FMIs in each. Their values were obtained from the Trading Economics platform (tradingeconomics.com).
- Rules of Closing Price: Daily closing prices are used as the baseline. On non-trading days (weekends, holidays), a carry-forward rule retains the previous valid trading day’s closing price to preserve time series integrity.
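The carry-forward rule can be sketched in a few lines. The function name `carry_forward`, the dictionary layout, and the sample prices and dates are assumptions made for illustration; only the rule itself (a non-trading day keeps the previous valid close) comes from the text.

```python
from datetime import date, timedelta
from typing import Dict


def carry_forward(closes: Dict[date, float], start: date, end: date) -> Dict[date, float]:
    """Fill every calendar day in [start, end] with the latest available close."""
    filled: Dict[date, float] = {}
    last = None
    day = start
    while day <= end:
        if day in closes:          # trading day: take its own close
            last = closes[day]
        if last is not None:       # non-trading day: reuse the previous close
            filled[day] = last
        day += timedelta(days=1)
    return filled


# Illustrative prices: a Friday and the following Monday, with a weekend between.
closes = {date(2025, 11, 21): 100.0, date(2025, 11, 24): 102.0}
filled = carry_forward(closes, date(2025, 11, 21), date(2025, 11, 24))
print(filled)
```

Days before the first available close are left out rather than guessed, so the filled series never contains a value that was not actually observed.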
4.1.4. Seven-Day Dynamic of (F2) Debate Disagreements
4.1.5. Compositional Structure of (F2) Forecasts
- F1’s Stable Dominance (Agreed, LLM(1)): LLM(1) dominates for certain Crypto indicators (e.g., Days 1~7), suggesting its potential applicability to Crypto. The system may avoid unnecessary debates on low-complexity indicators to save computational resources.
- Dual-LLM Cognitive Complementarity: The Dual-LLM architecture mitigates risks of overfitting/hallucination during volatile periods (on all 7 days). LLM(2) dynamically corrects F1’s biases, enhancing decision robustness in extreme scenarios. This underscores the critical value and benefit provided by the proposed “F2 Debate Forecast” approach.
- Dual-LLM Cognitive Specialization: LLM1 (F1) excels in certain mean-reverting assets (e.g., 10Y Yield and Commodities) via intuitive reasoning. LLM2 (F2) corrects nonlinear market biases across all five categories of assets, balancing trend capture and volatility resilience.
4.2. Measurement Results: Analytical Assessment
4.2.1. FEI Comparison (Daily Average and Overall Average) Between F1 and F2
- Comprehensive Superiority: In Test (T + 7), Panel A shows F2 achieving a higher Daily Average FEI than F1 on all 7 days (Day 1~Day 7). Panel B reveals F2’s Overall Average FEI reaching 0.808 in Test (T + 7) (vs. 0.777 for F1), demonstrating that the Optimization Variable generated by the debate mechanism noticeably enhances the model’s ability to capture market dynamics. Analysis of the 75 FMIs’ Overall Average FEI further shows F2’s values align closer to actual observations (with FEI approaching 1), highlighting the substantive contribution of the LLM Cross-Debate Mechanism to quantitative forecasting.
- Robustness and Stability: As listed in Table 8, F2 maintains FEI above 0.791 on all 7 days (time windows), peaking at 0.816 on Day 2 in Test (T + 7). The consistently superior performance of F2 confirms the LLM Cross-Debate Mechanism’s robustness in enhancing forecasting accuracy across various time windows.
- Alternative Forecasting Accuracy Metrics (MAE and MSE): In the widely used forecasting accuracy metrics MAE and MSE, errors contribute unequally. Larger errors have a more significant effect on MAE than smaller ones, and even more so on MSE. Though FEI evaluations mitigate the uneven contributions that affect MAE/MSE, for comparison purposes this study still calculates MAE/MSE (based on the same data underlying Table 8) to derive Table 9 and Table 10. The results in Table 8, Table 9 and Table 10 consistently demonstrate that F2 outperforms F1 in overall forecasting accuracy.
- Supplementary Weekly Replications in Overall Average FEI Validation: This research is grounded in the central empirical findings from Table 8 (Week 1), augmented by three supplementary weekly assessments to strengthen reliability and counteract the effects of time-dependent variability. The findings reveal that the F2 framework consistently outperforms F1 in Overall Average FEI in all four weekly experiments, with the comparison results as follows: Week 1 (F2’s 0.808 > F1’s 0.777), Week 2 (F2’s 0.807 > F1’s 0.776), Week 3 (F2’s 0.801 > F1’s 0.772), and Week 4 (F2’s 0.796 > F1’s 0.771). These weekly experimental outcomes offer robust evidence showing that the F2 framework provides consistent and substantial predictive superiority, coupled with systematic robustness.
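The remark about uneven error contributions in MAE and MSE can be seen in a small worked example. The functions below use the standard definitions of the two metrics; the forecast and realized values are made-up numbers chosen only to show how one large error weighs on each metric, and they are not data from this study.

```python
from typing import Sequence


def mae(forecasts: Sequence[float], realized: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(f - r) for f, r in zip(forecasts, realized)) / len(realized)


def mse(forecasts: Sequence[float], realized: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((f - r) ** 2 for f, r in zip(forecasts, realized)) / len(realized)


# Illustrative values: three errors of size 1 and one error of size 9.
realized = [10.0, 10.0, 10.0, 10.0]
forecasts = [11.0, 11.0, 11.0, 19.0]

errors = [abs(f - r) for f, r in zip(forecasts, realized)]
share_in_mae = max(errors) / sum(errors)                      # 9 / 12
share_in_mse = max(errors) ** 2 / sum(e ** 2 for e in errors)  # 81 / 84

print(mae(forecasts, realized), mse(forecasts, realized))
print(round(share_in_mae, 3), round(share_in_mse, 3))
```

In this toy case the single large error accounts for three quarters of the MAE total and roughly 96% of the MSE total, which is the sense in which larger errors weigh more heavily, especially under MSE.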
4.2.2. Win Statistics (Daily Average) by Asset Category and FEI Intensity
- Frequency of Overall Wins (Panel A): In Test (T + 7), F2 achieved 46 wins against F1’s 28. F2 holds the advantage in four asset categories (Commodities, Exchange Rates, Cryptocurrencies, and 10Y Yield), while F1 maintains competitive strength in Indices.
- High-Accuracy Range (Panel B): F2 secured 31 wins, outperforming F1’s 26. However, a deeper analysis reveals F2’s wins are heavily concentrated in Exchange Rates and 10Y Yield. This confirms that F2’s superior performance in high-accuracy scenarios stems from the Optimization Variable enabled by the Debate Forecast mechanism, which enhances its effectiveness in specific financial assets.
- Mid-to-High-Accuracy Range (Panel C): F2 secures eight indicators, outperforming F1’s two. While the number of indicators in Panel C has noticeably decreased (from Panel B), they are distributed across all five asset categories. Within this range, F1’s performance is already sufficiently robust, while F2’s Debate Forecast mechanism provides marginal gains but does not establish a dominant advantage.
- Mid-to-Low-Accuracy Range (Panel D) and Low-Accuracy Range (Panel E): F1’s effective win count is zero, while F2 records five and two wins in Panels D and E, respectively. This indicates F2’s resilience in extreme FMI forecasts, compensating for F1’s lack of forecast resilience and accuracy.
- Summary: Table 11’s comprehensive analysis reveals F2’s structural breakthrough in “operational resilience” and “domain-specific advantages.” Through the Optimization Variable’s Debate Mechanism correction, forecast accuracy for over half of financial assets is noticeably enhanced. This confirms F2’s ability to maintain full interval stability while possessing targeted forecast advantages.
4.2.3. Correction Analysis: How (F2) Debate Enhances (F1) Forecasts
- Valid Debate:
- Debate, LLM(2), Valid: Adopted LLM(2) revisions; FEI validation confirms “accuracy > LLM(1)”.
- Debate, LLM(1), Valid: Retained LLM(1); FEI shows its performance ≥ LLM(2), blocking suboptimal proposals.
- Invalid Debate:
- Debate, LLM(2), Invalid: Mistakenly adopted LLM(2); LLM(1) FEI was better than LLM(2)’s (over-correction).
- Debate, LLM(1), Invalid: Retained LLM(1); LLM(2) had higher accuracy but was mistakenly ignored.
- Metrics:
- Total Valid: Absolute count of successful corrections/defenses via debate.
- Total Valid Ratio: Proportion of effective debates relative to total debates.
- High Total Valid and High Total Valid Ratio: Crypto assets (Total Valid: 89; Ratio: 90.8%) show “high-frequency high-accuracy” performance, with (F2) correcting (F1)’s systematic biases via the cross-debate mechanism. The dual-agent debate framework demonstrates superior adaptability in volatile, nonlinear markets (e.g., Crypto), achieving 89 valid corrections vs. nine invalid cases.
- Moderate Total Valid and Moderate Total Valid Ratio: Commodities (Total Valid: 71; Ratio: 69.6%), Exch. Rates (Total Valid: 68; Ratio: 64.8%), and 10Y Yield (Total Valid: 68; Ratio: 64.8%) exhibit moderate validity with high debate frequency. (F2) acts as a “precision sniper,” focusing on trend reversals while trusting (F1)’s baseline judgments for Commodities, Exch. Rates, and 10Y Yield.
- Low Total Valid and Low Total Valid Ratio: Indices (Total Valid: 29; Ratio: 27.9%) highlight over-correction risks. High debate frequency combined with low validity (Total Valid: 29; Total Invalid: 75) suggests that (F2)’s interventions introduce noise in efficient markets. Model optimization requires reducing debate thresholds to avoid overfitting.
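The valid/invalid classification and the Total Valid Ratio can be expressed compactly. The sketch below is a hedged reading of the definitions given in this subsection: the function names are hypothetical, and treating a tie as invalid when LLM(2)'s revision was adopted is an assumption, since the text does not address that case. The counts used in the check are the ones reported above.

```python
def is_valid_debate(adopted: str, fei_llm1: float, fei_llm2: float) -> bool:
    """Classify one debated item as valid (True) or invalid (False).

    adopted: "LLM2" if the Opponent's revision was adopted,
             "LLM1" if the Proponent's value was retained.
    """
    if adopted == "LLM2":
        return fei_llm2 > fei_llm1     # revision adopted and more accurate
    if adopted == "LLM1":
        return fei_llm1 >= fei_llm2    # original retained and at least as accurate
    raise ValueError("adopted must be 'LLM1' or 'LLM2'")


def total_valid_ratio(total_valid: int, total_invalid: int) -> float:
    """Share of effective debates among all debates, in percent."""
    return 100.0 * total_valid / (total_valid + total_invalid)


# Counts reported in the text for Crypto and Indices.
crypto_ratio = round(total_valid_ratio(89, 9), 1)
indices_ratio = round(total_valid_ratio(29, 75), 1)
print(crypto_ratio, indices_ratio)
```

Both ratios reproduce the reported percentages, which indicates that the ratio is computed as Total Valid over the sum of valid and invalid debates.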
4.2.4. Forecast Performance Differential Analysis
- High-Intensity Interval (Panel A): In the top 25% of samples showing the most significant performance gaps, F2 exhibits superior dominance: During Test (T + 7), F2 achieves 16 victories compared to F1’s three. Notably, Crypto (11 wins) and 10Y Yield (three wins) are key cases. This highlights F2’s predictive correction capability in Crypto and 10Y Yield markets where F1 struggles with only one marginal win.
- Moderate-Intensity Interval (Panel B): As the |Diff.| value decreases, F1’s competitiveness increases. F1 leads F2 by 10 to nine wins, with a marginal gap. Upon closer examination of asset distribution, F2 demonstrates strong domain advantages in Crypto (2) and 10Y Yield (5), and F1 in Indices (4).
- Moderate-Low-Intensity and Low-Intensity Intervals (Panels C, D): As the |Diff.| value decreases further, F1’s competitiveness declines slightly. As shown in Panels C and D, F2 leads F1 narrowly, by 11:7 and 10:8, respectively. This highlights F2’s core strength in making substantial adjustments in Exch. Rates and Commodities, rather than Indices.
- Systematic Dominance in Cryptocurrencies: F2 demonstrates its superiority in Crypto and parts of 10Y Yield blocks, with dense black bar visuals indicating significant error reduction (up to ~20%). This validates its debate mechanism as a volatility stabilizer, outperforming F1 in both win count and forecast error reduction for high-volatility, nonlinear, and policy-sensitive markets.
- Traditional Assets Showing Competitive Parity: For the traditional asset categories such as Exchange Rates and Commodities, F1 and F2 demonstrate alternating dominance with minimal error reduction fluctuations (mostly within ±10%). This reflects “competitive parity” between the two models in traditional assets. Notably, F1 retains a slight edge in highly efficient, mean-reverting markets like US100 and US500 (hollow bars in Indices). This underscores the resilience of benchmark forecasts/random walk hypothesis in efficient markets, where excessive debate interventions may introduce noise.
4.3. Analysis and Testing: Structural Models and Hypotheses
4.3.1. Analysis of Paired Sample Correlation
4.3.2. Paired Sample t-Test
- Mean and Stability: On all forecast days of Test (T + 7), F2’s FEI mean consistently exceeds F1’s, with F2’s FEI SD noticeably lower than F1’s. This indicates F2 not only has higher forecasting accuracy but also exhibits greater stability through reduced dispersion in results.
- Statistical Significance: Paired t-test results confirm significant performance differences (see Table 15). Notably, significant results were observed on Day 1 (p = 0.013), Day 3 (p = 0.015), Day 4 (p = 0.022), Day 6 (p = 0.019), and Day 7 (p = 0.008). In Test (T + 7), negative deviations (F1 minus F2) in paired differences (in Table 15) strongly support F2’s elevated FEI, highlighting the substantial impact of the debate mechanism on enhanced forecast accuracy.
- Additional Wilcoxon signed-rank test: Non-parametric validation via the Wilcoxon signed-rank test further supports the robust superiority of F2. The daily p-values (Day 1 to Day 7: 0.047, 0.094, 0.025, 0.008, 0.022, 0.004, 0.002) show significant improvements (p < 0.05) on 6 out of 7 days, including high significance (p < 0.01) on Days 4, 6, and 7. The marginal significance on Day 2 (p = 0.094) might be characteristic of authentic, real-world market noise; ultimately, this provides compelling evidence that the debate mechanism noticeably enhances forecast accuracy.
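As a rough illustration of the paired comparison described above, the following sketch computes the mean paired difference (F1 minus F2, matching the sign convention in the text) and the paired t statistic using only the standard library. The FEI values are invented for the example and are not results from this study, and the function name `paired_t_statistic` is hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(f1: Sequence[float], f2: Sequence[float]) -> Tuple[float, float, int]:
    """Return (mean difference, t statistic, degrees of freedom) for F1 minus F2."""
    diffs = [a - b for a, b in zip(f1, f2)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)           # sample standard deviation
    t_stat = mean_d / (sd_d / math.sqrt(n))
    return mean_d, t_stat, n - 1


# Illustrative item-level FEI values for one forecast day (not study data).
fei_f1 = [0.70, 0.80, 0.75, 0.90]
fei_f2 = [0.75, 0.82, 0.80, 0.91]

mean_d, t_stat, dof = paired_t_statistic(fei_f1, fei_f2)
print(mean_d, t_stat, dof)
```

A negative mean difference and t statistic indicate that F2's FEI is higher on average. For p-values, library routines such as `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` could be applied to the same paired arrays; whether the study used them is not stated in the text.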
4.3.3. Distribution of Closing Price Volatility Intensity
- High-Volatility Concentration: Cryptocurrencies dominate the Top Tier (STDEV.S ≥ 1.33%), accounting for 14 of its 15 assets, which positions them as representative of a “high-risk, high nonlinearity” market.
- Medium-to-Low Volatility Distribution (the Second, Third, and Bottom Tiers): Commodities (9/15) dominate the Second Tier (1.33~0.68%), Indices (9/15) the Third Tier (0.68~0.38%), and Exch. Rates (12/15) the Bottom Tier (0.38~0%).
- Performance Alignment with Environmental Considerations: These findings corroborate the earlier results (as evidenced by Table 11, Table 13, and Figure 14). Table 16 underscores Crypto’s elevated volatility (≥1.33%), consistent with F2’s dominance: 11/15 wins in Table 13 (Panel A) and a minimum 8.60% forecast performance gain. Figure 14 also confirms this trend, demonstrating that F2’s debate-guided Optimization Variables surpass F1 in volatile markets, highlighting both practical and academic significance.
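The tier boundaries above can be turned into a small classifier. Two points are assumptions rather than statements from the text: that STDEV.S is taken over daily percentage changes of the closing price, and that each tier includes its lower boundary. The function names and the sample closes are hypothetical.

```python
import statistics
from typing import Sequence


def sample_volatility_pct(closes: Sequence[float]) -> float:
    """Sample standard deviation (STDEV.S) of daily percentage changes.

    Assumes volatility is measured on day-over-day percentage changes.
    """
    changes = [(b / a - 1.0) * 100.0 for a, b in zip(closes, closes[1:])]
    return statistics.stdev(changes)


def volatility_tier(stdev_pct: float) -> str:
    """Map a volatility value (in percent) to the four tiers in the text."""
    if stdev_pct >= 1.33:
        return "Top"
    if stdev_pct >= 0.68:
        return "Second"
    if stdev_pct >= 0.38:
        return "Third"
    return "Bottom"


# Illustrative closing prices only.
tier = volatility_tier(sample_volatility_pct([100.0, 102.0, 101.0]))
print(tier)
```

`statistics.stdev` uses the n - 1 denominator, which matches the spreadsheet function STDEV.S named in the text.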
5. Concluding Remarks
6. Implications
7. Limitations and Future Studies
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A



- The Cross-Validation stage (see Figure A1): This stage conducts the systematic evaluation of the initial baseline forecast (F1) by the Opponent (LLM2). Its workflow operates under a strict five-step protocol:
  (1) Read F1 Value: The Opponent inputs and processes the Proponent’s direct baseline forecast (F1) for a specific FMI.
  (2) Give Confidence Score: The Opponent evaluates F1 and assigns a quantitative confidence score ([SC], bounded 0~1).
  (3) Detection Status: An algorithmic check determines the stance ([ST]): assigned as “Agreed” if SC ≥ 0.7, or “Disagree” if SC < 0.7.
  (4) Give Reasoning: If ST = Disagree, the Opponent generates a textual rationale ([RE]) articulating why it rejects the Proponent’s logic.
  (5) Give Revised Number: If ST = Disagree, the Opponent calculates and outputs a mathematically reasonable revised forecast ([RN]). Conversely, if ST = Agreed, the system defaults to retaining the F1 value as the final [RN].
- The Cross-Debate Loop stage (see Figure A2): This stage illustrates the core variance-control mechanism and the dynamic parameter injection process. It is important to note that throughout this loop, both LLMs operate under a symmetrical protocol, applying the exact same methodology for each exchange. The iterative process follows a strict eight-step algorithmic sequence:
  (1) Detection SC ≥ 0.7: The receiving model (e.g., LLM 1) reads the confidence score ([SC]) generated by the opposite model in the previous stage. If SC ≥ 0.7, the debate terminates; otherwise, it proceeds.
  (2) Detection Status: The model verifies the opposite model’s status ([ST]). If ST = Agreed, the loop halts; if ST = Disagree, the active cross-debate initiates.
  (3) Read Reasoning: The model ingests the opposite model’s refutation reasoning ([RE]), utilizing this explicitly structured conflicting rationale as the new contextual basis for logical correction.
  (4) Read RN Value: The model reads the revised numerical forecast ([RN]) proposed by the opposite model.
  (5) Give Confidence Score: The model critically evaluates the opposite model’s [RN] and assigns a new confidence score ([SC], bounded 0~1).
  (6) Detection Status: Based on the newly generated score, the model determines its own stance ([ST]): assigned as “Agreed” if SC ≥ 0.7, or “Disagree” if SC < 0.7.
  (7) Give Reasoning: If ST = Disagree, the model generates explicit text-based reasoning ([RE]) articulating its counter-argument and why it rejects the opposite model’s stance.
  (8) Give Revised Number: If ST = Disagree, the model computes and outputs a newly revised numerical forecast ([RN]) justified by its reasoning. Conversely, if initially ST = Agreed, the system automatically retains the previous round’s numerical value as the final [RN].
- The Consensus Building stage (see Figure A3): This stage executes the final convergence process where the iterative debate resolves cognitive conflict to produce the stable, relatively accurate final forecast (F2). It ensures a deterministic output through a strict two-step resolution protocol:
  (1) Give Consensus Value (F2): The system generates the final quantitative forecast (F2) based on the debate’s termination condition. If logical consensus (SC ≥ 0.7 or ST = Agreed) is achieved prior to or upon the conclusion of the three-round limit, F2 directly adopts the final revised numerical forecast ([RN]) from that concluding round of debate. Conversely, if the maximum three-round limit is exhausted without reaching mutual agreement, an algorithmic fallback is triggered, and F2 is calculated as the arithmetic mean of the final [RN] values proposed by both LLMs.
  (2) Give Status: The system logs an explicit convergence status tag to indicate the derivation method of F2, ensuring complete procedural traceability. If F2 resulted from a mutual logical agreement, the status is tagged as “Consensus.” If F2 was generated via the mathematical averaging fallback due to unresolved conflict after three rounds of debate, the status is tagged as “Mean.”
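The two-step resolution protocol can be sketched as a single function. This is an illustrative reading of the rule described above, with hypothetical names (`resolve_f2` and its parameters) and example values that reuse the BTC figures from the main text alongside invented ones; it is not presented as the study's code.

```python
from typing import Tuple

THRESHOLD = 0.7  # confidence gate stated in the text


def resolve_f2(last_sc: float, concluding_rn: float,
               proponent_rn: float, opponent_rn: float) -> Tuple[float, str]:
    """Return (F2, status) from the final state of the debate.

    last_sc:        confidence score from the concluding exchange
    concluding_rn:  revised number from that concluding exchange
    proponent_rn / opponent_rn: each side's final revised number
    """
    if last_sc >= THRESHOLD:
        return concluding_rn, "Consensus"            # agreement reached
    mean_rn = (proponent_rn + opponent_rn) / 2.0     # averaging fallback
    return mean_rn, "Mean"


agreed = resolve_f2(0.80, 118000.0, 120000.0, 118000.0)
unresolved = resolve_f2(0.60, 118000.0, 120000.0, 116000.0)
print(agreed, unresolved)
```

Returning the status together with the value mirrors the traceability requirement in step (2): a reader of the output can tell whether F2 came from agreement or from the averaging fallback.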
Appendix B

- (F1) Market Forecast Stage: As mapped in Figure 8, this involves instructing LLM(1) to analyze the HIPES data and generate initial predictions for the target FMIs.
  (1) Core Directive: The researcher’s prompt logic is to command LLM(1) to produce the baseline forecast.
  (2) Execution Rule: It strictly follows a single-turn execution (one round) to obtain the initial F1 value.
- (F2) Triple-C Stage 1/3 (Cross-Validation): As detailed in Figure A1, this step directs LLM(2) to act as an independent critic, evaluating and validating the initial forecast generated in F1.
  (1) Core Directive: The prompt logic is to compel LLM(2) to critically cross-validate the contents and rationales of LLM(1)’s F1 output.
  (2) Execution Rule: It strictly follows a single-turn execution (one round) to validate the F1 value derived by LLM(1).
- (F2) Triple-C Stage 2/3 (Cross-Debate Loop): As defined in Figure A2, this stage establishes the iterative interaction logic, where LLM(1) and LLM(2) actively debate, challenge, and refine each other’s rationales.
  (1) Core Directive: The prompt logic facilitates a cross-debate. Crucially, the researcher must inject the output generated from each interaction into the subsequent turn’s prompt as historical context.
  (2) Execution Rule: It follows an iterative loop that requires a minimum of one round and is capped at a maximum of three rounds of debate.
- (F2) Triple-C Stage 3/3 (Consensus Building): As outlined in Figure A3, this final step instructs LLM(1) to synthesize the entire debate history and construct the ultimate consensus forecast.
  (1) Core Directive: The prompt logic restricts LLM(1) to a purely mechanical extraction task rather than subjective evaluation. It strictly directs the LLM to review the final state of the debate and output a binary-like judgment: either extracting the reached “Consensus” or outputting the calculated “Mean”, thereby finalizing the result.
  (2) Execution Rule: It strictly follows a single-turn execution (one round) to extract the ultimate F2 value.













| Item | Commodities | Indices | Exchange Rates | Crypto | 10Y Yield |
|---|---|---|---|---|---|
| 1 | Brent | ASX200 | AUDUSD | ADA | Australia |
| 2 | Coal | DE40 | DXY | ALGO | Brazil |
| 3 | Copper | ES35 | EURUSD | ATOM | Canada |
| 4 | Crude Oil | FR40 | GBPUSD | AVAX | Chile |
| 5 | Gasoline | GB100 | NZDUSD | BCH | China |
| 6 | Gold | IBOVESPA | USDBRL | BNB | France |
| 7 | Heating Oil | IT40 | USDCAD | BTC | Germany |
| 8 | Iron Ore CNY | JP225 | USDCHF | DAI | India |
| 9 | Lumber | MOEX | USDCNY | DOT | Italy |
| 10 | Natural Gas | SENSEX | USDINR | ETH | Japan |
| 11 | Silver | SHANGHAI | USDJPY | LTC | Russia |
| 12 | Soybeans | TSX | USDKRW | MATIC | South Africa |
| 13 | Steel | US100 | USDMXN | SOL | Switzerland |
| 14 | TTF Gas | US30 | USDRUB | UNI | United Kingdom |
| 15 | Wheat | US500 | USDTRY | XRP | United States |

| Process Stage | Agent (Prompt) | Role | Operational Objective (Technical Detail) |
|---|---|---|---|
| Panel A: F1 (Direct Forecast) | | | |
| Market Forecast | LLM(1) (P1, i.e., Prompt 1) | Proponent | Initialization: Transforms unstructured data into structured theoretical variables via Semantic-to-Theory Mapping to generate the baseline forecast (F1). |
| Panel B: F2 (Debate Forecast) | | | |
| Cross-Validation | LLM(2) (P2, i.e., Prompt 2) | Opponent | Scoring and Risk Detection: Challenges F1 logic and assigns an initial confidence score (SC) to trigger the debate path. |
| Cross-Debate Loop | LLM(1) (P3,5,7, i.e., Prompts 3, 5, and 7) | Proponent | Score Recognition and Defense: Identifies low scores (SC < 0.7), performs Causal Relationship Analysis, and refines the numerical forecast attributes. |
| Cross-Debate Loop | LLM(2) (P4,6,8, i.e., Prompts 4, 6, and 8) | Opponent | Re-Scoring and Critical Review: Evaluates the revised logic against theoretical constraints and updates the Agreement Score to determine consensus. |
| Consensus Building | LLM(1) (P9, i.e., Prompt 9) | Proponent | Threshold Verification and Convergence: Verifies the passing score (SC ≧ 0.7) and synthesizes the debate history into the final Robust Consensus (F2). |

| Process Stage | Agent (Prompt) | Input Data | Output Variable | Output File Name |
|---|---|---|---|---|
| Panel A: F1 (Direct Forecast) | | | | |
| Round 1 | LLM(1) (P1) | Context (C) | F1 | Forecast Values (F1) |
| Panel B: F2 (Debate Forecast) | | | | |
| 1. Cross-Validation | | | | |
| Round 1 | LLM(2) (P2) | C + F1 | RN1 | Opponent Context (1) |
| 2. Cross-Debate Loop | | | | |
| Round 1 (1/2) | LLM(1) (P3) | C + RN1 | RN2 | Proponent Context (1) |
| Round 1 (2/2) | LLM(2) (P4) | C + RN2 | RN3 | Opponent Context (2) |
| Round 2 (1/2) | LLM(1) (P5) | C + RN3 | RN4 | Proponent Context (2) |
| Round 2 (2/2) | LLM(2) (P6) | C + RN4 | RN5 | Opponent Context (3) |
| Round 3 (1/2) | LLM(1) (P7) | C + RN5 | RN6 | Proponent Context (3) |
| Round 3 (2/2) | LLM(2) (P8) | C + RN6 | RN7 | Opponent Context (4) |
| 3. Consensus Building | | | | |
| Round 1 | LLM(1) (P9) | Latest RN (RN1~7) | F2 | Consensus Values (F2) |

| Process Stage | Output Variable | Input Reference (Read) | Format Schema (Write) |
|---|---|---|---|
| Panel A: F1 (Direct Forecast) | | | |
| Round 1 | F1 | | [i]:[F1] |
| Panel B: F2 (Debate Forecast) | | | |
| 1. Cross-Validation | | | |
| Round 1 | RN1 | [i]:[F1] | [i]:[F1]/[SC1]/[ST1]/[RE1]/[RN1] |
| 2. Cross-Debate Loop | | | |
| Round 1 (1/2) | RN2 | [i]:[F1]/[SC1]/[ST1]/[RE1]/[RN1] | [i]:[RN1]/[SC2]/[ST2]/[RE2]/[RN2] |
| Round 1 (2/2) | RN3 | [i]:[RN1]/[SC2]/[ST2]/[RE2]/[RN2] | [i]:[RN2]/[SC3]/[ST3]/[RE3]/[RN3] |
| Round 2 (1/2) | RN4 | [i]:[RN2]/[SC3]/[ST3]/[RE3]/[RN3] | [i]:[RN3]/[SC4]/[ST4]/[RE4]/[RN4] |
| Round 2 (2/2) | RN5 | [i]:[RN3]/[SC4]/[ST4]/[RE4]/[RN4] | [i]:[RN4]/[SC5]/[ST5]/[RE5]/[RN5] |
| Round 3 (1/2) | RN6 | [i]:[RN4]/[SC5]/[ST5]/[RE5]/[RN5] | [i]:[RN5]/[SC6]/[ST6]/[RE6]/[RN6] |
| Round 3 (2/2) | RN7 | [i]:[RN5]/[SC6]/[ST6]/[RE6]/[RN6] | [i]:[RN6]/[SC7]/[ST7]/[RE7]/[RN7] |
| 3. Consensus Building | | | |
| Round 1 | F2 | [i]:[RNn-1]/[SCn]/[STn]/[REn]/[RNn] | [i]:[F2]/[ST8] |
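
A record written in the five-field schema above could be parsed along the following lines. This is only a sketch under assumptions: it treats the square brackets and slashes as literal delimiters (they may instead be placeholders in the authors' prompts), takes `i` to be an integer item index, and keeps the ST and RE fields as opaque strings because this section does not define them. The names `DebateRecord` and `parse_record` are hypothetical.

```python
import re
from typing import NamedTuple


class DebateRecord(NamedTuple):
    item: int         # [i]: item index (assumed to be an integer)
    ref_value: float  # value being responded to (F1 or the previous RN)
    sc: float         # [SC]: score for this turn
    st: str           # [ST]: kept as an opaque string (meaning not given here)
    re_text: str      # [RE]: kept as an opaque string (meaning not given here)
    rn: float         # [RN]: value proposed in this turn


# One bracketed index, a colon, then five bracketed fields joined by "/".
_RECORD = re.compile(
    r"^\[(\d+)\]:\[([^\]]*)\]/\[([^\]]*)\]/\[([^\]]*)\]/\[([^\]]*)\]/\[([^\]]*)\]$"
)


def parse_record(line: str) -> DebateRecord:
    """Parse one '[i]:[A]/[SC]/[ST]/[RE]/[RN]' line into a typed record."""
    match = _RECORD.match(line.strip())
    if match is None:
        raise ValueError(f"not a five-field debate record: {line!r}")
    item, ref_value, sc, st, re_text, rn = match.groups()
    return DebateRecord(int(item), float(ref_value), float(sc), st, re_text, float(rn))
```

A strict parser of this kind makes malformed LLM output fail loudly before it is injected into the next prompt, which matters when each round reads the previous round's record.
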

| Day | Data Start Date | HIPES Horizon | Data End and Experiment Start Date | Forecast Horizon | RV Date and Experiment End Date | News Count |
|---|---|---|---|---|---|---|
| Test (T + 7) | | | | | | |
| 1 | 19 October 2025 | 30 (days) | 18 November 2025 | 7 (days) | 25 November 2025 | 23 |
| 2 | 20 October 2025 | 30 (days) | 19 November 2025 | 7 (days) | 26 November 2025 | 21 |
| 3 | 21 October 2025 | 30 (days) | 20 November 2025 | 7 (days) | 27 November 2025 | 22 |
| 4 | 22 October 2025 | 30 (days) | 21 November 2025 | 7 (days) | 28 November 2025 | 22 |
| 5 | 23 October 2025 | 30 (days) | 22 November 2025 | 7 (days) | 29 November 2025 | 20 |
| 6 | 24 October 2025 | 30 (days) | 23 November 2025 | 7 (days) | 30 November 2025 | 20 |
| 7 | 25 October 2025 | 30 (days) | 24 November 2025 | 7 (days) | 1 December 2025 | 20 |
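
The rolling-window dates can be reproduced with plain date arithmetic. The sketch below assumes that each window starts one calendar day after the previous one, that the data end date is the start date plus the 30-day HIPES horizon, and that the realized value (RV) date is the end date plus the 7-day forecast horizon; the function name `window` is hypothetical.

```python
from datetime import date, timedelta

HIPES_DAYS = 30    # HIPES horizon (days of statements read before forecasting)
FORECAST_DAYS = 7  # forecast horizon, i.e., the (T + 7) test


def window(first_start: date, day: int) -> tuple:
    """Return (data start, data end / experiment start, RV date) for a test day."""
    start = first_start + timedelta(days=day - 1)  # window rolls forward daily
    end = start + timedelta(days=HIPES_DAYS)
    rv = end + timedelta(days=FORECAST_DAYS)
    return start, end, rv


for day in range(1, 8):
    start, end, rv = window(date(2025, 10, 19), day)
    print(day, start.isoformat(), end.isoformat(), rv.isoformat())
```

Under this arithmetic the seventh window starts on 25 October 2025, ends on 24 November 2025, and has its RV date on 1 December 2025.
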

| Day | F1: Direct Forecast (Round 1) | F2: Cross-Validation (Round 1), #A | F2: Cross-Validation (Round 1), #D | F2: Cross-Debate (Round 1), #A | F2: Cross-Debate (Round 1), #D | F2: Cross-Debate (Round 2), #A | F2: Cross-Debate (Round 2), #D | F2: Cross-Debate (Round 3), #A | F2: Cross-Debate (Round 3), #D | F2: Consensus Building (Round 1) |
|---|---|---|---|---|---|---|---|---|---|---|
| Day 1 | 75 | 1 | 74 | 67 | 7 | 3 | 4 | 4 | 0 | 74 |
| Day 2 | 75 | 1 | 74 | 66 | 8 | 5 | 3 | 3 | 0 | 74 |
| Day 3 | 75 | 1 | 74 | 67 | 7 | 1 | 6 | 6 | 0 | 74 |
| Day 4 | 75 | 1 | 74 | 59 | 15 | 9 | 6 | 6 | 0 | 74 |
| Day 5 | 75 | 1 | 74 | 1 | 73 | 0 | 73 | 73 | 0 | 74 |
| Day 6 | 75 | 2 | 73 | 54 | 19 | 3 | 16 | 16 | 0 | 73 |
| Day 7 | 75 | 4 | 71 | 56 | 15 | 15 | 0 | 0 | 0 | 71 |
| Total | 525 | 11 | 514 | 514 (#A + #D) | | 144 (#A + #D) | | 108 (#A + #D) | | 514 |
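
The per-day counts in the table above form a funnel, which can be checked mechanically. The sketch assumes that #A counts cases that reached agreement at a step and #D counts cases passed on to the next debate step; that reading is inferred from the numbers and is not stated in this section. The row values are copied from the table, and the helper name `consistent` is hypothetical.

```python
# (F1 count, CV #A, CV #D, R1 #A, R1 #D, R2 #A, R2 #D, R3 #A, R3 #D, consensus)
ROWS = {
    "Day 1": (75, 1, 74, 67, 7, 3, 4, 4, 0, 74),
    "Day 2": (75, 1, 74, 66, 8, 5, 3, 3, 0, 74),
    "Day 3": (75, 1, 74, 67, 7, 1, 6, 6, 0, 74),
    "Day 4": (75, 1, 74, 59, 15, 9, 6, 6, 0, 74),
    "Day 5": (75, 1, 74, 1, 73, 0, 73, 73, 0, 74),
    "Day 6": (75, 2, 73, 54, 19, 3, 16, 16, 0, 73),
    "Day 7": (75, 4, 71, 56, 15, 15, 0, 0, 0, 71),
}


def consistent(row: tuple) -> bool:
    """Check that each step's #A + #D equals the cases handed to that step."""
    n, a0, d0, a1, d1, a2, d2, a3, d3, final = row
    return (
        a0 + d0 == n        # every forecast is either agreed or debated
        and a1 + d1 == d0   # round 1 receives the cross-validation #D cases
        and a2 + d2 == d1   # round 2 receives the round 1 #D cases
        and a3 + d3 == d2   # round 3 receives the round 2 #D cases
        and final == d0     # consensus building covers every debated case
    )


all_ok = all(consistent(row) for row in ROWS.values())
debated_total = sum(row[2] for row in ROWS.values())
print(all_ok, debated_total)
```
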

| Asset Category | Day 1 | Day 2 | Day 3 | Day 4 | Day 5 | Day 6 | Day 7 | Total |
|---|---|---|---|---|---|---|---|---|
| Panel A: Agreed, LLM(1) | | | | | | | | |
| Commodities | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 3 |
| Indices | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Exchange Rates | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Crypto | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 7 |
| 10Y Yield | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total: | 1 | 1 | 1 | 1 | 1 | 2 | 4 | 11 |
| Panel B: Debate, LLM(1) | | | | | | | | |
| Commodities | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| Indices | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Exchange Rates | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Crypto | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10Y Yield | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| Total: | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 2 |
| Panel C: Debate, LLM(2) | | | | | | | | |
| Commodities | 15 | 14 | 15 | 15 | 15 | 14 | 13 | 101 |
| Indices | 15 | 15 | 15 | 15 | 15 | 15 | 14 | 104 |
| Exchange Rates | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 105 |
| Crypto | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 98 |
| 10Y Yield | 15 | 15 | 15 | 15 | 15 | 14 | 15 | 104 |
| Total: | 74 | 73 | 74 | 74 | 74 | 72 | 71 | 512 |

| Forecast Horizon | F1’s FEI Value | Comp. Result | F2’s FEI Value |
|---|---|---|---|
| Test (T + 7) | | | |
| Panel A: Comparing Daily Average FEIs for F1 and F2 | |||
| Day 1 | 0.783 | < | 0.814 * |
| Day 2 | 0.795 | < | 0.816 * |
| Day 3 | 0.780 | < | 0.813 * |
| Day 4 | 0.776 | < | 0.810 * |
| Day 5 | 0.774 | < | 0.791 * |
| Day 6 | 0.771 | < | 0.809 * |
| Day 7 | 0.763 | < | 0.803 * |
| Panel B: Comparing Overall Average FEIs for F1 and F2 | |||
| Overall Average | 0.777 | < | 0.808 * |

| Forecast Horizon | F1’s MAE Value | Comp. Result | F2’s MAE Value |
|---|---|---|---|
| Test (T + 7) | | | |
| Panel A: Comparing Daily Average MAEs for F1 and F2 | |||
| Day 1 | 1393 | > | 1301 * |
| Day 2 | 1056 * | < | 1189 |
| Day 3 | 1440 | > | 1296 * |
| Day 4 | 1545 | > | 1303 * |
| Day 5 | 1587 | > | 1458 * |
| Day 6 | 1622 | > | 1359 * |
| Day 7 | 1703 | > | 1478 * |
| Panel B: Comparing Overall Average MAEs for F1 and F2 | |||
| Overall Average | 1478 | > | 1341 * |

| Forecast Horizon | F1’s MSE Value | Comp. Result | F2’s MSE Value |
|---|---|---|---|
| Test (T + 7) | | | |
| Panel A: Comparing Daily Average MSEs for F1 and F2 | |||
| Day 1 | 32.9 M | > | 18.9 M * |
| Day 2 | 11.9 M * | < | 14.4 M |
| Day 3 | 35.8 M | > | 18.0 M * |
| Day 4 | 45.6 M | > | 18.8 M * |
| Day 5 | 49.9 M | > | 27.2 M * |
| Day 6 | 53.6 M | > | 22.1 M * |
| Day 7 | 64.9 M | > | 31.2 M * |
| Panel B: Comparing Overall Average MSEs for F1 and F2 | |||
| Overall Average | 42.1 M | > | 21.5 M * |
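
For reference, the MAE and MSE reported in the two preceding tables are presumably the standard mean absolute error and mean squared error between forecast and realized values. The sketch below shows those textbook definitions only; the sample numbers are hypothetical and are not taken from the study, and the function names are illustrative.

```python
from typing import Sequence


def mae(forecasts: Sequence[float], realized: Sequence[float]) -> float:
    """Mean absolute error: average of |forecast - realized|."""
    if len(forecasts) != len(realized) or not forecasts:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(f - r) for f, r in zip(forecasts, realized)) / len(forecasts)


def mse(forecasts: Sequence[float], realized: Sequence[float]) -> float:
    """Mean squared error: average of (forecast - realized) squared."""
    if len(forecasts) != len(realized) or not forecasts:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((f - r) ** 2 for f, r in zip(forecasts, realized)) / len(forecasts)


# Hypothetical forecasts and realized values for three indicators.
example_forecasts = [10.0, 20.0, 30.0]
example_realized = [12.0, 18.0, 33.0]
example_mae = mae(example_forecasts, example_realized)
example_mse = mse(example_forecasts, example_realized)
```

Because MSE squares each error, a few large misses weigh far more heavily than they do in MAE. If the errors are pooled in raw price units across all 75 indicators, high-priced indicators would dominate both metrics, which is one possible reason the MSE figures are reported in millions.
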

| Asset Category | F1’s FEI Value | Comparison Result | F2’s FEI Value |
|---|---|---|---|
| Test (T + 7) | | | |
| Panel A: Overall Win Distribution across All 4 Quartiles (0 < FEI ≦ 1) | |||
| Commodities | 5/15 | < | 10/15 * |
| Indices | 11/15 * | > | 4/15 |
| Exchange Rates | 5/15 | < | 10/15 * |
| Crypto | 1/15 | < | 13/15 * |
| 10Y Yield | 6/15 | < | 9/15 * |
| Panel A Total | 28 | < | 46 * |
| Panel B: Top Quartile (75~100%, i.e., 0.75 ≦ FEI < 1.00) | |||
| Commodities | 5/15 | < | 7/15 * |
| Indices | 11/15 * | > | 3/15 |
| Exchange Rates | 5/15 | < | 9/15 * |
| Crypto | 1/15 | < | 3/15 * |
| 10Y Yield | 4/15 | < | 9/15 * |
| Panel B Total | 26 | < | 31 * |
| Panel C: Second Quartile (50~75%, i.e., 0.50 ≦ FEI < 0.75) | |||
| Commodities | 0/15 | < | 3/15 * |
| Indices | 0/15 | < | 1/15 * |
| Exchange Rates | 0/15 | < | 1/15 * |
| Crypto | 0/15 | < | 3/15 * |
| 10Y Yield | 2/15 * | > | 0/15 |
| Panel C Total | 2 | < | 8 * |
| Panel D: Third Quartile (25~50%, i.e., 0.25 ≦ FEI < 0.50) | |||
| Commodities | 0/15 | = | 0/15 |
| Indices | 0/15 | = | 0/15 |
| Exchange Rates | 0/15 | = | 0/15 |
| Crypto | 0/15 | < | 5/15 * |
| 10Y Yield | 0/15 | = | 0/15 |
| Panel D Total | 0 | < | 5 * |
| Panel E: Bottom Quartile (0~25%, i.e., 0.00 ≦ FEI < 0.25) | |||
| Commodities | 0/15 | = | 0/15 |
| Indices | 0/15 | = | 0/15 |
| Exchange Rates | 0/15 | = | 0/15 |
| Crypto | 0/15 | < | 2/15 * |
| 10Y Yield | 0/15 | = | 0/15 |
| Panel E Total | 0 | < | 2 * |

| Asset Category | F1: LLM(1) | F2 Agreed, Value from LLM(1) | F2 Debate, Value from LLM(1): Valid | F2 Debate, Value from LLM(1): Invalid | F2 Debate, Value from LLM(2): Valid | F2 Debate, Value from LLM(2): Invalid | Total Valid | Total Invalid | Total Valid Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Commodities | 105 | 3 | 1 | 0 | 70 | 31 | 71 | 31 | 69.6% |
| Indices | 105 | 1 | 0 | 0 | 29 | 75 | 29 | 75 | 27.9% |
| Exch. Rates | 105 | 0 | 0 | 0 | 68 | 37 | 68 | 37 | 64.8% |
| Crypto | 105 | 7 | 0 | 0 | 89 | 9 | 89 | 9 | 90.8% |
| 10Y Yield | 105 | 0 | 1 | 0 | 67 | 37 | 68 | 37 | 64.8% |
| Total: | 525 | 11 | 2 | 0 | 323 | 189 | 325 | 189 | |
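
The Total Valid Ratio column is consistent with valid debates divided by all debated cases (valid plus invalid), excluding the cases agreed at cross-validation. The sketch below recomputes it from the counts in the table; the function name `valid_ratio` is illustrative.

```python
# (total valid, total invalid) per asset category, copied from the table above.
COUNTS = {
    "Commodities": (71, 31),
    "Indices": (29, 75),
    "Exch. Rates": (68, 37),
    "Crypto": (89, 9),
    "10Y Yield": (68, 37),
}


def valid_ratio(valid: int, invalid: int) -> float:
    """Share of debated cases judged valid, as a percentage."""
    return 100.0 * valid / (valid + invalid)


ratios = {name: round(valid_ratio(v, i), 1) for name, (v, i) in COUNTS.items()}
overall = round(valid_ratio(325, 189), 1)  # Total row: 325 valid, 189 invalid
print(ratios, overall)
```

Applying the same formula to the Total row gives 325/514, or about 63.2%; the source table leaves that cell blank.
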

| Asset Category | F1’s FEI Value | Comparison Result | F2’s FEI Value |
|---|---|---|---|
| Test (T + 7) | | | |
| Panel A: Top Quartile (75~100%, i.e., |Diff.| ≧ 8.60%) | |||
| Commodities | 0/15 | < | 2/15 * |
| Indices | 1/15 * | > | 0/15 |
| Exchange Rates | 0/15 | = | 0/15 |
| Crypto | 1/15 | < | 11/15 * |
| 10Y Yield | 1/15 | < | 3/15 * |
| Panel A Total | 3 | < | 16 * |
| Panel B: Second Quartile (50~75%, i.e., 8.60% > |Diff.| ≧ 3.90%) | |||
| Commodities | 2/15 | = | 2/15 |
| Indices | 4/15 * | > | 0/15 |
| Exchange Rates | 1/15 * | > | 0/15 |
| Crypto | 0/15 | < | 2/15 * |
| 10Y Yield | 3/15 | < | 5/15 * |
| Panel B Total | 10 * | > | 9 |
| Panel C: Third Quartile (25~50%, i.e., 3.90% > |Diff.| ≧ 1.86%) | |||
| Commodities | 1/15 | < | 5/15 * |
| Indices | 3/15 * | > | 1/15 |
| Exchange Rates | 2/15 | < | 4/15 * |
| Crypto | 0/15 | = | 0/15 |
| 10Y Yield | 1/15 | = | 1/15 |
| Panel C Total | 7 | < | 11 * |
| Panel D: Bottom Quartile (0~25%, i.e., 1.86% > |Diff.| > 0%) | |||
| Commodities | 2/15 * | > | 1/15 |
| Indices | 3/15 | = | 3/15 |
| Exchange Rates | 2/15 | < | 6/15 * |
| Crypto | 0/15 | = | 0/15 |
| 10Y Yield | 1/15 * | > | 0/15 |
| Panel D Total | 8 | < | 10 * |

| Forecast Horizon | N | Correlation (r) | p |
|---|---|---|---|
| Test (T + 7) | | | |
| Day 1 (T + 1) | 75 | 0.908 | 0.000 *** |
| Day 2 (T + 2) | 75 | 0.911 | 0.000 *** |
| Day 3 (T + 3) | 75 | 0.888 | 0.000 *** |
| Day 4 (T + 4) | 75 | 0.863 | 0.000 *** |
| Day 5 (T + 5) | 75 | 0.890 | 0.000 *** |
| Day 6 (T + 6) | 75 | 0.843 | 0.000 *** |
| Day 7 (T + 7) | 75 | 0.864 | 0.000 *** |

| Forecast Horizon | F1 Mean | F1 SD | F2 Mean | F2 SD | Paired Differences: t-Value | Paired Differences: p-Value |
|---|---|---|---|---|---|---|
| Test (T + 7) | | | | | | |
| Day 1 (T + 1) | 0.783 | 0.245 | 0.814 | 0.204 | −2.543 * | 0.013 * |
| Day 2 (T + 2) | 0.795 | 0.230 | 0.816 | 0.197 | −1.879 | 0.064 |
| Day 3 (T + 3) | 0.780 | 0.246 | 0.813 | 0.202 | −2.502 * | 0.015 * |
| Day 4 (T + 4) | 0.776 | 0.249 | 0.810 | 0.205 | −2.344 * | 0.022 * |
| Day 5 (T + 5) | 0.774 | 0.252 | 0.791 | 0.234 | −1.274 | 0.207 |
| Day 6 (T + 6) | 0.771 | 0.254 | 0.809 | 0.209 | −2.405 * | 0.019 * |
| Day 7 (T + 7) | 0.763 | 0.260 | 0.803 | 0.216 | −2.706 * | 0.008 * |
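
The paired t-values can be approximately recovered from the summary statistics alone, which is a useful cross-check between the correlation table and the paired-differences table. The sketch assumes that the reported correlation r is the correlation between the paired F1 and F2 FEI values (N = 75), so that the variance of the differences is sd1² + sd2² − 2·r·sd1·sd2. Because the inputs are rounded to three decimals, the result matches the published value only roughly. The function name is hypothetical.

```python
from math import sqrt


def paired_t_from_summary(mean1: float, sd1: float,
                          mean2: float, sd2: float,
                          r: float, n: int) -> float:
    """Paired-sample t statistic rebuilt from means, SDs and their correlation."""
    # Variance of the paired differences d = x1 - x2.
    var_d = sd1 ** 2 + sd2 ** 2 - 2.0 * r * sd1 * sd2
    # Standard error of the mean difference.
    se = sqrt(var_d / n)
    return (mean1 - mean2) / se


# Day 1 (T + 1): F1 mean/SD, F2 mean/SD, correlation and N from the tables.
t_day1 = paired_t_from_summary(0.783, 0.245, 0.814, 0.204, 0.908, 75)
print(round(t_day1, 2))
```

For Day 1 this gives a t statistic of roughly −2.57, close to the reported −2.543; the negative sign reflects that the F2 mean FEI is higher than the F1 mean.
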

| Asset Category | Top (100~75%), STDEV.S ≧ 1.33% | Second (75~50%), STDEV.S 1.33~0.68% | Third (50~25%), STDEV.S 0.68~0.38% | Bottom (25~0%), STDEV.S 0.38~0% |
|---|---|---|---|---|
| Commodities | 2/15 | 9/15 * | 3/15 | 1/15 |
| Indices | 0/15 | 4/15 | 9/15 * | 2/15 |
| Exchange Rates | 0/15 | 0/15 | 3/15 | 12/15 * |
| Crypto | 14/15 * | 0/15 | 0/15 | 1/15 |
| 10Y Yield | 3/15 | 6/15 | 3/15 | 3/15 |
| Total | 19 | 19 | 18 | 19 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Chang, S.E.; Chung, K.-C. Exploring Cross-Debate Between LLMs to Improve the Forecasting of Financial Market Indicators. Mathematics 2026, 14, 1393. https://doi.org/10.3390/math14081393

