Article

RAE: A Role-Based Adaptive Framework for Evaluating Automatically Generated Public Opinion Reports

1 State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China
2 Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, Harbin 150001, China
3 School of Engineering, Santa Clara University, Santa Clara, CA 95053, USA
* Authors to whom correspondence should be addressed.
Electronics 2026, 15(2), 380; https://doi.org/10.3390/electronics15020380
Submission received: 5 December 2025 / Revised: 6 January 2026 / Accepted: 12 January 2026 / Published: 15 January 2026

Abstract

Public Opinion Reports are essential tools for crisis management, yet their evaluation remains a critical bottleneck that often delays response actions. However, dominant Large Language Model (LLM)-based evaluators overlook a critical challenge: highly open-ended dimensions such as “innovation” and “feasibility” require synthesizing diverse stakeholder perspectives, because different groups judge these qualities from fundamentally different standpoints. Motivated by this, we propose the Role-based Adaptive Evaluation (RAE) framework. RAE employs an adaptive role-play mechanism that draws on multi-perspective evaluation insights and, for these open-ended dimensions, dynamically generates roles tailored to the specific crisis context. RAE further incorporates multi-role reasoning aggregation to minimize individual biases and enhance evaluation robustness. Extensive experiments demonstrate that RAE significantly improves alignment with human expert judgments, especially on challenging highly open-ended dimensions.

1. Introduction

Public Opinion Reports consolidate information from news and social media to provide critical decision support for crisis management [1]. While Large Language Models (LLMs) have made automated report generation technically feasible, the evaluation of these reports remains a significant bottleneck: manual evaluation is time-consuming and often causes delays that lead decision-makers to miss optimal response windows, potentially worsening the crisis [2].
The open-ended and long-form nature of Public Opinion Reports presents a significant challenge for quality evaluation [3,4]. Traditional n-gram-based metrics such as ROUGE and BLEU prove inadequate as they rely on surface-level text overlap and fail to measure deeper qualities like semantic coherence [5]. Consequently, the research community has increasingly turned to LLM-based evaluators, which provide a more comprehensive evaluation that better aligns with human judgment [6,7].
Despite these advancements, a significant research gap remains because current LLM-based evaluators primarily operate from a singular and static perspective. Specifically, single-perspective evaluators [8,9] rely on a single LLM judgment, which fails to capture the multi-stakeholder nature of crisis evaluation. Furthermore, fixed-role evaluators [10] assign predefined personas but lack the flexibility to adapt to diverse crisis contexts. Such limitations prevent these systems from capturing the multi-faceted nature of public opinion analysis as these reports inherently consolidate varied and even conflicting sentiments and viewpoints from diverse groups, including the public, media, and official institutions. This inherent diversity of perspectives creates a fundamental problem for automated evaluation since open-ended qualities like innovation or feasibility are not absolute and instead vary according to the specific perspective of each stakeholder. For instance, a report recommendation to use a city stadium as a mass shelter might be rated highly feasible by a government official focusing on public space availability but deemed not feasible by a public health expert who anticipates significant risks of disease transmission. Single-perspective evaluators cannot synthesize these diverse stakeholder concerns, which leads to misalignment with human expert judgments that naturally integrate multiple viewpoints. As shown in Figure 1, multi-perspective evaluation achieves stronger alignment with expert judgment than single-perspective approaches by explicitly integrating diverse stakeholder viewpoints.
To address this research gap, we propose the Role-based Adaptive Evaluation (RAE) framework, which enhances alignment with human expert judgment by capturing multiple perspectives. This framework centers on how to design an evaluation approach that effectively captures diverse stakeholder perspectives for public opinion reports, particularly on highly open-ended dimensions. RAE operates through an adaptive role-play mechanism where the central innovation involves tailoring the role composition strategy to the specific nature of each evaluation dimension. For dimensions with reference information (e.g., Title, Causes), RAE employs representative stakeholder roles to ensure consistency while capturing diverse concerns. Since these dimensions rely on verifiable facts, representative roles are sufficient for providing necessary perspectives. For highly open-ended dimensions (e.g., Innovation, Feasibility), RAE dynamically generates context-specific roles to reflect varying stakeholder viewpoints. This dynamic role generation is essential because stakeholder priorities and expertise vary significantly across different crisis contexts. Moreover, RAE incorporates multi-role reasoning aggregation, which leverages insights from multiple roles to minimize biases and improve the overall robustness of the evaluation process.
This work makes the following key contributions:
  • We propose a multi-perspective evaluation framework that addresses the challenge of evaluating open-ended dimensions by dynamically generating roles to capture diverse, context-specific stakeholder viewpoints.
  • We introduce a multi-role reasoning aggregation mechanism that, for each evaluation dimension, synthesizes perspectives from multiple roles into a comprehensive dimension-specific score.
  • Comprehensive experiments demonstrate that RAE achieves strong alignment with human expert judgments, with particularly notable improvements on the challenging highly open-ended dimensions.

2. Related Work

While human evaluation is the gold standard for open-ended text generation [4], it suffers from being expensive, time-consuming, and unreliable [2]. Given the increasing capabilities of LLMs, LLM-as-a-Judge frameworks serve as a scalable and cost-efficient alternative [5,6]. Zero-shot and few-shot LLM-based evaluators show higher agreement with human annotators than traditional metrics across NLG tasks [11,12,13], sometimes surpassing crowdworkers when compared against expert annotations [14]. Recent research extends this approach by developing customizable judges [8,9,15] that can adopt diverse societal perspectives, ranging from the general public to domain experts [16,17,18].
However, relying on a single evaluator, whether human or LLM, can introduce bias and instability into the evaluation results [19,20,21]. Recent multi-agent evaluation systems explore diverse collaboration strategies [22,23]. Some strategies focus on interaction mechanisms [24], such as discussion with weighted voting [25], interactive querying [26], and structured committees or reviewer-chair hierarchies [27,28]. Others focus on the composition of the agent group, for example by simulating multiple roles within a single model [29] or using multiple smaller models for cost-effectiveness [30]. Collectively, these collaborative methods improve response factuality [31], enhance performance on complex tasks [10], and mitigate the Degeneration-of-Thought (DOT) problem [32].
Despite these advances, existing approaches overlook a fundamental challenge in crisis-related public opinion evaluation [8,9,10]. Crisis scenarios are inherently multifaceted, and highly open-ended dimensions require synthesizing diverse stakeholder perspectives because different groups judge qualities like “innovation” or “feasibility” from fundamentally different viewpoints that vary by context. To address this, we introduce an adaptive mechanism that dynamically generates context-tailored evaluator roles for these dimensions, combined with a reasoning aggregation mechanism that synthesizes independent role-specific judgments into a comprehensive and robust final assessment.

3. Preliminaries: The Structure of a Public Opinion Report

A Public Opinion Report comprises five key components, each designed to provide a comprehensive overview of a crisis event.

3.1. Event Title

The Event Title serves as a concise identifier that allows readers to immediately grasp the event’s essence. A well-formed title conveys the crisis name, type, and time, facilitating efficient storage and retrieval.

3.2. Event Summary

The Event Summary offers a structured overview covering six key aspects: Crisis Name (What), Location (Where), Time (When), Cause (Why/How), Involved Parties (Who), and Impact (the event’s consequences and severity).

3.3. Event Timeline

The Event Timeline traces public opinion evolution through three phases: Incubation Period (initial low-activity phase), Peak Period (rapid attention growth with widespread discussion), and Decline Period (diminishing interest phase).

3.4. Event Focus

The Event Focus analyzes public opinion from two participant groups—Netizens and Authoritative Institutions—extracting three dimensions for each: Core Topics, Sentiment Stance, and Key Viewpoints.

3.5. Event Suggestions

The Event Suggestions section provides actionable recommendations based on the preceding analysis. These may include communication strategies, public relations actions, or policy adjustments.

4. Methodology

We propose the RAE framework (Figure 2), a role-playing-based approach for multi-perspective evaluation. Compared to existing single-perspective evaluators [8,9] and fixed-role systems [10], RAE is designed to better reflect the multi-stakeholder nature of crisis evaluation by combining an adaptive role-play mechanism for dimension-specific evaluation with multi-role reasoning aggregation for robust score synthesis.

4.1. Task Definition and Research Hypotheses

Given a crisis event E and its corresponding generated public opinion report R, our evaluation task aims to assess the quality of R across a set of evaluation dimensions $D = \{d_1, d_2, \ldots, d_n\}$, where n represents the total number of dimensions. In our framework, the n = 15 dimensions span five major components: Event Title, Event Summary, Event Timeline, Event Focus, and Event Suggestions. Each dimension $d_i \in D$ evaluates a specific quality aspect with scores ranging from 1 (unacceptable) to 5 (excellent).
The evaluation process can be formalized as a function:
$S_d = f_{\mathrm{eval}}(r, R, E, d), \quad S_d \in \{1, 2, 3, 4, 5\},$
where r denotes the evaluator role (representing a specific stakeholder perspective), R is the generated report, E contains source materials and reference information, d is the dimension being evaluated, and $S_d$ is the resulting quality score.
Our goal is to construct a framework that maximizes the alignment between the evaluated scores S and the human expert consensus $S_{\mathrm{human}}$, as follows:
$S^{*} = \arg\max_{f_{\mathrm{eval}}} \mathrm{Alignment}(S, S_{\mathrm{human}}),$
where $S = \{S_{d_1}, S_{d_2}, \ldots, S_{d_n}\}$ represents the set of scores across all dimensions, and $\mathrm{Alignment}(\cdot)$ measures the degree of agreement through correlation metrics (Spearman's ρ, Kendall's τ) and an error metric (MAE).
To investigate and validate the effectiveness of the proposed RAE framework, we formulate the following four research hypotheses, each framed as a question about a core design choice of our system:
  • H1: What aggregation mechanism optimally synthesizes multi-role evaluations into reliable scores?
  • H2: Does increasing the number of dynamic roles continuously improve evaluation performance, or is there an optimal threshold?
  • H3: Why employ adaptive role composition rather than using only dynamically generated roles for all dimensions?
  • H4: What underlying mechanisms account for RAE’s improved alignment with human expert judgments in complex evaluation scenarios?

4.2. Adaptive Role-Play Mechanism

The central innovation of RAE is its adaptive role-play mechanism, which tailors the role composition strategy based on the nature of evaluation dimensions. All dimensions in RAE employ multi-perspective evaluation through multiple evaluator roles, but the approach to composing these roles differs based on whether the dimension relies on reference information or is highly open-ended.

4.2.1. Predefined Roles for Dimensions with Reference Information

For dimensions grounded by reference information (e.g., Title, Causes), evaluations revolve around verifiable facts. However, evaluating the quality of how the report writes and presents these facts (e.g., whether a Title is concise, or whether the Cause captures the core issue) still requires synthesizing multiple viewpoints. To achieve a stable, consistent, and representative multi-perspective evaluation for these dimensions, RAE employs a set of predefined, representative roles (e.g., “Netizen”, “Journalist”). Each role follows the schema: ⟨Role Type, Role Description⟩, where the Role Description specifies the evaluation focus for that role. By leveraging this fixed set of specialized roles, RAE ensures the evaluation of these core components is repeatable and robust, establishing a foundational multi-perspective evaluation layer for the framework.
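To make the ⟨Role Type, Role Description⟩ schema concrete, the sketch below shows one possible in-code representation of the predefined roles. It is a minimal illustration: only the role types “Netizen” and “Journalist” come from the text above, while the descriptions and the `EvaluatorRole` container are assumptions of this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvaluatorRole:
    """A role in the <Role Type, Role Description> schema used by RAE."""
    role_type: str
    role_description: str

# Illustrative predefined roles for reference-grounded dimensions such as Title;
# the descriptions below are placeholders, not the paper's exact prompts.
PREDEFINED_ROLES = [
    EvaluatorRole("Netizen", "Judges whether the title and stated causes are clear "
                             "and understandable to an ordinary reader."),
    EvaluatorRole("Journalist", "Checks conciseness, factual grounding, and whether "
                                "the core issue of the event is captured."),
]
```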

4.2.2. Dynamic Roles for Highly Open-Ended Dimensions

Highly open-ended dimensions, such as evaluating the Innovation or Feasibility of suggestions, lack fixed standards and assess long-form content where quality judgments vary dramatically by context. To address this, RAE dynamically generates context-specific evaluator roles, capturing the diverse, situation-dependent viewpoints needed for comprehensive evaluation. Each dynamic role follows the schema ⟨Role Type, Role Description⟩, where the description captures the stakeholder’s specific concerns and evaluation priorities. Given a crisis event E and its report R, this generation process can be formalized as:
$P(\mathrm{roles} \mid E, R) = \mathrm{LLM}(P_{\mathrm{dynamic}}, E, R),$
where $P_{\mathrm{dynamic}}$ denotes the role-generation prompt.
Dynamic generation may produce semantically overlapping roles. To ensure diverse and representative evaluators while maintaining efficiency, RAE implements a two-stage deduplication process. First, we apply MinHash with n-gram features to efficiently remove roles with high surface-level similarity. Second, we perform semantic clustering to identify conceptually similar roles. A sentence embedding model encodes each role description into a vector representation, and the k-means algorithm then clusters these vectors based on similarity. Finally, only the role closest to each cluster centroid proceeds to the evaluation stage.
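The following Python sketch illustrates the two-stage deduplication described above, assuming the `datasketch` library for MinHash, `sentence-transformers` for role-description embeddings, and scikit-learn's k-means. The specific embedding model, similarity threshold, and n-gram size are illustrative choices rather than values reported in the paper.

```python
import numpy as np
from datasketch import MinHash, MinHashLSH              # surface-level dedup
from sentence_transformers import SentenceTransformer   # semantic embeddings
from sklearn.cluster import KMeans                      # semantic clustering

def char_ngrams(text: str, n: int = 3) -> set[str]:
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash_dedup(descriptions: list[str], threshold: float = 0.8) -> list[str]:
    """Stage 1: drop roles whose n-gram MinHash signatures are near-duplicates."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, desc in enumerate(descriptions):
        m = MinHash(num_perm=128)
        for gram in char_ngrams(desc):
            m.update(gram.encode("utf8"))
        if not lsh.query(m):        # no sufficiently similar role has been kept yet
            lsh.insert(str(i), m)
            kept.append(desc)
    return kept

def cluster_representatives(descriptions: list[str], k: int = 5) -> list[str]:
    """Stage 2: k-means over sentence embeddings; keep the role closest to each centroid."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
    emb = encoder.encode(descriptions)
    k = min(k, len(descriptions))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(emb)
    reps = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(emb[members] - km.cluster_centers_[c], axis=1)
        reps.append(descriptions[members[np.argmin(dists)]])
    return reps
```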

4.3. Multi-Role Reasoning Aggregation

The final stage of RAE consolidates diverse evaluations from multiple roles into a single, reliable score for each dimension. To identify the aggregation strategy that best aligns with human expert judgments, we investigate three alternative approaches:
(1) Voting Aggregation: This approach employs majority voting to aggregate scores from all evaluating roles. For a given evaluation dimension d, the final score $\hat{S}_d$ is the score (on a 1–5 scale) that is most frequently assigned by the set of all evaluating roles R:
$\hat{S}_d = \arg\max_{s \in \{1, \ldots, 5\}} \sum_{r=1}^{|R|} \mathbb{1}(S_{r,d} = s),$
where $|R|$ is the total number of roles, $S_{r,d}$ is the score from role r, and $\mathbb{1}(\cdot)$ is the indicator function.
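A minimal sketch of this voting rule is shown below. The paper does not specify how ties are broken, so the tie-breaking policy here (preferring the lower score) is an assumption of the sketch.

```python
from collections import Counter

def vote_aggregate(role_scores: list[int]) -> int:
    """Majority vote over per-role scores on the 1-5 scale.

    Tie-breaking is not specified in the paper; this sketch resolves ties
    toward the lower score as a conservative (more critical) default.
    """
    counts = Counter(role_scores)
    best_score, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return best_score

# Example mirroring the case study in Section 6.4: scores [4, 4, 3, 3, 3] -> 3
assert vote_aggregate([4, 4, 3, 3, 3]) == 3
```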
(2) Process Aggregation: This reasoning-centric approach employs a judge LLM. For a given dimension d, the judge receives only the textual justifications, excluding their numerical scores. The judge analyzes these arguments for logical soundness and then generates a final score $\hat{S}_d$ based on this qualitative evaluation:
$\hat{S}_d = \mathrm{LLM}_{\mathrm{judge}}\left(\{J_{1,d}, J_{2,d}, \ldots, J_{|R|,d}\}\right),$
where $J_{r,d}$ represents role r's justification for dimension d.
(3) Comprehensive Aggregation: This approach employs a judge LLM to holistically evaluate both the numerical scores and the textual justifications. For a given dimension d, the judge LLM receives these dual inputs to determine the most appropriate final score $\hat{S}_d$:
$\hat{S}_d = \mathrm{LLM}_{\mathrm{judge}}\left(\{S_{1,d}, S_{2,d}, \ldots, S_{|R|,d}\}, \{J_{1,d}, J_{2,d}, \ldots, J_{|R|,d}\}\right),$
where $S_{r,d}$ represents role r's score for dimension d.
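For the two judge-based strategies, the sketch below shows one way the judge input might be assembled, with justifications only (Process) or scores plus justifications (Comprehensive). The prompt wording and the `call_judge_llm` helper are hypothetical and not taken from the paper.

```python
from typing import Optional

def build_judge_prompt(dimension: str,
                       justifications: list[str],
                       scores: Optional[list[int]] = None) -> str:
    """Assemble the judge input for Process (justifications only) or
    Comprehensive (scores plus justifications) aggregation; wording is illustrative."""
    lines = [
        f"You are the final judge for the dimension '{dimension}'.",
        "Weigh the role-specific arguments below and output one integer score from 1 to 5.",
    ]
    for i, justification in enumerate(justifications, start=1):
        if scores is not None:                      # Comprehensive variant
            lines.append(f"Role {i} (score {scores[i - 1]}): {justification}")
        else:                                       # Process variant
            lines.append(f"Role {i}: {justification}")
    return "\n".join(lines)

# Usage (hypothetical LLM call):
# prompt = build_judge_prompt("Innovation", role_justifications)               # Process
# prompt = build_judge_prompt("Innovation", role_justifications, role_scores)  # Comprehensive
# final_score = call_judge_llm(prompt)   # call_judge_llm is a hypothetical helper
```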

5. Experiments

5.1. Datasets

OPOR-Bench [33] is an event-centric dataset specifically designed for public opinion report generation and evaluation, covering 463 crisis events spanning from 2012 to 2025.
These events span diverse crisis types including natural disasters (e.g., cyclones, floods) and human-caused crises (e.g., traffic accidents, industrial incidents). Each crisis event in the dataset includes a collection of multi-source documents, specifically news articles and social media posts, along with a structured reference summary containing key event information (location, time, cause, impact, etc.) and a set of reports generated by LLMs.

5.2. Experimental Setup

5.2.1. Baseline

We compare RAE against OPOR-Eval, an agentic evaluation pipeline that orchestrates three predefined tools (Fact-Checker, Opinion-Miner, and Solution-Counselor) for section-specific evaluation of public opinion reports [33].

5.2.2. Model Configuration

To evaluate the performance of RAE, we establish a comprehensive experimental framework encompassing both model configurations and statistical metrics. For the automated evaluation phase, we employ GPT-4o and DeepSeek-V3 to score reports across 15 distinct dimensions. During this process, the model temperature is set to 0.3 and all other hyperparameters remain at their default values as specified in the evaluation guidelines in Appendix A.
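As an illustration of this configuration, the sketch below shows what a single role-conditioned scoring call might look like through an OpenAI-compatible client with temperature 0.3. The prompt wording, default model identifier, and `score_dimension` helper are assumptions of this sketch, not the exact evaluation guidelines of Appendix A.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; DeepSeek-V3 is typically reached
                   # through an OpenAI-compatible endpoint via a custom base_url

def score_dimension(role_description: str, report: str, dimension: str,
                    model: str = "gpt-4o") -> str:
    """One role-conditioned scoring call; the prompt wording is illustrative."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.3,                      # matches the experimental setup
        messages=[
            {"role": "system", "content": role_description},
            {"role": "user", "content": (
                f"Rate the following public opinion report on '{dimension}' "
                f"from 1 to 5 and justify the score.\n\n{report}")},
        ],
    )
    return response.choices[0].message.content
```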

5.2.3. Evaluation Metrics

To ground these automatic results, we also conduct a human evaluation and calculate the Intraclass Correlation Coefficient (ICC) to assess the inter-rater reliability among human experts. The alignment between automatic metrics and human judgments is subsequently quantified through three key indicators: Spearman's rank correlation (ρ), Kendall's tau (τ), and Mean Absolute Error (MAE).
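These indicators can be computed directly from paired score lists; the sketch below uses SciPy and NumPy, with made-up toy scores for the usage example.

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

def alignment_metrics(rae_scores: np.ndarray, human_scores: np.ndarray) -> dict:
    """Correlation and error metrics used to quantify human-agent agreement."""
    rho, _ = spearmanr(rae_scores, human_scores)
    tau, _ = kendalltau(rae_scores, human_scores)
    mae = float(np.mean(np.abs(rae_scores - human_scores)))
    return {"spearman_rho": rho, "kendall_tau": tau, "mae": mae}

# Toy usage with made-up scores (not from the paper's data):
print(alignment_metrics(np.array([4, 3, 5, 2, 4]), np.array([4, 3, 4, 2, 5])))
```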

5.3. Evaluation Framework Validation

5.3.1. Human Evaluation Protocol

To validate the RAE framework, we conduct a rigorous human evaluation with three domain experts who co-design the scoring criteria. The entire process is blind, with experts independently scoring reports via a dedicated annotation interface. The protocol is divided into two sequential phases:
  • Calibration: Experts first rate a subset of reports. The goal is to establish a consistent understanding of the scoring criteria, with agreement formally measured using the Intraclass Correlation Coefficient (ICC).
  • Formal Evaluation: After achieving a high level of agreement in the calibration phase, the experts proceed to score the entire corpus. This phase yields the three complete sets of ratings used for our human-agent agreement analysis.

5.3.2. Agreement Analysis

We analyze the human evaluation results to address two critical questions: (1) How reliable are human experts? (2) How strongly does RAE align with human judgments?
Answer 1: Our analysis confirms a high degree of inter-rater reliability among human experts. Established guidelines categorize ICC values as indicating poor (<0.50), moderate (0.50–0.75), good (0.75–0.90), or excellent (>0.90) agreement [34]. In light of these standards, our results in Table 1 reveal consistently high reliability across all dimensions, ranging from objective ones like Date Accuracy to more subjective ones such as Innovation. The consistently strong ICC scores validate the human ratings as a reliable standard for evaluating the RAE framework.
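For readers reproducing this reliability check, the sketch below shows one way to compute ICC3 from long-format ratings using the `pingouin` package; the column names and rating values are illustrative only.

```python
import pandas as pd
import pingouin as pg

# Long-format ratings: one row per (report, expert) pair for a single dimension.
# The scores below are made up purely for illustration.
ratings = pd.DataFrame({
    "report": ["r1", "r1", "r1", "r2", "r2", "r2", "r3", "r3", "r3"],
    "expert": ["e1", "e2", "e3", "e1", "e2", "e3", "e1", "e2", "e3"],
    "score":  [4, 4, 5, 3, 3, 3, 5, 4, 5],
})

icc = pg.intraclass_corr(data=ratings, targets="report",
                         raters="expert", ratings="score")
# ICC3 corresponds to a two-way mixed-effects model with consistency agreement.
print(icc.loc[icc["Type"] == "ICC3", ["Type", "ICC"]])
```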
Answer 2: RAE achieves strong alignment with human judgments. Under established guidelines, correlations (Spearman's ρ, Kendall's τ) above 0.70 are considered strong [35], and lower MAE indicates better alignment [36]. As detailed in Table 2, the RAE framework demonstrates consistently high performance, with the majority of results surpassing the “strong” correlation threshold. RAE also outperforms the OPOR-Eval baseline on both base models, achieving higher correlation and lower MAE on GPT-4o and DeepSeek-V3. This robust performance confirms that RAE reliably captures human evaluation patterns. The consistency holds across different aggregation methods, role counts, and base models. We further analyze the impact of specific design choices in the following sections.

6. Discussions

This section provides a hypothesis-driven discussion of our experimental results. We validate the research hypotheses defined in Section 4.1 through controlled comparisons, ablations, and a case study, and we summarize the practical implications of each finding.

6.1. Validation of H1: Voting Aggregation Achieves Optimal Alignment with Human Judgments

Our results confirm H1, which posits that a democratic synthesis of multi-role perspectives is superior to centralized judgment. The Voting strategy demonstrates a clear performance advantage, achieving an average Spearman’s ρ of 0.835 that consistently outperforms both the Process (0.818) and Comprehensive (0.798) methods. This superiority is robust, holding true across different base models and numbers of roles, and is further corroborated by its consistently lower MAE, as detailed in Table 2. This hierarchy reveals a compelling insight about aggregation mechanism design. The success of voting suggests that it leverages the “wisdom of crowds” effect, where diverse, independent judgments converge toward the ground truth by canceling out individual biases and random errors. Conversely, introducing a single judge LLM creates a potential single point of failure that may amplify rather than mitigate subjective biases. Most notably, the Comprehensive approach (providing both scores and justifications) paradoxically degrades performance. This stems from its centralized design, which concentrates decision-making in a single judge LLM. This centralization is vulnerable to two sources of bias: (1) while voting distributes power across roles (allowing biases to cancel out), the judge’s own preferences can override the statistical consensus; and (2) the justifications introduce noise, allowing a single eloquent outlier (e.g., a low score with strong reasoning) to unduly sway the judge. In either case, this replaces distributed collective wisdom with a potentially flawed individual judgment. In contrast, the straightforward majority voting mechanism preserves the robustness of multi-perspective evaluation through direct numerical aggregation, proving to be the most reliable aggregation strategy for the RAE framework.

6.2. Validation of H2: Five Dynamic Roles Optimally Balance Diversity and Relevance

To determine the optimal number of dynamic roles, we vary the number of roles (k) retained after clustering from 3 to 7. Given that Voting aggregation achieves the strongest alignment, we report its performance across different base models for each k value. As shown in Figure 3, the results reveal a clear non-linear trend that validates H2: performance improves as k increases from 3 to 5, reaches its strongest alignment at k = 5, and then declines at k = 6. This pattern highlights a critical trade-off between perspective diversity and domain relevance. Initially, adding more roles (from 3 to 5) is beneficial as it captures essential stakeholder perspectives. For a flooding crisis, increasing from 3 to 5 roles might add critical roles such as “Urban Infrastructure Planner” and “Public Health Specialist” alongside core roles like “Hydraulic Engineer” and “Emergency Coordinator”, ensuring comprehensive coverage of the crisis’s multifaceted dimensions. However, excessive role generation (at k = 6 or beyond) begins to introduce marginally relevant perspectives that weaken evaluation quality. For instance, generating a sixth role like “Agricultural Policy Expert” for the same flooding crisis may provide limited insight into urban flood response evaluation, introducing noise rather than meaningful perspective diversity. The system trades comprehensive domain coverage for increasingly less central viewpoints. This finding provides practical guidance for deploying RAE by suggesting the generation of a diverse pool of candidate roles, followed by clustering and retaining roughly five key representatives to maximize both evaluation coverage and precision.

6.3. Validation of H3: Dynamic Role Generation Is Critical, While Adaptive Composition Improves Efficiency

To analyze the role of each component and validate the adaptive composition strategy, we conduct a comparative analysis on GPT-4o, using the Voting-based aggregation configuration with five dynamically generated roles. We design four variants: (1) w/o Predefined: Removes the predefined representative roles. It evaluates the dimensions with reference information directly without role-play, but keeps dynamic roles for open-ended dimensions; (2) w/o Dynamic: Removes the dynamically generated roles. It evaluates the highly open-ended dimensions directly without role-play, but keeps predefined roles for dimensions with reference information; (3) All Predefined: Applies the predefined representative roles to all dimensions; and (4) All Dynamic: Applies dynamically generated roles to all dimensions.
The ablation study in Table 3 provides strong support for H3. First, dynamic role generation is the most critical component for high performance. The substantial performance drops in the two variants lacking dynamic roles for open-ended evaluation (w/o Dynamic drops to 0.81 (−0.07) and All Predefined drops to 0.82 (−0.06)) strongly support this conclusion. Both demonstrate that a fixed, predefined role strategy cannot adequately capture the context-specific expertise required for highly open-ended dimensions. Second, dynamic generation proves universally effective. Surprisingly, All Dynamic achieves performance (0.88) identical to the full RAE. This suggests that dynamic role generation, while specifically designed for open-ended dimensions, does not harm performance on dimensions with reference information. It is likely capable of generating domain-specific roles that prove equally effective as our predefined ones. Finally, these results validate the practical efficiency of RAE’s adaptive strategy. The similarly poor performance of All Predefined (0.82) and w/o Dynamic (0.81) confirms that generic representative roles cannot substitute for context-specific expertise. Performance drops concentrate in open-ended dimensions: Innovation (0.78 vs. 0.88), Feasibility (0.76 vs. 0.87), and Guidance (0.77 vs. 0.87). While All Dynamic performs well, the full RAE achieves equivalent performance with greater efficiency by using predefined roles for reference-informed dimensions, avoiding unnecessary dynamic generation costs. This adaptive strategy represents an optimal balance between performance and computational efficiency.

6.4. Validation of H4: Case Evidence Illustrates Why Multi-Perspective Evaluation Improves Expert Alignment

To provide intuition into the mechanics of RAE, we present a case study on a report about the “2021 Black Sea incident”, comparing RAE’s evaluation of the “Innovation” dimension against a vanilla LLM baseline, as depicted in Figure 4. The vanilla LLM baseline, lacking a specific viewpoint, offers a generic 4/5 score, praising the suggestions as a “fresh approach”. In contrast, RAE’s five dynamically generated roles produce domain-specific disagreements. Some roles, such as the Publicist and Diplomat, recognize innovation in transparency initiatives and multilateral dialogues within their respective domains, assigning scores of 4/5. However, the Military Strategist, International Lawyer, and Media Journalist assign a more critical 3/5, converging on the view that while the suggestions are solid, they remain “conventional in nature” and do not introduce fundamentally new concepts in their respective domains. With three of the five roles assigning a score of 3, the voting mechanism converges on a final, more grounded score of 3/5. This case vividly demonstrates how voting aggregation avoids being dominated by a few positive opinions and instead captures the cautious consensus. It reveals a nuanced reality that the baseline misses: what appears innovative in one domain can be conventional in others. This context-aware perspective leads to a more robust and well-grounded evaluation.

7. Conclusions

In this work, we propose the Role-based Adaptive Evaluation (RAE) framework for public opinion report evaluation in crisis contexts. RAE combines an adaptive role-play mechanism with multi-role reasoning aggregation to better reflect the multi-stakeholder nature of crisis assessment and to improve alignment with human expert judgments, especially on highly open-ended dimensions. Our findings show that RAE improves reliability over static single-agent evaluation by integrating diverse stakeholder viewpoints and synthesizing them with robust aggregation. Specifically, multi-perspective role-play is crucial for open-ended dimensions, where perceived innovation and feasibility depend strongly on context-specific stakeholder concerns. Moreover, the adaptive role-play mechanism balances evaluation quality and computational cost. In particular, voting-based aggregation provides the most reliable synthesis of role-specific judgments, with five dynamic roles offering an effective trade-off between perspective diversity and relevance. Furthermore, RAE provides practical value by informing stakeholder-oriented messaging [37], supporting multi-stakeholder ethical engagement [38], and guiding context-sensitive deployment in complex public opinion environments [39].

7.1. Limitations

Despite its effectiveness, RAE has several limitations. First, the computational cost of multi-role aggregation is higher than that of single-agent systems due to multiple LLM calls, which may be prohibitive for resource-constrained or real-time large-scale deployments. Second, performance is strictly dependent on the base LLM’s capability to simulate roles, and inherent model biases or limited domain knowledge could affect evaluation quality. Furthermore, our current scope is primarily limited to text-based, English-language crisis reports, which may limit generalization to multimodal data (e.g., images/videos) or to other languages and domains.

7.2. Future Work

Future work will focus on developing parameter-efficient aggregation mechanisms and exploring model distillation to reduce computational overhead and latency. We will also explore stronger role simulation and calibration methods, including domain-aware role constraints and automatic checks for role relevance and consistency, to mitigate model bias and knowledge gaps. Finally, we will extend RAE to additional languages and domains and explore multimodal extensions for more realistic crisis communication settings.

Author Contributions

Conceptualization, J.Y. and Y.X.; Investigation, J.Y.; Methodology, J.Y. and Y.X.; Formal Analysis, J.Y. and Y.X.; Software, Y.F.; Validation, Y.F. and J.Y.; Data Curation, Y.F.; Writing—Original Draft Preparation, J.Y.; Supervision, L.Z., H.S. and L.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to privacy and ethical restrictions.

Acknowledgments

During the preparation of this manuscript, the authors used Claude Opus 4.5 for grammar checking and language refinement. The authors have reviewed and take full responsibility for the content of the publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Detailed Scoring Criteria

This appendix presents detailed scoring criteria for evaluating Public Opinion Reports. To illustrate the distinct evaluation approaches for objective and subjective dimensions, we provide two representative examples: Date Accuracy (an objective dimension from the Event Timeline) and Innovation (a subjective dimension from the Event Suggestions). Following Kocmi and Federmann [40], all dimensions are rated on a 5-point Likert scale, where 1 indicates unacceptable quality and 5 represents excellent quality.

Appendix A.1. Objective Dimensions: Timeline—Date Accuracy

Scoring Criteria for Date Accuracy (Objective Dimensions)
This criterion evaluates the factual accuracy of the key dates in the Event_Timeline. The evaluation is based on four critical reference dates provided for the event’s lifecycle: the start of the Incubation Period, the start of the Peak Period, and the start and end of the Decline Period.
The final 1–5 score is determined based on how many of these four critical dates are correctly reflected in the generated timeline. The mapping is as follows:
Score 5 (Excellent): 
The timeline correctly reflects or aligns with all four of the reference dates.
Score 4 (Good): 
The timeline correctly reflects or aligns with three of the four reference dates.
Score 3 (Medium): 
The timeline correctly reflects or aligns with two of the four reference dates.
Score 2 (Poor): 
The timeline correctly reflects or aligns with only one of the four reference dates.
Score 1 (Unacceptable): 
The timeline fails to correctly reflect any of the four reference dates.
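Because this criterion is a direct count-to-score mapping, it can be expressed as a one-line function; the sketch below assumes the number of correctly reflected reference dates has already been determined by the evaluator.

```python
def date_accuracy_score(num_correct_dates: int) -> int:
    """Map how many of the four reference dates are correctly reflected (0-4)
    onto the 1-5 scale defined above: 0 -> 1, 1 -> 2, 2 -> 3, 3 -> 4, 4 -> 5."""
    if not 0 <= num_correct_dates <= 4:
        raise ValueError("expected between 0 and 4 correct reference dates")
    return num_correct_dates + 1
```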

Appendix A.2. Subjective Dimensions: Event Suggestions—Innovation

Scoring Criteria for Innovation (Subjective Dimensions)
Innovation is one of the four scoring dimensions for the Event Suggestions; we detail it here as a representative subjective dimension.
Innovation: This criterion evaluates whether the suggestions offer novel or forward-thinking approaches beyond standard practices.
Score 5 (Excellent):  
Demonstrates a disruptive or revolutionary perspective, providing an entirely fresh angle to tackle the issues.
Score 4 (Good): 
Proposes clearly innovative, forward-thinking solutions; shows strong originality.
Score 3 (Medium): 
Displays some creative elements or novel concepts but lacks a fully developed, breakthrough approach.
Score 2 (Poor): 
Makes slight improvements within a conventional framework, offering limited creativity.
Score 1 (Unacceptable): 
Strictly follows outdated or formulaic methods, showing no new ideas.

References

  1. Wang, B.; Zi, Y.; Zhao, Y.; Deng, P.; Qin, B. ESDM: Early sensing depression model in social media streams. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 22–24 May 2024; Calzolari, N., Kan, M.Y., Hoste, V., Lenci, A., Sakti, S., Xue, N., Eds.; ELRA and ICCL: Torino, Italy, 2024; pp. 6288–6298. [Google Scholar]
  2. Hashemi, H.; Eisner, J.; Rosset, C.; Van Durme, B.; Kedzie, C. LLM-Rubric: A multidimensional, calibrated approach to automated evaluation of natural language texts. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 13806–13834. [Google Scholar]
  3. Wang, D.; Yang, K.; Zhu, H.; Yang, X.; Cohen, A.; Li, L.; Tian, Y. Learning Personalized Alignment for Evaluating Open-ended Text Generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 13274–13292. [Google Scholar]
  4. Liu, Y.; Yu, J.; Xu, Y.; Li, Z.; Zhu, Q. A survey on transformer context extension: Approaches and evaluation. arXiv 2025, arXiv:2503.13299. [Google Scholar] [CrossRef]
  5. Chiang, C.H.; Lee, H.Y. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 15607–15631. [Google Scholar]
  6. Tseng, Y.M.; Huang, Y.C.; Hsiao, T.Y.; Chen, W.L.; Huang, C.W.; Meng, Y.; Chen, Y.N. Two tales of persona in LLMs: A survey of role-playing and personalization. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 16612–16631. [Google Scholar]
  7. Chen, Q.; Qin, L.; Liu, J.; Peng, D.; Guan, J.; Wang, P.; Hu, M.; Zhou, Y.; Gao, T.; Che, W. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models. arXiv 2025, arXiv:2503.09567. [Google Scholar]
  8. Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 2511–2522. [Google Scholar]
  9. Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 10–16 December 2023. [Google Scholar]
  10. Xiong, K.; Ding, X.; Cao, Y.; Liu, T.; Qin, B. Examining inter-consistency of large language models collaboration: An in-depth analysis via debate. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 7572–7590. [Google Scholar]
  11. Lin, Y.C.; Neville, J.; Stokes, J.; Yang, L.; Safavi, T.; Wan, M.; Counts, S.; Suri, S.; Andersen, R.; Xu, X.; et al. Interpretable user satisfaction estimation for conversational systems with large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 11100–11115. [Google Scholar]
  12. Liu, W.; Wang, X.; Wu, M.; Li, T.; Lv, C.; Ling, Z.; JianHao, Z.; Zhang, C.; Zheng, X.; Huang, X. Aligning large language models with human preferences through representation engineering. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 10619–10638. [Google Scholar]
  13. Lin, Y.T.; Chen, Y.N. LLM-Eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models. In Proceedings of the 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023), Toronto, ON, Canada, 14 July 2023; Chen, Y.N., Rastogi, A., Eds.; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 47–58. [Google Scholar]
  14. Cegin, J.; Simko, J.; Brusilovsky, P. ChatGPT to replace crowdsourcing of paraphrases for intent classification: Higher diversity and comparable model robustness. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 1889–1905. [Google Scholar]
  15. Pan, Q.; Ashktorab, Z.; Desmond, M.; Santillán Cooper, M.; Johnson, J.; Nair, R.; Daly, E.; Geyer, W. Human-centered design recommendations for LLM-as-a-judge. In Proceedings of the 1st Human-Centered Large Language Modeling Workshop, Bangkok, Thailand, 15 August 2024; Soni, N., Flek, L., Sharma, A., Yang, D., Hooker, S., Schwartz, H.A., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 16–29. [Google Scholar]
  16. Xu, C.; Wen, B.; Han, B.; Wolfe, R.; Wang, L.L.; Howe, B. Do language models mirror human confidence? Exploring psychological insights to address overconfidence in LLMs. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Association for Computational Linguistics: Vienna, Austria, 2025; pp. 25655–25672. [Google Scholar]
  17. Aher, G.; Arriaga, R.I.; Kalai, A.T. Using large language models to simulate multiple humans and replicate human subject studies. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  18. Park, J.S.; Popowski, L.; Cai, C.; Morris, M.R.; Liang, P.; Bernstein, M.S. Social simulacra: Creating populated prototypes for social computing systems. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, New York, NY, USA, 29 October–2 November 2022. [Google Scholar]
  19. Kim, A.; Kim, K.; Yoon, S. DEBATE: Devil’s advocate-based assessment and text evaluation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 1885–1897. [Google Scholar]
  20. Koo, R.; Lee, M.; Raheja, V.; Park, J.I.; Kim, Z.M.; Kang, D. Benchmarking cognitive biases in large language models as evaluators. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 517–545. [Google Scholar]
  21. Kumar, S.; Nargund, A.A.; Sridhar, V. CourtEval: A courtroom-based multi-agent evaluation framework. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Association for Computational Linguistics: Vienna, Austria, 2025; pp. 25875–25887. [Google Scholar]
  22. Li, G.; Al Kader Hammoud, H.A.; Itani, H.; Khizbullin, D.; Ghanem, B. CAMEL: Communicative agents for “mind” exploration of large language model society. In Proceedings of the 37th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 10–16 December 2023. [Google Scholar]
  23. Chen, A.; Lou, L.; Chen, K.; Bai, X.; Xiang, Y.; Yang, M.; Zhao, T.; Zhang, M. DUAL-REFLECT: Enhancing large language models for reflective translation through dual learning feedback mechanisms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 693–704. [Google Scholar]
  24. Zhao, J.; Plaza-del Arco, F.M.; Genchel, B.; Curry, A.C. Language model council: Democratically benchmarking foundation models on highly subjective tasks. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, NM, USA, 29 April–4 May 2025; Chiruzzo, L., Ritter, A., Wang, L., Eds.; Association for Computational Linguistics: Albuquerque, NM, USA, 2025; pp. 12395–12450. [Google Scholar]
  25. Zhang, Y.; Chen, Q.; Li, M.; Che, W.; Qin, L. AutoCAP: Towards automatic cross-lingual alignment planning for zero-shot chain-of-thought. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 9191–9200. [Google Scholar]
  26. Bai, Y.; Ying, J.; Cao, Y.; Lv, X.; He, Y.; Wang, X.; Yu, J.; Zeng, K.; Xiao, Y.; Lyu, H.; et al. Benchmarking foundation models with language-model-as-an-examiner. In Proceedings of the 37th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 10–16 December 2023. [Google Scholar]
  27. Zhao, R.; Zhang, W.; Chia, Y.K.; Xu, W.; Zhao, D.; Bing, L. Auto-Arena: Automating LLM evaluations with agent peer battles and committee discussions. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Association for Computational Linguistics: Vienna, Austria, 2025; pp. 4440–4463. [Google Scholar]
  28. Chu, Z.; Ai, Q.; Tu, Y.; Li, H.; Liu, Y. Automatic large language model evaluation via peer review. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, New York, NY, USA, 21–25 October 2024; pp. 384–393. [Google Scholar]
  29. Wu, N.; Gong, M.; Shou, L.; Liang, S.; Jiang, D. Large language models are diverse role-players for summarization evaluation. In Proceedings of the Natural Language Processing and Chinese Computing: 12th National CCF Conference, NLPCC 2023, Foshan, China, 12–15 October 2023; Proceedings, Part I. Springer: Berlin/Heidelberg, Germany, 2023; pp. 695–707. [Google Scholar]
  30. Chen, H.; Goldfarb-Tarrant, S. Safer or luckier? LLMs as safety evaluators are not robust to artifacts. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Association for Computational Linguistics: Vienna, Austria, 2025; pp. 19750–19766. [Google Scholar]
  31. Li, Y.; Du, Y.; Zhang, J.; Hou, L.; Grabowski, P.; Li, Y.; Ie, E. Improving multi-agent debate with sparse communication topology. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 7281–7294. [Google Scholar]
  32. Du, Y.; Li, S.; Torralba, A.; Tenenbaum, J.B.; Mordatch, I. Improving factuality and reasoning in language models through multiagent debate. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  33. Yu, J.; Xu, Y.; Li, H.; Li, J.; Zhu, L.; Shen, H.; Shi, L. OPOR-Bench: Evaluating Large Language Models on Online Public Opinion Report Generation. Comput. Mater. Contin. 2025. [Google Scholar] [CrossRef]
  34. Mu, H.; Xu, Y.; Feng, Y.; Han, X.; Li, Y.; Hou, Y.; Che, W. Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants’ API Invocation Capabilities. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–24 May 2024; pp. 2342–2353. [Google Scholar]
  35. Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; Lawrence Erlbaum Associates: Hillsdale, NJ, USA, 1988. [Google Scholar]
  36. Willmott, C.J.; Matsuura, K. Advantages of the Mean Absolute Error (MAE) over the Root Mean Square Error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82. [Google Scholar] [CrossRef]
  37. Zhou, A.; Tsai, W.H.S.; Men, L.R. Optimizing AI Social Chatbots for Relational Outcomes: The Effects of Profile Design, Communication Strategies, and Message Framing. Int. J. Bus. Commun. 2024. [Google Scholar] [CrossRef]
  38. Dospinescu, N. A Study on Ethical Communication in Business. In Proceedings of the 3rd International Scientific Conference on Recent Advances in Information Technology, Tourism, Economics, Management and Agriculture—ITEMA 2019, Bratislava, Slovakia, 24 October 2019; Selected Papers. ITEMA: Belgrade, Serbia, 2019; pp. 165–172. [Google Scholar] [CrossRef]
  39. Kazlauskienė, I.; Atkočiūnienė, V. Application of information and communication technologies for public services management in smart villages. Businesses 2025, 5, 31. [Google Scholar] [CrossRef]
  40. Kocmi, T.; Federmann, C. Large language models are state-of-the-art evaluators of translation quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, Tampere, Finland, 12–15 June 2023; Nurminen, M., Brenner, J., Koponen, M., Latomaa, S., Mikhailov, M., Schierl, F., Ranasinghe, T., Vanmassenhove, E., Vidal, S.A., Aranberri, N., et al., Eds.; European Association for Machine Translation: Tampere, Finland, 2023; pp. 193–203. [Google Scholar]
Figure 1. Comparison of (a) Single-Perspective Evaluation and (b) Multiple-Perspective Evaluation. (a) Single-perspective evaluation can misalign with expert judgment by overlooking stakeholder-specific concerns (✗). (b) Multi-perspective evaluation synthesizes diverse viewpoints for stronger expert alignment (✓).
Figure 2. Overview of the RAE framework. (Section 4.2) The Adaptive Role-Play Mechanism utilizes predefined and dynamic roles for evaluation. (Section 4.3) The Multi-Role Reasoning Aggregation synthesizes these evaluations into a final score (1–5).
Figure 3. Impact of dynamic role count on evaluation performance using Voting aggregation. The curves show Spearman’s ρ for GPT-4o and DeepSeek-v3. Performance peaks at 5 roles, demonstrating an optimal balance between perspective diversity and relevance.
Figure 4. Case study comparing (a) vanilla LLM baseline and (b) RAE framework on the “Innovation” dimension for a report on the “2021 Black Sea incident”. The (a) vanilla LLM provides a single, optimistic score, while (b) RAE aggregates diverse role-based scores for a more grounded result.
Table 1. The ICC scores confirm high inter-rater reliability among human experts.
Dimension | ICC3
Event Title | 0.843
Event Summary
 Event Nature | 0.868
 Time & Loc. | 0.860
 Involved Parties | 0.856
 Causes | 0.879
 Impact | 0.887
Event Timeline
 Date Acc. | 0.839
 Sub Events | 0.793
Event Focus
 Contro. Topic | 0.894
 Repr. Stmt. | 0.877
 Emo. Anal. | 0.893
Event Suggestions
 Rel. | 0.855
 Feas. | 0.841
 Emo. Guide. | 0.850
 Innov. | 0.861
Table 2. Human-agent agreement analysis on RAE aggregation methods (VOTE, PROC, COMP) compared with the Direct (no role-play) baseline and the OPOR-Eval framework. Best results in bold. Higher correlation ( ρ , τ , ↑) and lower MAE (↓) indicate better performance.
Method | Spearman's ρ ↑ | Kendall's τ ↑ | MAE ↓
GPT-4o
 VOTE | 0.873 | 0.864 | 0.335
 PROC | 0.862 | 0.852 | 0.346
 COMP | 0.844 | 0.834 | 0.382
 Direct | 0.695 | 0.605 | 0.534
 OPOR-Eval [33] | 0.721 | 0.633 | 0.412
DeepSeek-v3
 VOTE | 0.868 | 0.861 | 0.414
 PROC | 0.855 | 0.843 | 0.443
 COMP | 0.825 | 0.819 | 0.488
 Direct | 0.285 | 0.240 | 0.885
 OPOR-Eval [33] | 0.312 | 0.288 | 0.761
Table 3. Component analysis and strategy comparison of RAE on GPT-4o.
Method | Tit. | Nat. | T&L. | Par. | Cau. | Imp. | Dat. | Sub. | Top. | Sta. | Emo. | Rel. | Fea. | Gui. | Inn. | AVG
RAE | 0.88 | 0.87 | 0.89 | 0.88 | 0.86 | 0.88 | 0.86 | 0.89 | 0.86 | 0.89 | 0.87 | 0.88 | 0.87 | 0.87 | 0.88 | 0.88
w/o Predefined | 0.81 | 0.83 | 0.83 | 0.84 | 0.77 | 0.79 | 0.83 | 0.87 | 0.85 | 0.89 | 0.85 | 0.89 | 0.87 | 0.88 | 0.86 | 0.84 (−0.04)
w/o Dynamic | 0.83 | 0.84 | 0.86 | 0.85 | 0.86 | 0.87 | 0.82 | 0.78 | 0.81 | 0.76 | 0.82 | 0.78 | 0.75 | 0.75 | 0.77 | 0.81 (−0.07)
All Predefined | 0.86 | 0.84 | 0.87 | 0.86 | 0.83 | 0.85 | 0.84 | 0.81 | 0.82 | 0.77 | 0.83 | 0.79 | 0.76 | 0.77 | 0.78 | 0.82 (−0.06)
All Dynamic | 0.90 | 0.87 | 0.88 | 0.86 | 0.89 | 0.90 | 0.87 | 0.87 | 0.86 | 0.89 | 0.87 | 0.87 | 0.88 | 0.86 | 0.86 | 0.88 (0.00)
Notes: Columns group under the five report components (Title: Tit.; Summary: Nat., T&L., Par., Cau., Imp.; Timeline: Dat., Sub.; Focus: Top., Sta., Emo.; Suggestions: Rel., Fea., Gui., Inn.). AVG shows the average Pearson correlation, with the drop from the full RAE model in parentheses. Abbreviations: Tit. = Title, Nat. = Nature, T&L. = Time & Location, Par. = Parties, Cau. = Causes, Imp. = Impact, Dat. = Date Accuracy, Sub. = Sub Events Coverage, Top. = Controversy Topic, Sta. = Representative Statement, Emo. = Emotional Analysis, Rel. = Relevance, Fea. = Feasibility, Gui. = Guidance, Inn. = Innovation.


