Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

First, I would like to congratulate the authors on the work presented. The article addresses a relevant topic and proposes an interesting and well-structured architecture. However, in its current form, it presents important limitations, mainly in the experimental validation, which is based on a single case study and lacks comparison with existing approaches or real-world data. Furthermore, the evaluation raises potential concerns regarding circularity, as key metrics are derived from the model itself, and no robustness or sensitivity analysis to system configurations is provided. Overall, the contribution appears stronger at a conceptual level than at an empirical one. For these reasons, I believe that the article requires substantial revisions before it can be considered for acceptance.

Comments on the Quality of English Language

The manuscript is generally well written and clear, with an appropriate academic tone. However, minor grammatical issues, occasional overly long sentences, and some inconsistencies in terminology and punctuation should be addressed to improve readability.

Author Response

We sincerely thank you for the careful and constructive evaluation. The observation that the work was stronger at the conceptual level than at the empirical level was well taken and guided the most substantial changes in this revision.

R1-1. The experimental validation is based on a single case study and lacks comparison with existing approaches or real-world data.

We agree. In the revised manuscript, we have expanded the experimental section from a single scenario to three application scenarios:

Greenwashing backlash prediction (Scenario 1, original, Experiments 1–4, now with increased K)
Greenhushing risk assessment (Scenario 2, new, Experiment 5)
Crisis communication simulation (Scenario 4, new, Experiment 6)

In addition, we added a cross-model comparison experiment (Experiment 7) using GPT-4o alongside the original Claude Sonnet 4, and a prompt sensitivity test (Experiment 8). The revised experiments span 230 independent AFG runs across 19 conditions, 3 scenarios, and 2 LLM backends, with 1,840 persona-level responses.

Regarding comparison with existing approaches, we note that no directly comparable system exists that integrates sentinel monitoring with LLM-based synthetic population simulation. The cross-model comparison (Experiment 7) serves as an architectural robustness check rather than a head-to-head comparison with a competing system.

See revised Section 6 (all subsections), the updated experimental configuration table, and the new experiment result tables (Experiments 5–8).

R1-2. The evaluation raises potential concerns regarding circularity, as key metrics are derived from the model itself.

We acknowledge this concern. Two complementary strategies address it partially but do not resolve it.

First, the cross-model comparison (Experiment 7) mitigates one dimension of circularity: two different LLMs (Claude Sonnet 4 and GPT-4o), given identical prompts and persona specifications, produced highly correlated persona-level behavioral rankings (r = 0.935, p = 0.0006) despite differences in absolute sentiment values (p = 0.01) and thematic vocabulary (Jaccard = 0.060). This indicates that the structural patterns are not artifacts of a single model's response tendencies. However, the fundamental circularity remains: both models generate synthetic responses assessed by synthetic metrics, and cross-model agreement does not constitute external validation.

Second, we added a "What SAPIENT Cannot Justify" paragraph to the Limitations that explicitly states what the framework's outputs should not be used for: population-level estimates, individual predictions, regulatory evidence, or replacement of human research. We also added "External validation" as the lead limitation, clearly stating that all evaluation remains system-internal and that human-coded theme comparison is the necessary next step.

See the Cross-Model Comparison section (Experiment 7), Limitations ("External validation" and "What SAPIENT cannot justify").
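
The two agreement statistics quoted in this response (a Pearson correlation over persona-level values and a Jaccard overlap over theme vocabularies) can be illustrated with a minimal sketch. The function names, the eight per-persona scores, and the theme sets below are invented for illustration and are not the study's data or the authors' analysis code.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

def jaccard(a, b):
    """Jaccard overlap |A & B| / |A | B| between two theme sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Made-up per-persona mean sentiment (1-7 scale) from two hypothetical backends.
model_a = [2.1, 3.4, 4.0, 5.2, 2.8, 3.9, 4.6, 5.8]
model_b = [2.5, 3.5, 4.3, 5.3, 3.1, 4.0, 4.9, 6.1]

# Made-up theme vocabularies extracted from each backend's responses.
themes_a = {"trust", "transparency", "regulation"}
themes_b = {"trust", "pricing", "boycott"}

r = pearson_r(model_a, model_b)               # agreement of persona-level pattern
overlap = jaccard(themes_a, themes_b)         # agreement of theme vocabulary
mean_shift = sum(b - a for a, b in zip(model_a, model_b)) / len(model_a)
```

In this toy data the persona-level pattern is preserved (high r) even though the second backend scores higher on average and shares few themes, which is the kind of dissociation the response describes. Agreement of this sort is still internal to the models and is not external validation.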

R1-3. No robustness or sensitivity analysis to system configurations is provided.

We now provide sensitivity analysis along three dimensions:

Model sensitivity (Experiment 7): Changing the LLM backend from Claude to GPT-4o preserved persona rankings (r = 0.94) but shifted absolute sentiment by +0.22 points (p = 0.01).

Prompt sensitivity (Experiment 8): Semantic paraphrasing of the stimulus did not significantly affect sentiment (p = 0.10, n.s.) but produced a modest shift in credibility (p = 0.034).

Temperature sensitivity (Experiment 4, retained): Stratified temperature and adversarial persona injection were compared against a uniform baseline.

See the Cross-Model Comparison, Prompt Sensitivity, and Variance Collapse sections (Experiments 7, 8, and 4).

R1-4. The contribution appears stronger at a conceptual level than at an empirical one.

We agree with this characterization and have revised the paper's positioning accordingly. The abstract, contributions list, discussion, and conclusion now explicitly frame the work as a "framework contribution with preliminary empirical validation." The final sentence of the abstract reads: "The evidence remains preliminary and does not replace human validation; we present SAPIENT as a framework contribution with initial empirical grounding across multiple scenarios and models."

See revised Abstract, Contributions (Section 1), Discussion (Section 7), and Conclusion (Section 8).

R1-5. Minor grammatical issues, occasional overly long sentences, and some inconsistencies in terminology and punctuation.

We have conducted a thorough language revision. Specific fixes include: "multi agent" → "multi-agent" throughout; "in silico" → "in-silico" as compound modifier; "what happens next" → "subsequent narrative development"; and several long sentences in the methodology and governance sections split for readability.

Changes throughout the manuscript.

Overall Improvements Motivated by All Reviewers

In addition to the changes above, the following improvements were motivated by the collective feedback from all three reviewers:

Explicit hypotheses (H1–H4) were added at the beginning of the experimental section, providing a clearer inferential structure (motivated by Reviewer 2, comment 3).

A runtime and cost table with actual measurements (3,328 API calls, 3.2M tokens, $18.50, 34 minutes) replaces the placeholder text (motivated by Reviewer 2, comment 7).

Computational environment documented: Python 3.13, Windows, Anthropic/OpenAI API access, concurrency parameters (motivated by Reviewer 2, comment 8).

Framework extensibility demonstrated: adding GPT-4o and two new scenarios required only configuration-level changes, with no modifications to the core AFG protocol or analysis pipeline (motivated by Reviewer 3, comment 1).

Statistical framing revised throughout to use "exploratory," "preliminary evidence," and "provisional" language consistently (motivated by Reviewer 3, comment 4).

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript presents SAPIENT, a multi-agent framework for corporate reputation intelligence that combines sentinel-based media monitoring with LLM-driven synthetic population simulation. The paper addresses a relevant and timely problem by attempting to connect real-time signal detection from public text streams with qualitative in-silico exploration through Agentic Focus Groups (AFGs). One of the main strengths of the work is its conceptual originality: the proposed architecture offers an interesting bridge between monitoring and simulation, while also emphasizing human oversight, variance reporting, and explicit caution regarding the interpretive status of synthetic outputs. The research question is clearly stated, and the overall structure of the manuscript is coherent and easy to follow. The preliminary experiments also provide some initial empirical support for the proposed framework, especially regarding signal-conditioned prompting, framing sensitivity, and cross-lingual asymmetries.

However, the manuscript also has several important weaknesses that should be addressed.

Although the literature coverage is broad and relevant, the manuscript refers to a review process that reportedly started with about 500 papers and ended with 23 selected studies, yet this review is not sufficiently documented or clearly referenced, which weakens confidence in the literature grounding of the article.
The bibliography is broad and current, but it relies substantially on arXiv preprints, working papers, and technical reports in several important areas, which reduces the robustness of the evidentiary basis for some of the stronger claims.
The paper does not formulate explicit hypotheses, which makes the inferential structure of the empirical section weaker than it could be.
The presentation of Algorithm 3.3.3 should be improved, since in its current form it is difficult to follow and would benefit from a clearer format.
The operational logic of the simulated social network is not described in enough detail, particularly regarding exposure, topology, and interaction dynamics.
The manuscript does not clearly explain how synthetic personas are instantiated at the LLM level, especially whether they correspond to separate model instances or prompt-conditioned roles built on a shared backend.
The manuscript includes a section on runtime and cost, but it does not actually report meaningful runtime measurements; it only states the LLM backend and published API pricing.
The computational environment is not adequately documented, since there is almost no useful information about the implementation stack, hardware/software setup, or execution environment.
The treatment of metrics needs improvement, because some measures are understandable conceptually, but their units, scales, ranges, and operational interpretation are not always presented in a sufficiently standardized way.
Although the research question is partially answered at the architectural level, the empirical support remains preliminary and limited to a single scenario and a single LLM, so some of the claims should be more carefully bounded. Overall, the manuscript is promising and intellectually interesting, but it currently reads more as a strong architectural proposal with preliminary validation than as a fully demonstrated empirical contribution. For this reason, the paper has merit and originality, but it requires substantial revision before it can be considered sufficiently robust for publication.

Author Response

We thank you for the detailed and operationally focused evaluation. The ten structured comments identified specific gaps that were directly actionable, and we have addressed each one.

R2-1. The literature review process (500 papers screened to 23) is not sufficiently documented.

We acknowledge that the screening process was not transparently reported in the original manuscript. We have added a "Literature Identification and Selection" paragraph at the beginning of Section 2, including five explicit inclusion/exclusion criteria and a PRISMA-inspired flow diagram showing the screening stages: ~500 candidates → 87 after title/abstract screening → 38 assessed at full-text level → 23 core studies. The broader reference list (46 sources) extends beyond the core set to include foundational, methodological, and calibration works.

See Section 2 (Related Work), opening paragraph and PRISMA flow diagram.

R2-2. The bibliography relies substantially on arXiv preprints and technical reports.

We have substantially reduced the preprint burden: 29 references were removed (23 preprints + 6 uncited entries), bringing the total from 75 to 46 references with 7 remaining preprints (15%). The remaining preprints are retained because the agentic AI and LLM simulation field is evolving at an exceptional pace; new foundation models, benchmarks, and architectural patterns appear on a near-weekly basis, and the peer-review cycle has not kept pace with the rate of innovation. The retained preprints are from established research groups and are not relied upon for the paper's central architectural claims. We have added an explicit paragraph in Section 2 explaining this rationale. Where peer-reviewed versions have become available since the original submission, references have been updated.

See Section 2 (Literature Identification and Selection paragraph). Bibliography reduced from 75 to 46 entries.

R2-3. The paper does not formulate explicit hypotheses.

We have added a Hypotheses subsection stating four testable hypotheses: H1 (Persona Differentiation), H2 (Signal Conditioning), H3 (Cross-Lingual Asymmetry), and H4 (Model Dependence). The Summary of Findings now maps results to these hypotheses explicitly.

See the Hypotheses section and the Summary of Findings section.

R2-4. Algorithm 3.3.3 is difficult to follow and would benefit from a clearer format.

We have improved the algorithm presentation by clarifying the step descriptions and adding inline comments. The full pseudocode is provided in the project repository (https://github.com/alperozpinar/SAPIENT-Framework), with a simplified workflow summary retained in the main text.

R2-5. The operational logic of the simulated social network is not described in enough detail.

We clarify that all experiments in this paper use independent mode exclusively, in which each persona responds to the stimulus without observing other personas' outputs. The network interaction mode is specified as an architectural module in Section 3; its empirical validation is deferred to future work. We now state this explicitly in the experimental section to avoid the impression that network dynamics were tested.

See the experimental section opening paragraph.

R2-6. It is unclear whether synthetic personas correspond to separate model instances or prompt-conditioned roles on a shared backend.

All personas are prompt-conditioned roles on a shared LLM backend. Each persona is instantiated through a distinct system prompt encoding its demographic, psychographic, and behavioral attributes, with a separate API call per persona per turn. There are no separate model instances; the differentiation arises entirely from prompt conditioning. We now state this explicitly.

See the experimental section opening paragraph.
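
A minimal sketch of what "prompt-conditioned roles on a shared backend" could look like in code is given below. Everything here is an assumption made for illustration: the `Persona` fields, the prompt wording, the `run_independent_turn` helper, and the `(system_prompt, user_message) -> reply` backend signature are hypothetical and are not taken from the SAPIENT repository.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

@dataclass(frozen=True)
class Persona:
    # Hypothetical attribute groups mirroring the three categories named above.
    name: str
    demographic: str
    psychographic: str
    behavioral: str

def system_prompt(p: Persona) -> str:
    """Encode one persona's attributes as a distinct system prompt."""
    return (
        f"You are {p.name}. Demographics: {p.demographic}. "
        f"Values and attitudes: {p.psychographic}. "
        f"Typical behavior: {p.behavioral}. Stay in character."
    )

# Assumed backend shape: (system_prompt, user_message) -> reply text.
Backend = Callable[[str, str], str]

def run_independent_turn(
    personas: Iterable[Persona], stimulus: str, backend: Backend
) -> Dict[str, str]:
    """Issue one backend call per persona.

    Independent mode: every persona receives only the stimulus and never
    sees another persona's output.
    """
    return {p.name: backend(system_prompt(p), stimulus) for p in personas}

def echo_backend(system: str, user: str) -> str:
    # Stand-in for a real LLM client so the sketch runs offline.
    return f"[{len(system)} chars of persona context] reply to: {user}"

panel = [
    Persona("Persona A", "age 34, urban", "environmentally motivated", "reads labels"),
    Persona("Persona B", "age 58, rural", "price sensitive", "brand loyal"),
]
replies = run_independent_turn(panel, "Company X announces a net-zero pledge.", echo_backend)
```

The single shared `backend` callable stands for the one model; the only thing that differs between personas is the system prompt passed into it.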

R2-7. The manuscript does not report meaningful runtime measurements.

All API calls are now instrumented with per-call token logging and latency measurement through the unified LLM abstraction layer. The runtime and cost table reports aggregate runtime: 3,328 API calls, 2.43M input tokens, 779K output tokens, total cost $18.50, elapsed time 33.9 minutes. Claude averaged 10.2s/call; GPT-4o averaged 6.5s/call but required reduced concurrency due to rate limits.

See the Runtime and Cost Profile section and the runtime and cost table.
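
Per-call token logging and latency measurement of the kind described here can be sketched as a thin wrapper around the model call. This is an illustrative assumption, not the authors' abstraction layer: `UsageLog`, `timed_call`, the `(text, input_tokens, output_tokens)` return shape, and the price parameters are all hypothetical, and the rates used in the example are placeholders rather than any provider's actual pricing.

```python
import time
from dataclasses import dataclass, field
from typing import List

@dataclass
class CallRecord:
    input_tokens: int
    output_tokens: int
    latency_s: float

@dataclass
class UsageLog:
    records: List[CallRecord] = field(default_factory=list)

    def timed_call(self, fn, *args, **kwargs):
        """Run fn, which is assumed to return (text, input_tokens, output_tokens)."""
        start = time.perf_counter()
        text, input_tokens, output_tokens = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        self.records.append(CallRecord(input_tokens, output_tokens, elapsed))
        return text

    def summary(self, usd_per_m_input: float, usd_per_m_output: float) -> dict:
        """Aggregate calls, tokens, cost, and mean latency across all records."""
        n = len(self.records)
        tokens_in = sum(r.input_tokens for r in self.records)
        tokens_out = sum(r.output_tokens for r in self.records)
        cost = tokens_in / 1e6 * usd_per_m_input + tokens_out / 1e6 * usd_per_m_output
        mean_latency = sum(r.latency_s for r in self.records) / n if n else 0.0
        return {
            "calls": n,
            "input_tokens": tokens_in,
            "output_tokens": tokens_out,
            "cost_usd": cost,
            "mean_latency_s": mean_latency,
        }

def fake_model(prompt: str):
    # Offline stand-in for an API client; the token counts are invented.
    return f"reply to {prompt}", 100, 50

log = UsageLog()
for prompt in ("first stimulus", "second stimulus"):
    log.timed_call(fake_model, prompt)

# Placeholder rates in USD per million tokens, chosen only for the example.
report = log.summary(usd_per_m_input=3.0, usd_per_m_output=15.0)
```

Summing the reported 2.43M input and 779K output tokens gives about 3.2M tokens, which matches the total quoted in the response to Reviewer 1.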

R2-8. The computational environment is not adequately documented.

We have added environment details: Python 3.13, Windows, API-based execution (no GPU), Anthropic API (claude-sonnet-4-20250514), OpenAI API (gpt-4o), with concurrency parameters (3 sessions for Claude, 1 for GPT-4o due to 30K TPM rate limit). The project GitHub repository has been updated to include all configuration files, and the Data Availability Statement now references this repository directly.

See the Runtime and Cost Profile section and the Data Availability Statement.

R2-9. Some metrics' units, scales, ranges, and operational interpretation are not sufficiently standardized.

We have ensured that all metrics are defined with their scale, direction, and interpretation in the Metrics and Statistical Methodology section. Sentiment and credibility use a 1–7 Likert scale. Theme stability uses CV ≤ 0.5 as the stability threshold. Variance collapse uses mean pairwise cosine similarity with δ = 0.85 as the flagging threshold. The experimental configuration table provides a unified view across all 19 conditions.

See Section 6.2 (Metrics and Statistical Methodology) and Table 2 (Experimental Configurations).
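
The two thresholded metrics named in this response can be sketched as follows. Several details are assumptions, because the response does not specify them: the coefficient of variation uses the population standard deviation, the variance-collapse flag fires when mean similarity is at or above δ, and responses are assumed to be available as numeric vectors. The input values are invented.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

CV_STABLE_MAX = 0.5    # stability threshold quoted in the response
COLLAPSE_DELTA = 0.85  # flagging threshold (delta) quoted in the response

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (assumed definition)."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / m

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two response vectors")
    return mean(cosine(u, v) for u, v in pairs)

# Made-up occurrences of one theme across four repeated runs.
theme_counts = [4, 5, 6, 5]
theme_is_stable = coefficient_of_variation(theme_counts) <= CV_STABLE_MAX

# Made-up response embeddings for three personas in one run.
embeddings = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.2], [1.0, 0.05, 0.25]]
similarity = mean_pairwise_cosine(embeddings)
collapse_flagged = similarity >= COLLAPSE_DELTA  # ">=" is an assumption
```

With these toy inputs the theme counts as stable, and the near-identical embeddings trigger the collapse flag, meaning the personas are responding too similarly to be treated as distinct voices.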

R2-10. The empirical support remains limited to a single scenario and a single LLM, so claims should be more carefully bounded.

We have expanded to three scenarios and two LLMs (see response to R1-1). In addition, the paper's claims have been systematically bounded: the abstract, discussion, and conclusion now use "preliminary," "initial evidence," and "exploratory" language consistently. The "What SAPIENT Cannot Justify" paragraph explicitly states four categories of inference that are not warranted.

See revised Abstract, Section 6 (all), Limitations, and Section 8.

Overall Improvements Motivated by All Reviewers

Beyond the specific items above:

Two new scenarios (greenhushing and crisis communication) address the generalizability concern shared by Reviewers 1 and 3 (see Experiments 5–6).
Cross-model comparison (Claude vs. GPT-4o) partially addresses the circularity concern raised by Reviewers 1 and 3 (see Experiment 7).
Paper repositioned as a framework contribution with preliminary validation, per Reviewer 3's recommendation (see Abstract, Contributions, Conclusion).
"What SAPIENT Cannot Justify" subsection added to Limitations, per Reviewer 3's request for explicit inference boundaries (see Limitations).

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript presents an interesting and timely conceptual framework that combines sentinel monitoring with LLM-based synthetic-population simulation to support corporate reputation intelligence. The topic is relevant, the architecture is ambitious, and the effort to incorporate governance, calibration, and human review is commendable. The paper is promising, especially in how it frames simulation as exploratory rather than predictive. However, the manuscript currently reads more like a strong system proposal than a fully validated research contribution. To be publishable in a stronger form, it would benefit from sharper positioning of novelty, clearer empirical validation, tighter methodological justification, and substantial editing for clarity and language.

Some issues with recommendations below:

The paper is still more conceptual than empirical.
Issue: Much of the manuscript proposes architecture, protocol, governance, and evaluation plans, but the empirical evidence remains preliminary relative to the breadth of the claims.
Recommendation: Reframe the paper more explicitly as a design or framework paper, or substantially strengthen the empirical section with more complete validation.

Novelty claims need to be more carefully bounded.
Issue: The paper claims to be among the first to combine sentinel monitoring and synthetic population simulation, but several adjacent systems have already been discussed, which makes the novelty claim somewhat vulnerable unless the distinctions are shown more systematically.
Recommendation: Add a comparison table that clearly contrasts SAPIENT with prior systems in terms of architecture, real-time coupling, calibration, governance, and intended use.

Validation of the simulation layer is still weak.
Issue: The paper acknowledges that the same model generates both the responses and the internal ratings, creating circularity in the evaluation.
Recommendation: Reduce reliance on model-internal metrics and add external validation using human-coded themes, expert ratings, or real stakeholder data.

Statistical interpretation should be more cautious.
Issue: The paper reports significance tests and effect sizes in preliminary experiments, but the design involves repeated synthetic runs with dependencies, limited effective sample sizes, and no correction for multiple comparisons.
Recommendation: Clarify the exploratory nature of these analyses and avoid presenting the results as strong confirmatory evidence.

The framework raises representativeness concerns that remain unresolved.
Issue: Although the manuscript discusses bias, calibration, and epistemic limits, the persona construction process still depends heavily on assumptions, priors, and prompt design.
Recommendation: Add a clearer discussion of what kinds of stakeholder inference are not justified, even when outputs appear plausible or stable.

Preliminary results are too narrow for the scope of the claims.
Issue: The experiments appear to focus mainly on Scenario 1, even though the framework claims to be useful across multiple settings, such as greenwashing, campaign testing, and crisis response.
Recommendation: Either limit the claims to the tested scenario or add empirical demonstrations across more than one application setting.

Human review is important, but not yet independently validated.
Issue: The paper includes a mandatory human-review gate, yet it also admits that the authors likely introduced confirmation bias into that review process.
Recommendation: Include an independent reviewer study or inter-rater agreement analysis to show that the review stage is reliable beyond the authors.

The manuscript is too long and sometimes overly dense.
Issue: The paper includes an extensive literature review, architectural details, a governance discussion, and an evaluation plan, which can obscure the central contribution.
Recommendation: Tighten the manuscript by reducing repetition, moving some details to an appendix, and focusing the main text on the core research contribution.

Comments on the Quality of English Language

Some issues with quick fixes below:

Missing hyphenation and article consistency
Issue: The abstract says “a multi agent system” instead of the more standard “a multi-agent system.”
Suggestion: Use “multi-agent” consistently throughout the paper.

Awkward phrasing
Issue: “Both are limited by speed, cost, and coverage as digital narratives form and spread faster” is understandable but slightly clumsy.
Suggestion: Revise to “Both are limited in speed, cost, and coverage in an environment where digital narratives form and spread rapidly.”

Missing hyphenation in compound modifiers
Issue: Terms such as “real time,” “focus group sessions,” and “signal conditioned personas” are not always consistently hyphenated when used adjectivally.
Suggestion: Standardize forms such as “real-time,” “focus-group sessions” when adjectival, and “signal-conditioned personas.”

Informal wording in a scholarly sentence
Issue: “what happens next” in quotation marks is readable but slightly conversational for this context.
Suggestion: Replace with “subsequent narrative development” or similar academic phrasing.

Minor style inconsistency in lists
Issue: Some bullet points are very long and contain multiple embedded clauses, which makes them hard to follow.
Suggestion: Split long bullet points into shorter declarative statements or convert them into a compact table.

Equation and notation explanation could be smoother
Issue: Some notation-heavy passages are grammatically correct but difficult to read because prose and symbols are packed too tightly together.
Suggestion: Introduce each symbol in shorter sentences and separate technical definitions from interpretive explanation.

Some sentences are overly long
Issue: Several sentences in the methods and governance sections run for many lines and become difficult to parse.
Suggestion: Break long sentences into two or three shorter sentences to improve readability and reduce ambiguity.

Author Response

We thank you for the strategically oriented evaluation, which helped us sharpen the paper's positioning. The recommendation to frame the work more explicitly as a design-oriented contribution with preliminary validation was adopted as the organizing principle for the revision.

R3-1. The paper is still more conceptual than empirical. Reframe as a framework paper or strengthen empirical section.

We adopted the first recommendation as the primary strategy and supplemented it with empirical strengthening. The paper is now explicitly framed as a "framework contribution with preliminary empirical validation" (Abstract, Contributions, Conclusion). Simultaneously, the empirical section was expanded from 4 to 8 experiments, from 1 to 3 scenarios, and from 1 to 2 LLMs, providing a more substantive empirical basis while maintaining honest positioning.

See revised Abstract (last sentence), Contributions (items 5–6), and Conclusion (paragraphs 1 and 4).

R3-2. Novelty claims need to be more carefully bounded. Add a comparison table with prior systems.

We have added a comparison table in the Theoretical Implications section that positions SAPIENT relative to five adjacent systems (GenSim, RumorSphere, CrisisBench, DualMind, and Park et al.) along eight architectural dimensions: real-time sentinel layer, signal state formalization, repeated-run variance reporting, human review gate, bidirectional coupling, cross-lingual support, governance framework, and calibration hooks. This table makes the distinctions concrete rather than relying on prose claims.

See the comparison table in Theoretical Implications.

R3-3. Validation of the simulation layer is still weak (circularity).

We address this through the cross-model comparison (Experiment 7), which shows that persona-level behavioral rankings are preserved across two different LLMs (r = 0.935, p < 0.001). While this does not resolve circularity—since both models still generate synthetic responses assessed by synthetic metrics—it demonstrates that the patterns are not single-model artifacts. We explicitly state in Limitations that external validation with human-coded themes is the necessary next step.

See the Cross-Model Comparison section (Experiment 7) and Limitations ("External validation").

R3-4. Statistical interpretation should be more cautious.

We have revised all statistical language to be explicitly exploratory. The Summary of Findings now states: "These results are exploratory. The study was not pre-registered, no correction for multiple comparisons was applied across experiments, and the effective sample sizes for some comparisons are small." We added a "Statistical scope" item to Limitations.

See the Summary of Findings section (final paragraph) and Limitations ("Statistical scope").

R3-5. The framework raises representativeness concerns that remain unresolved.

We added a "What SAPIENT cannot justify" paragraph to Limitations that explicitly bounds permissible inference: (a) no population-level estimates; (b) no individual-level predictions; (c) no regulatory/legal evidence; (d) no replacement of human research. This paragraph directly addresses the concern that plausible-looking outputs might be mistaken for externally valid findings.

See Limitations ("What SAPIENT cannot justify").

R3-6. Preliminary results are too narrow for the scope of the claims.

We have expanded to three tested scenarios (greenwashing, greenhushing, crisis communication). The remaining scenario (campaign pre-testing) is retained as a design illustration. All empirical claims are now bounded to the tested scenarios.

See Experiments 5–6 (Greenhushing and Crisis Communication sections).

R3-7. Human review is important but not yet independently validated.

We agree and now explicitly acknowledge this in Limitations: "The human review component has not been independently validated. The authors served as reviewers during the experiments, introducing potential confirmation bias. An independent inter-rater agreement study is specified as a future evaluation component."

See Limitations ("Human review gate").

R3-8. The manuscript is too long and sometimes overly dense.

We have moved detailed governance protocol specifications to the project repository and streamlined the main text. The revision adds new content (four experiment subsections, hypotheses, expanded configuration table, comparison table), offset by removals and compression in other sections.

Language Issues

R3-L1. "multi agent" should be "multi-agent."

Fixed throughout.

R3-L2. "Both are limited by speed, cost, and coverage as digital narratives form and spread faster" is clumsy.

Revised to: "Both are limited in speed, cost, and coverage in an environment where digital narratives form and spread rapidly."

R3-L3. Missing hyphenation in compound modifiers ("real time," "signal conditioned").

Standardized throughout: "real-time," "signal-conditioned," "focus-group sessions," "in-silico."

R3-L4. "what happens next" is informal.

Replaced with "subsequent narrative development."

R3-L5. Some bullet points are too long.

Long bullet points in the governance and methodology sections have been streamlined.

R3-L6. Notation-heavy passages are difficult to read.

We have separated symbol definitions from interpretive prose in the most notation-heavy passages.

R3-L7. Some sentences are overly long.

Long sentences in the methods and governance sections have been split for readability.

Overall Improvements Motivated by All Reviewers

Beyond the specific responses above:

Explicit hypotheses (H1–H4) and a hypothesis-mapped Summary of Findings were added, providing the clearer inferential structure advocated by Reviewer 2.
Runtime instrumentation with actual measurements (3,328 calls, $18.50, 34 min) was added per Reviewer 2's request.
Cross-model comparison (Experiment 7) partially addresses the circularity concern raised by Reviewer 1, providing an additional validation dimension.
Prompt sensitivity test (Experiment 8) addresses the robustness concern raised by Reviewer 1, showing that the primary metric is stable under surface-level prompt variation.
Practical extensibility: the revision process itself served as a test of the framework's extensibility, since adding GPT-4o and two new scenarios required only configuration changes, with no modifications to the core protocol or analysis pipeline.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

First, I would like to congratulate the authors on the work presented and on the improvements introduced in this revised version. The manuscript shows clear progress compared to the previous version, particularly in the expansion of the experimental section, the inclusion of additional scenarios and models, and a clearer acknowledgment of the limitations of the approach. These improvements contribute to strengthening the clarity and overall maturity of the work. Overall, given the progress achieved during the revision process, I consider that the article has reached a sufficient level of quality and recommend its acceptance in its current form.

Comments on the Quality of English Language

The manuscript is clear and well written overall, with a consistent academic tone. The quality of the English has improved compared to the previous version. Minor stylistic issues remain, such as occasional long sentences and dense formulations, but they do not hinder readability.

Author Response

We sincerely thank Reviewer 1 for the positive evaluation and for acknowledging the improvements made during revision. The assessment that the article has reached a sufficient level of quality is gratifying. The minor stylistic issues noted (occasional long sentences, dense formulations) have been addressed through a systematic language revision pass, with particular attention to the introduction, methodology, and governance sections.

For completeness, we note the following changes made since Reviewer 1’s evaluation, motivated primarily by Reviewer 3’s feedback:

A third LLM backend (Gemini 2.5 Flash, Google) was added to the cross-model comparison (Experiment 7, Section 6.11), extending the analysis to three architecturally distinct models. All pairwise persona correlations exceed r = 0.92 (p_adj < 0.003).
Experiment 8 (prompt sensitivity) was expanded from K = 5 to K = 20, confirming that sentiment is robust to prompt paraphrasing (p = 0.061, n.s.) while revealing that credibility is systematically sensitive (p < 0.001).
Benjamini–Hochberg correction was applied across all eleven inferential tests; nine of eleven survive FDR control at α = 0.05 (Table 4).
A persona prior source table (Table 1) was added documenting which attribute priors are data-informed and which are assumed.
Experimental totals now stand at 280 runs, 2,240 responses, 20 conditions, and 3 LLM backends.

These additions do not alter the conclusions from the version Reviewer 1 evaluated; they strengthen the empirical base.

Reviewer 2 Report

Comments and Suggestions for Authors

The current version of the manuscript adequately addresses my observations. I recommend it for publication.

Author Response

We thank Reviewer 2 for confirming that the revised manuscript adequately addresses all observations from Round 1. The recommendation for publication is deeply appreciated.

Since Reviewer 2’s evaluation, the following changes were made in response to Reviewer 3’s feedback:

Gemini 2.5 Flash was added as a third LLM backend, extending the cross-model comparison (Section 6.11) to three architecturally distinct models with consistent persona differentiation (all pairwise r > 0.92).
Benjamini–Hochberg correction was applied across all inferential tests (Table 4); nine of eleven survive at α = 0.05.
Experiment 8 was expanded from K = 5 to K = 20, and a persona prior source table (Table 1) was added.
Experimental totals increased to 280 runs across 20 conditions, 2,240 persona-level responses, and 3 LLM backends.

These additions reinforce the conclusions Reviewer 2 endorsed without altering the architectural contributions or positioning.

Reviewer 3 Report

Comments and Suggestions for Authors

The revised manuscript is improved over the prior version in several meaningful ways. It is more explicit about the architecture, adds preliminary experiments across multiple scenarios and two LLM backends, expands the discussion of calibration, governance, and limitations, and is generally more transparent about the intended use of the system as a qualitative, exploratory tool rather than a predictive instrument. However, despite these improvements, the paper's claims remain supported mainly by internally generated evidence rather than external validation. The manuscript itself acknowledges that human calibration is not yet implemented, that Stage 2 human comparison remains future work, that the analyses are exploratory and not preregistered, that no multiple-comparison correction was applied, and that some experiments have small effective sample sizes. In its current form, the work is better positioned as a promising framework paper with pilot results than as a sufficiently validated scientific contribution for publication.

Some issues are mentioned next in more detail:

Lack of external validation
How to fix it: Validate the simulation outputs against real human data, such as actual focus groups, survey responses, or historical stakeholder reactions, rather than relying primarily on model-generated outputs.
Circular evaluation design
How to fix it: The paper states that the LLM both generates the responses and produces the structured ratings used for evaluation. Replace or supplement these with external human annotations or independently defined outcome measures.
Evaluation plan is stronger than the executed validation
How to fix it: Many of the strongest validation stages, especially calibration against human data and prospective pilot deployment, are still future work. At least one of these should be completed before publication.
Small sample sizes for several experiments
How to fix it: Increase the number of runs, personas, scenarios, and independent replications so the reported effects are more stable and less sensitive to randomness. The paper itself notes small effective sample sizes in some comparisons.
No correction for multiple hypothesis testing
How to fix it: Apply an appropriate correction procedure, such as Holm or Benjamini-Hochberg, and clearly distinguish confirmatory from exploratory analyses.
No preregistration and exploratory analyses
How to fix it: Preregister hypotheses, metrics, stopping rules, and analysis plans for a follow-up study so that inferential claims are more credible.
Prompt sensitivity remains insufficiently addressed
How to fix it: Conduct a fuller ablation study varying prompt wording, persona formatting, output schema, and scenario translation choices, since the manuscript itself notes that systematic prompt-induced bias is not resolved by repeated runs alone.
Human review gate is not independently validated
How to fix it: Use independent reviewers, report inter-rater agreement, and separate system developers from evaluators to reduce confirmation bias. The paper acknowledges that the authors themselves served as reviewers.
Representativeness of personas is not demonstrated
How to fix it: Show how demographic and psychographic priors are estimated, justify the sampling distributions empirically, and compare synthetic subgroup outputs with observed subgroup data. Right now, many priors appear assumed rather than validated.
Claimed model-agnosticism is only weakly supported
How to fix it: Test more than two backends, across more than one scenario per backend, and analyze whether findings generalize beyond the limited cross-model comparison reported in the paper. The abstract itself notes that model-agnostic design was tested on only one scenario.

Comments on the Quality of English Language

English and grammar issues found:

“Submitted to Systemsfor possible open access publication...”
How to fix it: Add the missing space: “Submitted to Systems for possible open access publication...”
“The present paper addresses this question by proposing...”
How to fix it: This is grammatically acceptable, but repeated use of “the present paper” throughout the manuscript sounds formulaic. Replace some instances with “this study” or “this article.”
“supports early qualitative exploration with in-silico groups, while (c) making error visible...”
How to fix it: Parallelism is slightly awkward. Revise to “supports early qualitative exploration with in-silico groups, and makes error visible while keeping a human decision-maker in control.”
“In spite of this progress...”
How to fix it: Prefer the more standard academic phrasing “Despite this progress...”
“The gap remains: existing sentinel-style monitoring for brands focuses on detection, not on predicting...”
How to fix it: This is understandable, but stylistically heavy. Split into two sentences for readability.
“These findings point to a fundamental challenge for any LLM-based simulation platform...”
How to fix it: The sentence is long and dense. Break it into two shorter sentences to improve clarity.
“This is a key element borrowed from qualitative research methodology.”
How to fix it: The phrasing is somewhat informal and vague. Replace with “The moderator agent is adapted from established qualitative research practice.”
“No static defense is permanent.”
How to fix it: This is understandable but stylistically abrupt. A more academic alternative would be “No defense mechanism can be assumed to remain effective indefinitely.”
Hyphenation and line-break artifacts appear throughout the manuscript
How to fix it: Carefully proofread for broken words created by line wrapping, such as “stake- holders,” “quali- tative,” and similar forms in the draft. These should not appear in the final submitted version.
Overuse of long, overloaded sentences
How to fix it: Many sentences contain multiple clauses, parenthetical insertions, and lists. Shorten them and reduce clause stacking, especially in the introduction, literature review, and methodology sections, to improve readability and precision.

Author Response

We thank Reviewer 3 for the thorough evaluation and for recognizing the improvements over the prior version. The reviewer’s observation that the work is “better positioned as a promising framework paper with pilot results” aligns with our own framing, and we have taken concrete steps to strengthen the empirical basis within the scope of a revision cycle.

Below we respond to each substantive and language comment individually.

R3-S1. Lack of external validation: Validate the simulation outputs against real human data.

We agree that external validation is the most important next step for this line of research, and it constitutes Stage 2 of our staged evaluation roadmap (Section 5). Within this revision cycle, we strengthened the internal evidence through cross-model triangulation: three architecturally distinct LLMs (Claude Sonnet 4, GPT-4o, Gemini 2.5 Flash), built by three different organizations with different training corpora and alignment procedures, produced highly consistent persona-level behavioral rankings (all pairwise Pearson r > 0.92, p_adj < 0.003 after BH correction). While this does not constitute external validation in the strict sense—all three models still generate synthetic responses—the structural convergence across independently developed architectures reduces the probability that the observed patterns are artifacts of any single model’s biases.
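As a rough illustration of the triangulation check described above, the sketch below computes Pearson r for every pair of backends over persona-level mean scores. The dictionary keys, the helper name `pearson_r`, and all numeric values are hypothetical placeholders introduced for this example; they are not the study's data or code.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical persona-level mean scores, one value per persona (not real data).
persona_means = {
    "claude_sonnet_4": [2.1, 3.4, 4.8, 3.0, 5.2, 2.6, 4.1, 3.7],
    "gpt_4o": [2.5, 3.6, 5.0, 3.3, 5.5, 2.9, 4.4, 3.8],
    "gemini_2_5_flash": [2.3, 3.9, 4.9, 3.1, 5.6, 2.8, 4.0, 4.0],
}

# One correlation per unordered pair of backends (three pairs in total).
pairwise_r = {
    (a, b): pearson_r(persona_means[a], persona_means[b])
    for a, b in combinations(persona_means, 2)
}

for (a, b), r in pairwise_r.items():
    print(f"{a} vs {b}: r = {r:.3f}")
```

A high pairwise r only indicates that the backends order the personas similarly. It does not measure agreement with human respondents, which is the limitation the response concedes.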

A preregistered protocol for the Stage 2 follow-up study, specifying hypotheses, sample sizes, primary outcomes, comparison metrics, and stopping rules, will be deposited at OSF prior to Stage 2 data collection. This ensures that the remaining validation step follows a rigorous, pre-committed design rather than an ad hoc analysis.

See revised Experiment 7 (Section 6.11, now three-way comparison with three tables), Limitations (“External validation”), and Data Availability Statement.

R3-S2. Circular evaluation design: Replace or supplement with external human annotations.

The circularity concern is partially addressed through two strategies. First, cross-model triangulation across three independently developed backends provides evidence that the structural patterns are not single-model artifacts; the convergence of three different “judges” (even if all are LLMs) is conceptually analogous to inter-rater agreement in qualitative research. Second, the persona prior source table (Table 1) now makes the input assumptions explicit, enabling readers to assess whether the outputs are plausible given the inputs rather than relying solely on model-internal metrics. We acknowledge that these measures narrow but do not eliminate the circularity; human-coded theme comparison (Stage 2) is the definitive resolution, and a preregistered protocol for Stage 2 has been prepared.

See cross-model comparison (Section 6.11), persona prior source table (Table 1), Limitations (“External validation”).

R3-S3. Evaluation plan is stronger than the executed validation: At least one Stage should be completed.

We acknowledge this gap. The revision narrows it through three additions: (a) a third LLM backend extending the cross-model evidence, (b) the expanded Experiment 8 (K = 20) confirming the prompt sensitivity finding, and (c) the BH correction demonstrating that the statistical claims are robust. We recognize that the gap between the evaluation plan and executed validation is not fully closed; the Stage 2 preregistration protocol represents our concrete commitment to closing it in the near term.

R3-S4. Small sample sizes for several experiments.

The weakest comparison, Experiment 8, was expanded from K = 5 to K = 20 runs per condition. This quadrupling of the sample size not only resolved the small-sample concern but also revealed a substantive finding: the credibility sensitivity to prompt wording strengthened from p = 0.034 (at K = 5) to p < 0.001 (at K = 20), confirming it as a systematic effect rather than noise. All other experiments operate at K = 10 or K = 20, which we consider adequate for the exploratory scope of the analyses.

See revised Experiment 8 (Section 6.12), updated Table 16, and Summary of Findings.

R3-S5. No correction for multiple hypothesis testing.

Benjamini–Hochberg correction was applied across all eleven inferential tests conducted in Experiments 2–8. Nine of eleven tests survive FDR control at α = 0.05. The two non-significant results are methodologically expected: the primary metric (sentiment) remains robust to prompt paraphrasing (p_adj = 0.067), and the GPT-4o–Gemini sentiment difference is non-significant (p_adj = 0.212), consistent with these two models sharing a similar response baseline. As a more conservative sensitivity check, Holm–Bonferroni correction was also computed; Holm-corrected values are available in the project repository.
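The correction step can be sketched as follows, assuming the standard step-up Benjamini-Hochberg procedure. The function name `bh_adjust` is introduced here for illustration. In the example list, 0.061 is the raw sentiment p-value quoted in this response, 0.212 is treated as a raw value for the largest test (an assumption), and the other nine entries are invented placeholders.

```python
def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Eleven raw p-values: the first nine are hypothetical placeholders.
raw_p = [0.0004, 0.0009, 0.001, 0.002, 0.003, 0.004,
         0.008, 0.012, 0.020, 0.061, 0.212]
adj_p = bh_adjust(raw_p)
n_significant = sum(p <= 0.05 for p in adj_p)

for raw, adj in zip(raw_p, adj_p):
    print(f"raw p = {raw:.4f} -> BH-adjusted p = {adj:.4f}")
print("tests surviving FDR control at 0.05:", n_significant)
```

Under these assumptions, a raw p of 0.061 ranked tenth of eleven adjusts to 0.061 × 11/10 ≈ 0.067, which is in line with the adjusted value reported above.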

See new Section 6.3 (Multiple Comparison Correction), Table 4, and BH-adjusted p-values throughout Section 6.

R3-S6. No preregistration and exploratory analyses.

We agree that the current analyses are exploratory and cannot be retroactively preregistered. The revised manuscript now explicitly distinguishes exploratory from confirmatory framing throughout Section 6. To address this concern prospectively, a preregistered protocol for the Stage 2 follow-up study will be deposited at OSF prior to data collection, specifying hypotheses, sample sizes, primary outcomes, and stopping rules. This ensures that the next phase of validation follows a pre-committed design.

See Summary of Findings (final paragraph), Limitations (“Statistical scope”).

R3-S7. Prompt sensitivity remains insufficiently addressed: Conduct a fuller ablation study.

We expanded Experiment 8 from K = 5 to K = 20, which strengthened the credibility finding from borderline (p = 0.034) to highly significant (p < 0.001) and confirmed sentiment robustness (p = 0.061, n.s.). This expansion also yielded a methodological insight: credibility judgments are more sensitive to surface-level linguistic cues than overall attitudinal responses, a finding that contributes to the broader understanding of LLM-based simulation behavior.

A full factorial ablation varying persona formatting, output schema, and scenario translation choices would constitute an independent study with its own research design. We identify this as a priority for future investigation and have added it to the Limitations section.

See revised Experiment 8 (Section 6.12), Limitations (“Prompt sensitivity”).

R3-S8. Human review gate is not independently validated.

We acknowledge this limitation. The authors served as reviewers during the experiments, introducing potential confirmation bias. In the revised Limitations section, we state this explicitly and specify an independent inter-rater agreement study as a component of the Stage 4 prospective evaluation. The planned Stage 2 protocol includes provisions for independent evaluators who were not involved in system design.

See Limitations (“Human review gate”).

R3-S9. Representativeness of personas is not demonstrated.

We added a persona prior source table (Table 1) that documents each attribute category, its data source, and whether the prior is data-informed (derived from census or survey data) or assumed (set by the authors based on domain knowledge). Of ten attribute categories, four are grounded in published data (TÜİK census 2023 for age, gender, education, and geographic region) and six are assumed with uniform or categorical priors. We state explicitly that the persona priors define the simulation’s input space, not a population-level claim: “We do not claim that persona priors reproduce actual population distributions.”

See new Table 1 (Persona Prior Sources) and accompanying text in Section 3.3.

R3-S10. Claimed model-agnosticism is only weakly supported.

We added Gemini 2.5 Flash (Google) as a third LLM backend and repeated Experiment 7 (Scenario 1, Variant C) with K = 20 runs. The three-way cross-model comparison now spans three architecturally distinct models from three different organizations. All pairwise persona correlations exceed r = 0.92 (Pearson, p_adj < 0.003). We softened the “model-agnostic” claim to “cross-backend portability” throughout the manuscript and retain the caveat that this has been demonstrated on one scenario; multi-scenario cross-model replication is identified as future work.

See revised Experiment 7 (Section 6.11, three tables), Abstract, Discussion, and Conclusion.

Language Comments

All ten language issues have been addressed:

“Submitted to Systemsfor…” Missing space fixed.
Repeated “the present paper.” Instances replaced with “this study,” “this article,” and “the current work” as appropriate.
Parallelism in “while (c) making error visible…” Revised to: “and (c) makes error visible while keeping a human decision-maker in control.” See Section 1.
“In spite of this progress…” Revised to: “Despite this progress…” See Section 1.
Dense gap sentence. Split into two shorter sentences. See Section 2.1.
Long dense sentence about LLM-based simulation challenge. Split from one compound sentence into three shorter declarative sentences. See Section 2.1.
“This is a key element borrowed from qualitative research methodology” — informal. Revised to: “The moderator agent is adapted from established qualitative research practice.” See Section 3.3.
“No static defense is permanent” — abrupt. Revised to: “No defense mechanism can be assumed to remain effective indefinitely.” See Section 3.5.
Hyphenation and line-break artifacts. The manuscript has been proofread for broken words created by line wrapping. Any remaining artifacts will be resolved in the final typeset version.
Overuse of long, overloaded sentences. A systematic pass was conducted through the introduction, literature review, and methodology sections.

Summary of All Changes in This Revision

Change	Addresses
Gemini 2.5 Flash added as 3rd LLM backend	R3-S1, S2, S10
3-way cross-model comparison (3 tables)	R3-S1, S2, S10
BH correction (11 tests, 9 significant)	R3-S5
Experiment 8 expanded: K = 5 → 20	R3-S4, S7
Persona prior source table	R3-S9
OSF preregistration protocol prepared for Stage 2	R3-S3, S6
Claim softening throughout	R3-S3, S10
10 language fixes	R3-L1–L10
Updated totals: 280 runs, 2,240 responses, 20 conditions, 3 LLMs	All

Author Response File: Author Response.pdf

Round 3

Reviewer 3 Report

Comments and Suggestions for Authors

The paper is interesting and ambitious, but it still reads more like a framework paper with preliminary experiments. Let me list a few issues:

The paper still lacks external validation against real human data. The manuscript explicitly says the evidence is preliminary, does not replace human validation, and that external validation with human-coded themes, expert ratings, or real stakeholder data is still the necessary next step. For many reviewers, this is the single biggest barrier because the core claim depends on whether simulated stakeholders correspond meaningfully to real ones.
The calibration problem is not solved. The paper acknowledges that prompt-based persona conditioning has limited calibration power and that outputs should be treated as directional indicators rather than calibrated estimates until fine-tuning or PPI-style correction is added. That means the system’s central output is still not statistically trustworthy enough for a strong empirical claim.
Prompt sensitivity remains a serious methodological weakness. The paper shows that credibility scores change significantly under surface-level paraphrasing and states that a full factorial ablation is still future work. That leaves open the concern that some reported effects may reflect wording choices in prompts rather than stable properties of the framework.
The human review gate is not independently validated. The authors themselves acted as reviewers during experiments, and the paper admits possible confirmation bias. Since human review is presented as an important safety and governance mechanism, not validating that component independently weakens the credibility of the full workflow.
The statistical posture is still exploratory, not confirmatory. Even though BH correction was applied, the paper still says the study was not preregistered and results should be interpreted as preliminary rather than confirmatory evidence. For a journal paper making multiple architectural and empirical claims, that lowers confidence in the strength of the conclusions.

Comments on the Quality of English Language

The writing is still too dense and overloaded with stacked technical nouns. The title and abstract pack too many concepts into single phrases, which hurts readability and makes the paper feel heavier than necessary. This is especially visible in the title and in phrases like “LLM-Based Synthetic Population Simulation” and “Sentinel-Augmented Population Intelligence for Emerging Narrative Tracking.”
Many sentences are too long and carry too many ideas at once. For example, the contribution bullets and several literature-review paragraphs combine novelty claim, comparison to prior work, qualification, and implication in one sentence. This makes the prose harder to follow and increases reviewer fatigue.
There is frequent repetition of cautionary phrasing and metadiscourse. The paper repeatedly says things such as the framework is “preliminary,” “qualitative-exploratory,” “not a substitute,” “not a prediction oracle,” and “not externally valid.” These cautions are appropriate, but they are repeated so often that they start to dilute the flow and make the argument feel defensive.
Some wording is awkward or overly abstract. Phrases such as “makes error visible,” “epistemic boundaries,” and “directional indicators” are understandable to specialists, but they are still somewhat abstract and can sound vague if not grounded immediately in plain language. Reviewers outside the exact niche may see this as conceptual inflation.
The paper sometimes sounds promotional instead of analytical. Expressions like “to the best of our knowledge,” “among the first,” “practical extensibility,” and repeated emphasis on novelty make the tone feel closer to advocacy than neutral scientific reporting. In a journal submission, this can trigger skepticism unless the claims are very tightly supported.

Author Response

We sincerely thank Reviewer 3 for the sustained, rigorous, and constructive evaluation across three rounds. The reviewer's insistence on external human validation was the single most impactful feedback this manuscript received. It pushed us to conduct a study that, in retrospect, should have been part of the original submission. The resulting Experiment 9, and the credibility-sentiment dissociation it revealed, has made the paper substantially stronger. We are grateful for the reviewer's persistence on this point.

R3-R3-1. The paper still lacks external validation against real human data.

We agree that external validation was the most important remaining step. In this revision, we conducted a preregistered pilot human validation study (Experiment 9) using 54 participants recruited through Prolific, a widely used academic research platform. The platform choice was deliberate: SAPIENT simulates general stakeholder reactions, not expert judgments, and a paid, pre-screened general-population sample is the methodologically appropriate comparison group for the framework's intended use case.

Primary analyses used 54 analyzable responses from 54 unique participants who completed the three framing evaluations and passed the attention check. Of these, 34 also completed the post-exposure comparative items (forced-choice selections and demographics).

The results showed partial external replication of the SAPIENT pattern. The predicted credibility ranking was reproduced: accountability was rated highest, followed by progress and then targets (targets, A = 4.20; progress, B = 4.61; accountability, C = 5.04). The preregistered paired comparison confirmed that accountability framing was rated more credible than targets framing (Wilcoxon one-tailed p = 0.004, Cohen's d = 0.40). Among the 34 participants who completed the comparative items, accountability was selected as most credible by 53% (18/34), while targets was selected as least credible by 62% (21/34).
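A minimal sketch of a paired comparison of this kind is given below. It assumes a one-sided Wilcoxon signed-rank test using the normal approximation (with tie correction and without continuity correction) and a paired-differences effect size (Cohen's d_z). The response does not state which variant of Cohen's d or which test implementation was used, so both choices are assumptions, and the ratings are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_effect_and_wilcoxon(low, high):
    """One-sided Wilcoxon signed-rank test (H1: high > low) plus Cohen's d_z."""
    diffs = [h - l for l, h in zip(low, high)]
    d_z = mean(diffs) / stdev(diffs)  # standardized mean paired difference

    nonzero = [d for d in diffs if d != 0]  # zero differences are dropped
    n = len(nonzero)
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average rank shared by a block of ties
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    mu = n * (n + 1) / 4
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48)
    p_one_sided = 1 - NormalDist().cdf((w_plus - mu) / sigma)
    return d_z, w_plus, p_one_sided


# Hypothetical 7-point credibility ratings from ten participants (not real data).
targets_ratings = [4, 3, 5, 4, 2, 5, 4, 3, 6, 4]
accountability_ratings = [5, 5, 5, 6, 4, 5, 6, 4, 6, 5]

d_z, w_plus, p_value = paired_effect_and_wilcoxon(targets_ratings,
                                                  accountability_ratings)
print(f"d_z = {d_z:.2f}, W+ = {w_plus}, one-sided p = {p_value:.4f}")
```

The sketch would fail with a division by zero if every paired difference were zero, so a production analysis would guard that case or rely on an established statistics library.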

The sentiment ranking was not replicated: human participants rated all three variants similarly (A = 5.41, B = 5.20, C = 5.06; Friedman p = 0.37, n.s.). A ceiling effect in sentiment ratings (50-61% of participants scored 6 or 7 out of 7) contributed to this compression. We report this transparently and interpret it as evidence that framing affects credibility judgments more than overall sentiment.
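For a three-variant within-subject comparison such as this one, the Friedman test can be sketched as below. The example assumes average ranks for ties with the usual tie correction, and it uses the fact that the chi-square survival function with 2 degrees of freedom is exp(-x/2), so no statistics library is needed. The function name and the ratings are hypothetical, and the authors' actual implementation is not described in the response.

```python
from math import exp


def friedman_three_conditions(ratings):
    """Friedman chi-square statistic and p-value for rows of three ratings."""
    k = 3
    if any(len(row) != k for row in ratings):
        raise ValueError("each row must contain exactly three ratings")
    n = len(ratings)
    rank_sums = [0.0] * k
    tie_total = 0
    for row in ratings:
        order = sorted(range(k), key=lambda i: row[i])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1  # average rank within a block of ties
            for pos in range(i, j + 1):
                rank_sums[order[pos]] += avg_rank
            t = j - i + 1
            tie_total += t ** 3 - t
            i = j + 1
    q = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    correction = 1 - tie_total / (n * k * (k * k - 1))  # zero if all rows are flat
    q /= correction
    p = exp(-q / 2)  # chi-square survival function for df = k - 1 = 2
    return q, p


# Hypothetical 7-point sentiment ratings for variants A, B, C (not real data).
sentiment = [[6, 5, 5], [5, 6, 6], [7, 6, 6], [5, 5, 4], [6, 7, 5], [4, 5, 6]]
q_stat, p_value = friedman_three_conditions(sentiment)
print(f"Friedman chi-square = {q_stat:.3f}, p = {p_value:.3f}")
```

With a ceiling effect of the kind described above, many participants give identical ratings to all three variants. Such rows carry no rank information, which is one reason a compressed scale makes a difference harder to detect.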

This pilot does not constitute full Stage 2 calibration, but provides the first empirical contact between SAPIENT outputs and real human responses. Open-ended theme overlap reached the preregistered threshold in two of three variants (progress and accountability; targets fell below), and human response variance exceeded SAPIENT variance across all conditions (ratio: 2.0-5.0), consistent with the LLM simulation literature.

See new Section 6.10 (Experiment 9), updated Abstract, Summary of Findings, Discussion, Limitations, and Conclusion.

R3-R3-2. The calibration problem is not solved.

The pilot human data provides an initial calibration reference point. For credibility, SAPIENT's variant means were close to human values (e.g., Variant A: SAPIENT 3.81, Human 4.20, difference = 0.39). For sentiment, a systematic offset was observed: human means were approximately 1.0 point higher than SAPIENT means, and the framing differentiation present in SAPIENT was absent in human data. This identifies a specific calibration target: LLM personas appear to process framing cues more analytically than real respondents, producing differentiated sentiment where humans exhibit a flatter affective baseline. Sentiment calibration through fine-tuning or PPI-style correction remains a priority for Stage 2. The architecture's modular design accommodates this integration without structural changes. We note that the observed sentiment effect size (d = 0.15) would require approximately 350 participants to detect at 80% power, confirming that the present pilot was not powered for this dimension. The credibility-sentiment asymmetry is itself an informative finding: it identifies which judgment types are amenable to LLM-based simulation and which require further calibration.
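The sample-size figure quoted above can be approximated with the normal-approximation formula n ≈ ((z_{1-α/2} + z_{power}) / d)². The sketch below assumes a paired (within-subject) design and a two-sided α of 0.05; the response does not state which settings were used, so these are assumptions, and the function name is introduced here for illustration.

```python
from math import ceil
from statistics import NormalDist


def required_n(d, alpha=0.05, power=0.80, two_sided=True):
    """Approximate number of paired observations needed to detect effect size d."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2 if two_sided else 1 - alpha)
    z_power = z.inv_cdf(power)
    return ceil(((z_alpha + z_power) / d) ** 2)


print(required_n(0.15))                   # two-sided: 349
print(required_n(0.15, two_sided=False))  # one-sided: 275
```

Under the two-sided assumption the result is 349, consistent with the "approximately 350" stated above. An exact calculation based on the noncentral t distribution would give a slightly larger number.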

See Experiment 9 discussion, updated Limitations, and Conclusion.

R3-R3-3. Prompt sensitivity remains a serious methodological weakness.

The pilot human study partially addresses this concern from the human side: the credibility framing effect detected by SAPIENT (C > A) was replicated in human responses (p = 0.004), suggesting that the reported pattern reflects a genuine stimulus property rather than a prompt artifact. The sentiment non-replication, conversely, indicates that some SAPIENT-detected effects may be sensitive to the specific way personas process linguistic cues. This is now reported explicitly as a limitation.

See Experiment 9 results and updated Limitations.

R3-R3-4. The human review gate is not independently validated.

We acknowledge this limitation. Within this revision, we prioritized the more fundamental concern: external validation of simulation outputs against real human responses (Experiment 9). Independent human review gate validation, including inter-rater agreement with evaluators not involved in system design, is specified as a component of the Stage 4 prospective pilot.

See updated Limitations.

R3-R3-5. The statistical posture is still exploratory, not confirmatory.

Experiment 9 was preregistered at OSF prior to data collection (DOI: 10.17605/OSF.IO/4KFDC), with pre-specified hypotheses (H9a, H9b, H9c), analysis plan (Wilcoxon signed-rank test, alpha = 0.05), and exclusion criteria (attention check, speed filter, duplicate text). This directly addresses the preregistration concern for the new validation component. The primary credibility result (p = 0.004) survives Bonferroni correction for the three preregistered tests (adjusted p = 0.012). Experiments 1-8 remain exploratory, as stated in the manuscript.
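The Bonferroni adjustment mentioned above multiplies each raw p-value by the number of tests and caps the result at 1. In the sketch below, only the first p-value (0.004) comes from this response; the other two are hypothetical placeholders for the remaining preregistered tests.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: raw p times the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# 0.004 is the reported credibility result; 0.20 and 0.50 are placeholders.
adjusted = bonferroni([0.004, 0.20, 0.50])
print([round(p, 3) for p in adjusted])  # [0.012, 0.6, 1.0]
```

The first adjusted value, 0.004 × 3 = 0.012, matches the figure given above and remains below 0.05.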

See updated Hypotheses (Section 6.1), Experiment 9 method, and updated Limitations.

Language Comments

All five language issues have been addressed with specific, verifiable changes in the revised manuscript. Below we list each concern and the concrete edits made.

R3-L1. Writing is too dense with stacked technical nouns.

The abstract has been shortened by approximately 20%, removing redundant framing (e.g., the sentence "The simulation layer is treated as an instrument for generating qualitative hypotheses, not as a substitute for human judgment or a measure of public opinion" was removed from the abstract, where it was restating what the body already explains). The opening was tightened from two sentences to one. The results paragraph now uses shorter, direct constructions instead of stacking multiple findings into single sentences.

R3-L2. Many sentences are too long and carry too many ideas.

Specific changes:

The first contribution bullet was reduced from 84 words (one sentence) to two shorter sentences totaling 55 words. The phrase "To the best of our knowledge, SAPIENT is among the first systems to explicitly couple..." was replaced with a direct statement: "SAPIENT couples..."
The fifth contribution bullet was reduced from 75 words to 50 words by splitting the hypothesis listing and the BH correction statement into separate sentences.
The Conclusion section was shortened by approximately 30%. The opening paragraph, which previously compressed all experimental results into a single 76-word sentence, was broken into four shorter sentences.
The cross-model interpretation paragraph (Section 6.7) was restructured: one compound sentence with two em-dash-delimited parenthetical clauses was split into three direct sentences.

R3-L3. Frequent repetition of cautionary phrasing.

We conducted a systematic audit and reduced repetitive cautionary language:

"preliminary" reduced from 6 occurrences to 2 (retained only in the anomaly detection agent's technical description and the Statistical Scope limitation, where the term is appropriate).
"epistemic boundaries" reduced from 4 occurrences to 0. The one remaining use of "epistemic" is "epistemic trust judgment" in Experiment 9, where it describes a specific psychological construct.
"not a predictive oracle or a substitute for human research" in the Framework introduction (Section 3) was shortened to a single clause: "The simulation layer generates qualitative hypotheses for human review."
"practical extensibility" was removed entirely (previously appeared 3 times). The Discussion paragraph retains the factual content about modularity without the promotional label.
The "not just X, but Y" pattern was removed (Stage 2 evaluation plan).

R3-L4. Some wording is awkward or overly abstract.

Specific replacements:

"makes error visible" (Introduction research question) → "surfaces potential errors for human review"
"epistemic boundary" (review gate checklist) → removed; the sentence now reads "without presenting hypotheses as predictions"
"directional indicators" (Limitations) → "approximate guides"
"validity limits" replaces "epistemic boundaries" in the calibration literature review and the Conclusion

R3-L5. Tone sometimes sounds promotional instead of analytical.

Specific changes:

"To the best of our knowledge, SAPIENT is among the first" (first contribution bullet) → removed entirely. The bullet now states what SAPIENT does and cites adjacent work, letting the reader assess novelty.
"practical extensibility" (contribution bullet, Discussion, Conclusion) → removed as a label. The factual statements about configuration-level changes remain.
"remarkably stable" (cross-model comparison) → "stable"
"constitutes the strongest evidence in this study" (cross-model comparison) → "provides consistent evidence"
The Conclusion paragraph that repeated the extensibility claim verbatim from the Discussion was removed to avoid redundancy.
All eight em-dashes in the manuscript were replaced with commas, parentheses, or sentence breaks, reducing the density of parenthetical asides that contributed to the overloaded feel.

Summary of All Changes in This Revision

Change	Addresses
Preregistered pilot human validation (Experiment 9, n = 54, Prolific)	R3-R3-1, R3-R3-2, R3-R3-5
Credibility ranking replicated (A < B < C, p = 0.004)	R3-R3-1
Sentiment non-replication reported transparently	R3-R3-1, R3-R3-3
OSF preregistration with pre-specified hypotheses and analysis plan	R3-R3-5
Updated Abstract (shortened ~20%, removed redundant cautions)	R3-L1, R3-L3
Contribution bullets shortened and de-promotionalized	R3-L2, R3-L5
"To the best of our knowledge" and "among the first" removed	R3-L5
"preliminary" reduced from 6 to 2 occurrences	R3-L3
"epistemic boundaries" reduced from 4 to 0	R3-L4
"practical extensibility" removed (3 occurrences)	R3-L3, R3-L5
"remarkably stable" → "stable"	R3-L5
"constitutes the strongest evidence" → "provides consistent evidence"	R3-L5
"directional indicators" → "approximate guides"	R3-L4
"makes error visible" → "surfaces potential errors for human review"	R3-L4
All 8 em-dashes replaced with commas, parentheses, or sentence breaks	R3-L1, R3-L2
Conclusion shortened ~30%, removed redundant extensibility paragraph	R3-L2, R3-L3
Updated Limitations, Discussion, IRB, Informed Consent, Data Availability	R3-R3-1
Experimental totals: 9 experiments, 280 AFG runs + 54 human responses, 3 scenarios, 3 LLMs	All
