1. Introduction
This study tackles a central problem in monetary economics: measuring, quickly and consistently, the hawkish/dovish stance embedded in the Federal Reserve’s biannual Monetary Policy Reports (MPRs). Markets react not only to policy actions but also to central-bank language, so timely and credible stance measures can improve monitoring, risk management, and empirical research [1,2,3,4]. The paper therefore asks whether a transparent, low-cost NLP workflow can transform unstructured MPR text into reproducible stance scores and thematically organized insights suitable for academic and policy use.
The objective is to build a near-real-time, auditable pipeline that converts the FED’s MPRs into quantitative stance scores on a −2 (dovish) to +2 (hawkish) scale, together with confidence/uncertainty metrics and thematic rationales (inflation, interest rates, employment, forward guidance, key policy signals). The hypothesis is that an agentic RAG architecture (document parsing and chunking, embeddings, FAISS orchestrated by LangChain, and GPT-4o reasoning) can recover stable and theory-consistent stance measures at very low latency and cost, and that these measures will track known policy transitions. The dataset comprises 26 MPRs spanning 26 February 2013 to 20 June 2025 to ensure comparability of tone and format across time.
Methodologically, PDFs are parsed and chunked, embedded, indexed in FAISS, and retrieved into a two-stage GPT-4o chain: first, the model assigns an overall stance score with confidence and uncertainty; second, it produces category-specific syntheses that surface the arguments driving the score. To ensure reliability and academic rigor, a four-dimensional validation framework that addresses prominent concerns in text-as-data research is implemented [5,6]. Semantic consistency tests use cosine-similarity consistency ratios to verify clear within-stance cohesion and between-stance separation, strongest for key policy signals. Numerical consistency aligns observed correlations with ranges implied by the Taylor-rule tradition and modern monetary analysis [7,8], yielding a Numerical Consistency Score (NCS) of 0.800. Bootstrap stability [9] shows high stability for most metrics while leaving the stance score appropriately variable, reflecting genuine policy dynamics rather than noise, consistent with regime-variation evidence [10]. Content-quality diagnostics based on length consistency, lexical diversity, and information content produce an average CQS of 0.647. The integrated validation score is 0.796 (B+/Good), supporting publication-grade measurement standards.
The main findings indicate a predominant Neutral distribution (50.0%) with Dovish (26.9%) and Hawkish (23.1%) shares; the average stance is close to zero (0.019), the volatility is σ ≈ 0.866, and we detect a recent hawkish drift of about +0.8 points in the latest MPR window. These patterns are consistent with the literature documenting that central-bank communication contains information distinct from rate moves and that tone can shape expectations and asset prices.
Conclusions emphasize that an embedding-based, agent-orchestrated RAG system with GPT-4o can deliver transparent, scalable, and low-cost measurement of FED stance. Relative to dictionary methods, the agentic workflow captures context and nuance while remaining auditable and fast enough for policy tracking, transition detection, and downstream empirical work (e.g., event studies and forecasting overlays). Limitations include the biannual frequency of MPRs, sensitivity to prompts and model updates, and the descriptive (non-causal) nature of stance scores. Future research should extend coverage to Federal Open Market Committee (FOMC) statements, minutes, press conferences, and speeches; link stance shocks to high-frequency asset-price moves for external validation; compare open-source and proprietary models; and formalize guardrails to mitigate Large Language Model (LLM) measurement error.
For orientation, the paper is structured as follows:
Section 2 motivates the problem and reviews the literature;
Section 3 describes the data and preprocessing and details the agentic RAG architecture;
Section 4 presents the validation framework and results;
Section 5 deals with Hawkish or Dovish signals;
Section 6 reports main empirical findings and discusses limitations; and finally,
Section 7 concludes.
The main contribution of the study is that, by targeting the biannual MPRs and combining agentic RAG with a formal, multi-pronged validation suite, it provides a credible, timely, and reproducible measure of the FED’s hawkish/dovish narrative that practitioners can deploy immediately and scholars can audit and extend.
2. Literature Review
Extensive literature shows that central-bank communication is itself a policy instrument that shapes expectations and asset prices and contains independent information beyond rate surprises [1]. Comprehensive surveys conclude that communication enhances market understanding and policy effectiveness, even as optimal strategies vary across institutions [2]. These foundations motivate systematic measurement of the stance embedded in the Federal Reserve’s written reports.
Subsequent work decomposes policy news into finer components. Reference [3] shows that language in central-bank communication transmits macroeconomic shocks, while [4] separates “monetary” from “non-monetary” news and documents that information conveyed in words and guidance matters alongside rate moves [11]. Together, these papers justify stance-oriented text measurement that can be linked to theoretical benchmarks and market reactions.
Within text-as-data methods, early approaches relied on hand-curated dictionaries. These include supervised scoring applied to FOMC statements [12], as well as real-time analyses of market responses to central-bank words [13]. Central-bank practitioners also developed guidance on text mining for policy analysis [14]. These efforts established feasibility but faced trade-offs in nuance, coverage, and portability across document types.
A broader methodological canon in economics and political science highlights the importance of transparent feature representations, out-of-sample validation, and principled uncertainty quantification for measurement and inference. Reference [15] surveys representational choices, prediction targets, and the importance of external validation; Grimmer and Stewart [5,6,9] emphasize semantic, predictive, and face validity tests and warn against unvalidated dictionary methods; and reference [6] develops principled lexical scoring (“Fightin’ Words”) to identify discriminating terms. These principles underpin the proposed design choices for semantic consistency, numerical consistency, bootstrap stability, and content quality.
More recently, Large Language Models (LLMs) have expanded the frontier of central-bank communication analysis. An IMF working paper [16] fine-tunes an LLM on a multilingual, decades-long corpus to classify topic, stance, sentiment, and audience, demonstrating scalable classification at sentence level across 169 central banks. In parallel, European Central Bank (ECB) researchers show that ChatGPT-4o-derived sentiment from two pages of PMI commentary significantly improves euro-area GDP nowcasts, underscoring the value of narrative signals even from small text snippets [17]. These advances validate the use of modern embeddings, retrieval, and LLM scoring for policy-relevant measurement, while also motivating rigorous validation and cost/latency tracking.
Beyond text alone, multimodal work highlights that non-verbal cues and delivery matter: tone, prosody, and body language can move markets after controlling for actions and text [18], reinforcing that communication channels are multifaceted and economically meaningful. Recent empirical studies highlight that central bank communications contain much more information than what is conveyed in formal statements alone. Research on FOMC minutes and transcripts shows that committees provide significant forward-looking guidance, which financial markets quickly embed into asset prices [19]. Evidence from inflation-targeting countries further suggests that transparent communication frameworks strengthen policy credibility and help anchor expectations [20]. At the same time, developments in computational finance demonstrate that accelerated diffusion models with jump components can capture the abrupt shifts associated with changes in policy regimes [21]. These models offer a complementary, high-frequency perspective that supports the validation of stance measures derived from textual analysis.
Concerning document classes, prior studies frequently emphasize press conferences and post-meeting statements, whereas the MPR remains understudied despite its statutory role and stable format [22]. This paper addresses that gap by building a reproducible pipeline on the complete set of MPRs from 2013 to 2025, enabling like-for-like comparisons over time.
Lastly, theoretical anchors from monetary-policy norms link textual posture to macroeconomic trade-offs. References [7,8] elucidate the links between inflation, output, and interest rates within the context of policy rules. This framework serves as a basis for determining whether language reflects a hawkish or dovish stance as a systematic policy signal rather than merely a linguistic artifact.
The literature establishes that (i) central-bank words matter for expectations and asset prices; (ii) text can be decomposed into meaningful policy news; (iii) transparent, validated measurement is feasible with modern NLP; and (iv) MPRs offer a tractable, underused corpus for stance measurement linked to theoretical benchmarks and downstream empirical uses.
3. Methodology: Preprocessing, Segmentation, and Embeddings
This section describes a low-latency pipeline that converts the Federal Reserve’s biannual MPRs into quantitative measures of monetary-policy stance and structured thematic signals. The approach follows best practices in text-as-data (clear preprocessing, explicit representations, retrieval with principled similarity, and transparent model prompting) so that outputs are reproducible and suitable for empirical work [5].
The end-to-end data processing and analysis pipeline is visually represented in
Figure 1. This diagram shows the step-by-step process from raw PDF files to the final structured analytical output, with details on each key phase and the technologies used. All semiannual MPRs submitted by the Board of Governors to the U.S. Congress from February 2013 through June 2025 (26 reports) have been analyzed. Documents are sourced from the Federal Reserve’s public archives in their original PDF format to preserve layout fidelity and ensure full traceability to the official record.
PDFs are parsed with a layout-aware extractor to recover reading order across multi-column pages and to capture footnotes, tables, and figure captions conservatively. Tables are converted to a plain-text matrix (Markdown) so rows and columns remain machine-readable downstream. All text is normalized (UTF-8, whitespace, hyphenation, page headers/footers removed), and document-level metadata (publication date, URL, and section headers) is retained for auditability.
To maintain semantic coherence while respecting context limits, each document is segmented into overlapping units with a 200-character overlap. Let $d$ denote a document and $\{c_1, \ldots, c_K\}$ its chunks; each chunk carries pointers to its source page and character offsets for exact provenance. Each chunk $c_i$ is mapped to a dense vector in $\mathbb{R}^n$ via an embedding function
$$e_i = \phi(c_i) \in \mathbb{R}^n,$$
which produces a semantic representation suitable for similarity search. Transformer-based embeddings provide context-sensitive semantics beyond bag-of-words [23,24]. In practice, we use a compact, low-cost model to minimize latency and expense while preserving retrieval quality.
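For concreteness, the overlapping segmentation with offset provenance can be sketched as follows (a minimal illustration; the 1000-character chunk size is an assumed placeholder, since only the 200-character overlap is specified in the text):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200):
    """Split a document into overlapping character windows, keeping
    start/end offsets so each chunk can be traced back to the source."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start, step = [], 0, chunk_size - overlap
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({"text": text[start:end], "start": start, "end": end})
        if end == len(text):
            break
        start += step
    return chunks
```

Consecutive chunks share exactly `overlap` characters, and the stored offsets support the chunk-level citations used for auditability.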
All vectors $e_i$ are indexed with FAISS for sub-second approximate nearest-neighbor search in high dimensions [25]. Relevance between a user/query vector $q$ and a chunk vector $e_i$ is computed via cosine similarity:
$$\mathrm{sim}(q, e_i) = \frac{q \cdot e_i}{\lVert q \rVert\, \lVert e_i \rVert}.$$
To reduce redundancy in the retrieved set, we apply Maximal Marginal Relevance (MMR) [26]:
$$\mathrm{MMR} = \arg\max_{c_i \in C \setminus S} \left[ \lambda\, \mathrm{sim}(q, c_i) - (1 - \lambda) \max_{c_j \in S} \mathrm{sim}(c_i, c_j) \right],$$
where $C$ is the candidate pool, $S$ the selected set, and $\lambda \in [0, 1]$ balances relevance and diversity.
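A compact sketch of the greedy MMR selection (pure Python, with a plain cosine for clarity; in the pipeline the similarities come from the FAISS-indexed embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_select(query, candidates, k=3, lam=0.7):
    """Greedy MMR: at each step pick the candidate maximizing
    lam * sim(query, c) - (1 - lam) * max sim(c, already-selected)."""
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            rel = cosine(query, candidates[i])
            red = max((cosine(candidates[i], candidates[j]) for j in selected),
                      default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a small `lam`, the second pick jumps to a dissimilar chunk even if it is less relevant, which is exactly the redundancy reduction MMR is meant to provide.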
A Retrieval-Augmented Generation (RAG) architecture [27] is implemented with agentic control: a retrieval agent gathers the top-$k$ chunks for a query; a reasoning agent performs scoring and explanation; and a verification agent enforces schema and sanity checks.
In Stage 1, global stance scoring is performed. The system constructs a document-level query (“assess overall monetary stance given dual mandate trade-offs and inflation/growth signals”) and retrieves high-yield chunks spanning the report’s core sections. The LLM returns an overall stance score and a stance label, plus a short rationale with citations to chunk IDs.
Stage 2 then performs thematic extraction. Independent queries target specific dimensions: inflation, interest rates, employment, forward guidance, and key policy signals. For each dimension, the agent retrieves focused evidence and produces (i) a concise summary, (ii) a 0–5 intensity score for the relevant concern/strength, and (iii) an uncertainty note. Prompts are fully templated and versioned; temperature and decoding parameters are fixed for stability. Outputs are validated against a JSON schema (types, ranges, labels) before being written to disk.
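The verification agent’s schema check can be illustrated with a minimal validator (the field names and ranges here are illustrative assumptions mirroring the −2..+2 stance scale and 0–1 confidence; the production pipeline validates the full JSON schema):

```python
def validate_record(rec: dict) -> list:
    """Return a list of violations for one LLM output record;
    an empty list means the record passes this (illustrative) schema."""
    errors = []
    ranges = {"stance_score": (-2.0, 2.0), "confidence": (0.0, 1.0)}
    for field, (lo, hi) in ranges.items():
        val = rec.get(field)
        if not isinstance(val, (int, float)) or isinstance(val, bool):
            errors.append(f"{field}: missing or non-numeric")
        elif not lo <= val <= hi:
            errors.append(f"{field}: {val} outside [{lo}, {hi}]")
    if rec.get("stance_label") not in {"Hawkish", "Neutral", "Dovish"}:
        errors.append("stance_label: invalid label")
    return errors
```

Records that fail any check are rejected before being written to disk, which is what keeps the final dataset schema-clean.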
To cross-check the LLM outputs, two lightweight, model-free indicators are computed:
TF-IDF and cosine checks. For each theme, we build TF-IDF vectors [24]:
$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)},$$
where $\mathrm{tf}(t,d)$ is the frequency of term $t$ in document $d$, $N$ the number of documents, and $\mathrm{df}(t)$ the number of documents containing $t$. Pairwise cosine similarity within and between stance groups is then evaluated as an internal diagnostic (used later in validation).
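These checks require only a basic TF-IDF implementation; a self-contained sketch (whitespace tokenization for brevity):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors: tfidf(t, d) = tf(t, d) * log(N / df(t))."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vecs

def cosine_sparse(a, b):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Documents sharing distinctive vocabulary score higher than unrelated ones, which is the property the within-/between-stance diagnostic relies on.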
Hawkish/Dovish signal counts. Let $H$ and $D$ be counts of curated, policy-specific n-grams (e.g., “further tightening,” “accommodative stance”). The normalized signal index is defined as:
$$\mathrm{HSI} = \frac{H - D}{H + D + \varepsilon},$$
with $\varepsilon > 0$ to avoid division by zero. HSI is not used to set the stance but serves as a sanity check for directionality. The stance label is a deterministic mapping of the continuous score $s$: Hawkish if $s \ge +0.5$; Neutral if $-0.5 < s < +0.5$; Dovish if $s \le -0.5$.
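The signal index and the deterministic label mapping can be sketched as follows (the n-gram lists here are tiny illustrative stand-ins for the curated lexicons):

```python
def hawkish_signal_index(text, hawkish_terms, dovish_terms, eps=1e-6):
    """Normalized directional index in [-1, 1]: (H - D) / (H + D + eps)."""
    t = text.lower()
    H = sum(t.count(p) for p in hawkish_terms)
    D = sum(t.count(p) for p in dovish_terms)
    return (H - D) / (H + D + eps)

def stance_label(score, hi=0.5, lo=-0.5):
    """Deterministic mapping from the continuous -2..+2 score to a label."""
    if score >= hi:
        return "Hawkish"
    if score <= lo:
        return "Dovish"
    return "Neutral"
```

Because the label is a pure function of the score and fixed cutoffs, it can be recomputed and audited independently of the LLM.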
Model-reported confidence is augmented by self-consistency. The retrieval agent returns a fixed set of top-$k$ chunks; the reasoning agent is then executed $n$ times on the same retrieved context under low-temperature decoding (e.g., 0.2) to reduce gratuitous variability while preserving enough stochasticity to reveal instability. Denote by $\ell^{*}$ the modal label returned across these rerolls (Hawkish, Neutral, or Dovish) and let $k_{\ell^{*}}$ be the count of runs that agree with $\ell^{*}$. The statistics
$$\mathrm{consistency} = \frac{k_{\ell^{*}}}{n}, \qquad \mathrm{uncertainty} = 1 - \frac{k_{\ell^{*}}}{n}$$
are both exposed in the dataset as analysis-confidence and analysis-uncertainty metrics. The final label is the plurality (or majority) outcome; ties are broken by (i) choosing the label with the higher average stance score magnitude and, if still tied, (ii) selecting the label with the higher model-reported confidence. The final record stores $(\ell^{*}, \mathrm{consistency}, \mathrm{uncertainty})$ together with the continuous score.
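The reroll aggregation described above can be sketched as follows (a minimal version; `runs` would hold the n schema-validated outputs of the reasoning agent):

```python
from collections import Counter

def aggregate_runs(runs):
    """Aggregate n reroll outputs (label, score, confidence) into the
    modal label with consistency/uncertainty; ties are broken by average
    |score| and then by average model-reported confidence."""
    labels = Counter(r["label"] for r in runs)
    top = max(labels.values())
    tied = [lbl for lbl, c in labels.items() if c == top]
    if len(tied) > 1:
        def key(lbl):
            sub = [r for r in runs if r["label"] == lbl]
            avg_abs = sum(abs(r["score"]) for r in sub) / len(sub)
            avg_conf = sum(r["confidence"] for r in sub) / len(sub)
            return (avg_abs, avg_conf)
        winner = max(tied, key=key)
    else:
        winner = tied[0]
    k, n = labels[winner], len(runs)
    return {"label": winner, "consistency": k / n, "uncertainty": 1 - k / n}
```
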
For each report, the following is logged: model/version, prompt template hash, retrieval parameters, token counts, runtime, and FAISS index version. Every LLM output includes chunk-level citations back to the PDF coordinates (doc, page, character offsets), enabling one-click audit of claims. The final artifact is a single row per MPR, with the fields enumerated in Appendix A.1 and Appendix A.2. This structured dataset supports descriptive analytics, event-study linkage, and forecasting overlays while preserving full provenance to the underlying text (see Appendix B for an output example).
The GPT-4o model is used as the reasoning component because, in pilot runs, it offered the best combination of instruction following, schema adherence (JSON validity), and factual grounding under retrieval, while keeping token-level costs and latency sufficiently low for near-real-time monitoring. Lighter alternatives (e.g., “mini”) reduced unit costs but increased edit rate and degraded semantic/thematic attribution under identical prompts; “turbo-class” variants improved speed but did not consistently match GPT-4o’s reliability in schema-constrained outputs.
Given that the pipeline emphasizes auditable, low-variance outputs with explicit confidence/uncertainty, GPT-4o provided the most stable trade-off.
Section 4’s validation metrics (semantic separation, theory-concordant correlations, and bootstrap stability) were computed on GPT-4o outputs.
In the baseline run, the average end-to-end runtime per report was 18.6 s, and the average API cost per report was ≈USD 0.037 (3.7¢). Processing the whole corpus of 26 MPRs (80 pages on average each) required ≈USD 0.97 and ≈8.06 min in total, enabling low-cost, transparent monitoring. Cost per report is computed as:
$$\mathrm{Cost} = p_{\mathrm{in}} \cdot T_{\mathrm{in}} + p_{\mathrm{out}} \cdot T_{\mathrm{out}},$$
where $T_{\mathrm{in}}$ and $T_{\mathrm{out}}$ are input and output token counts and $p_{\mathrm{in}}$ and $p_{\mathrm{out}}$ the corresponding per-token prices, all logged at execution time. These figures are audited against the provider dashboard (snapshot dated July 2025). Parallelization can further reduce wall-clock time; total cost is insensitive to concurrency.
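The per-report cost formula reduces to one line of arithmetic; the prices below are placeholders for illustration, not quoted API rates:

```python
def report_cost(tokens_in: int, tokens_out: int,
                price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost = input tokens * input price + output tokens * output price,
    with prices expressed in USD per million tokens."""
    return (tokens_in * price_in_per_m
            + tokens_out * price_out_per_m) / 1_000_000
```
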
Model Selection
To evaluate the robustness of the proposed stance measurements across different model architectures, we conducted a pilot study comparing GPT-4o (proprietary), Llama 3.1–70B (open-source), and Mistral-Large-2 (open-source). Using identical retrieval parameters and prompts on a representative sample of five MPRs spanning distinct policy regimes, we assessed: (i) stance score correlation with expert benchmarks, (ii) schema adherence, (iii) semantic consistency in thematic summaries, and (iv) computational cost and latency trade-offs.
The results are summarized in
Table 1. GPT-4o consistently outperformed the alternatives, achieving the highest correlation with expert annotations (ρ = 0.89) and full schema validity. Llama 3.1–70B provided competitive stance scoring (ρ = 0.82) but exhibited schema instability (12% invalid JSON outputs in baseline runs), necessitating additional prompt engineering. Mistral-Large-2 performed well in thematic extraction (≈85% qualitative similarity) but produced stance scores with higher variance (σ = 1.12) and longer inference times.
From a cost perspective, open-source models substantially reduce marginal API expenses but demand greater computational resources (3–5× longer inference on consumer-grade GPUs) and more extensive prompt adjustments. For a corpus of 26 reports subject to strict schema requirements and validation standards, GPT-4o offered the best balance of accuracy, reliability, and development efficiency, with a total cost of roughly USD 0.97.
Finally, the strong correlation between GPT-4o and Llama 3.1–70B (ρ = 0.82) suggests that the central findings (predominant neutrality, regime-specific deviations, and theory-consistent correlations) are robust across model types. Nevertheless, as instruction-following capabilities in open-source models continue to improve, fine-tuned versions of Llama or Mistral may provide a viable alternative for larger-scale or real-time applications.
4. Validation Report
This section establishes the reliability and academic rigor of the measurement pipeline using a four-dimensional framework that addresses core concerns in text-as-data research: construct validity, external (theory) coherence, sampling stability, and content quality [5,6]. A multitrait–multimethod logic is followed to test discriminant/convergent properties of the stance construct [28], align quantitative relations with monetary-theory benchmarks [7,8], examine bootstrap stability [9], and evaluate linguistic quality using standard IR/CL metrics [24].
It is tested whether texts assigned to similar monetary stances (hawkish/neutral/dovish) are also semantically similar (and dissimilar to other stances), conditional on topic (Figure 2). For each theme $t$, TF-IDF vectors are built for each term–document pair over the theme corpus (Equation (4)), and cosine similarity is computed for each document pair $(d_i, d_j)$:
$$\cos(d_i, d_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert\, \lVert v_j \rVert}.$$
Let $S^{\mathrm{within}}_t$ denote the mean similarity among documents sharing the same stance within theme $t$, and $S^{\mathrm{between}}_t$ the mean across different stances. The consistency ratio is:
$$\mathrm{CR}_t = \frac{S^{\mathrm{within}}_t}{S^{\mathrm{between}}_t}.$$
Clear discriminant validity is found: key policy signals exhibit the strongest separation, forward guidance and inflation show meaningful separation, while employment and interest rates are moderate, reflecting shared terminology across stances. Within-group similarities consistently exceed between-group similarities, consistent with a coherent stance construct given topic conditioning [28].
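The consistency ratio is straightforward to compute from a pairwise similarity matrix and stance labels; a minimal sketch:

```python
def consistency_ratio(sims, stances):
    """CR = mean within-stance similarity / mean between-stance similarity.
    `sims[i][j]` is the pairwise similarity; `stances[i]` labels doc i."""
    within, between = [], []
    n = len(stances)
    for i in range(n):
        for j in range(i + 1, n):
            (within if stances[i] == stances[j] else between).append(sims[i][j])
    return (sum(within) / len(within)) / (sum(between) / len(between))
```

Values well above 1 indicate that same-stance documents cluster together relative to cross-stance pairs, which is the discriminant-validity property tested here.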
Likewise, it is tested whether observed correlations match the sign and magnitude intervals implied by canonical monetary-policy rules [7] and modern analyses [3,8]. For each key pair of measures $(i, j)$ (e.g., stance with inflation concern, stance with growth concern), an expected interval $[\rho^{-}_{ij}, \rho^{+}_{ij}]$ is defined ex ante. The Numerical Consistency Score (NCS) is the share of tested pairs whose empirical correlations $\hat{\rho}_{ij}$ fall within their theory range:
$$\mathrm{NCS} = \frac{1}{|P|} \sum_{(i,j) \in P} \mathbf{1}\!\left\{\hat{\rho}_{ij} \in [\rho^{-}_{ij}, \rho^{+}_{ij}]\right\}.$$
An NCS of 0.800 means that 80% of tested relationships meet theoretical expectations. Notably, the correlation between stance and inflation concern lies squarely in range, consistent with inflation-targeting logic [29].
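Computing NCS is a simple interval-membership count; a sketch with hypothetical pair names and intervals (not the paper’s actual specification):

```python
def numerical_consistency_score(observed, expected):
    """Share of correlation pairs whose observed value falls inside
    the theory-implied interval [lo, hi]."""
    hits = sum(lo <= observed[pair] <= hi
               for pair, (lo, hi) in expected.items())
    return hits / len(expected)
```
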
Now, to assess sensitivity to sample composition, we run bootstrap resampling with $B$ replicates. For each metric $M$ (e.g., confidence, inflation concern), we compute the bootstrap mean $\bar{M}^{*}$ and standard deviation $s^{*}_{M}$, then the coefficient of variation and a bounded stability score:
$$\mathrm{CV}_M = \frac{s^{*}_{M}}{\lvert \bar{M}^{*} \rvert}, \qquad \mathrm{BSS}_M = \max\!\left(0,\, 1 - \mathrm{CV}_M\right).$$
Most metrics are highly stable: confidence, inflation concern, employment strength, and growth concern all show low coefficients of variation across replicates. The stance score shows a large coefficient of variation, which is interpreted as genuine policy volatility rather than measurement error, consistent with regime shifts in U.S. monetary policy [10].
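The bootstrap stability score can be sketched as follows (pure Python; a fixed seed keeps the check reproducible):

```python
import random

def bootstrap_stability(values, B=1000, seed=0):
    """Resample with replacement, compute the bootstrap-mean distribution,
    and return the bounded stability score max(0, 1 - CV)."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(B):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    mu = sum(means) / B
    var = sum((m - mu) ** 2 for m in means) / B
    cv = (var ** 0.5) / abs(mu) if mu else float("inf")
    return max(0.0, 1.0 - cv)
```

A nearly constant metric yields a score close to 1, while a series that oscillates around zero collapses toward 0, matching the interpretation of the stance score as genuinely variable.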
Following IR/CL standards, we summarize three dimensions (length, lexical variety, and entropy) into a composite quality score by theme. Let $\mu_L$ and $\sigma_L$ be the mean and standard deviation of chunk length, $V$ the vocabulary size, and $p(w)$ the empirical probability of token $w$.
The Length Consistency (LC) is:
$$\mathrm{LC} = \max\!\left(0,\, 1 - \frac{\sigma_L}{\mu_L}\right).$$
LD defines the lexical diversity (type–token ratio):
$$\mathrm{LD} = \frac{V}{N_{\mathrm{tokens}}}.$$
The information content is normalized by entropy:
$$\mathrm{IC} = \frac{-\sum_{w} p(w) \log p(w)}{\log V}.$$
Finally, the per-theme composite is:
$$\mathrm{QS}_t = \frac{\mathrm{LC} + \mathrm{LD} + \mathrm{IC}}{3}.$$
Key policy signals attain the highest quality (QS = 0.771), followed by Forward Guidance (0.743). Employment, Interest Rates, and Inflation are moderate (0.541–0.608), yielding an overall average CQS = 0.647, indicative of concise, information-rich summaries where the FED’s messaging is most explicit [30].
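A sketch of the composite, under assumed functional forms (LC as one minus the length coefficient of variation, LD as the type–token ratio, IC as entropy normalized by log V; the paper’s exact definitions may differ):

```python
import math
from collections import Counter

def content_quality(chunks):
    """Composite QS = (LC + LD + IC) / 3 under the assumed definitions:
    LC = max(0, 1 - sigma/mu) of chunk lengths, LD = type-token ratio,
    IC = token entropy normalized by log(vocabulary size)."""
    lengths = [len(c) for c in chunks]
    mu = sum(lengths) / len(lengths)
    sigma = (sum((l - mu) ** 2 for l in lengths) / len(lengths)) ** 0.5
    lc = max(0.0, 1.0 - sigma / mu)
    tokens = [t for c in chunks for t in c.lower().split()]
    counts = Counter(tokens)
    ld = len(counts) / len(tokens)
    probs = [n / len(tokens) for n in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    ic = entropy / math.log(len(counts)) if len(counts) > 1 else 0.0
    return (lc + ld + ic) / 3
```
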
Finally, the four dimensions are aggregated with equal weights to form the Overall Validation Score (OVS):
$$\mathrm{OVS} = \frac{\mathrm{SCS} + \mathrm{NCS} + \mathrm{BSS} + \mathrm{CQS}}{4}.$$
The semantic-consistency dimension is normalized to SCS = 1.0 (scale-setter), with NCS = 0.800, BSS = 0.770 (average across metrics), and CQS = 0.647. The resulting OVS = 0.796 is graded as B+ (Good). This level indicates robust, publication-grade measurement that (i) separates stances semantically, (ii) comports with a theory-grounded correlation structure, (iii) is stable for most reported metrics, and (iv) is communicated with sufficient linguistic quality for auditability.
All thresholds (e.g., stance cutoffs, retrieval, MMR) were specified ex ante and held fixed across documents; no parameter was tuned to maximize validation scores. Random seeds, prompt templates, model versions, and FAISS index IDs are logged to enable exact reruns. Validation computations are implemented in a separate script so that the analytical and reporting layers are decoupled and auditable [15].
5. Hawkish or Dovish
Now that the pipeline is set up and validated, Figure 3 traces the stance time series from 2013 to 2025 and reveals a straightforward three-phase narrative: a post-crisis accommodation period (2013–2015), a gradual normalization (2016–2020), and a post-pandemic tightening cycle (2021–2025). The distribution of labels (Neutral 50.0%, Dovish 26.9%, Hawkish 23.1%), together with an average stance close to zero (≈0.02) and volatility around 0.87, indicates that, over long horizons, the FED’s written communications cluster near neutrality with intermittent but decisive excursions during regime shifts.
The finding that 50% of the observations fall into the Neutral category requires careful interpretation. This outcome may reflect either (i) genuine policy equilibrium during periods of balanced risks, or (ii) conservative thresholds in the stance-label mapping. We examine this issue from several perspectives.
First, the continuous stance scores reveal considerable variation (σ ≈ 0.866), with values ranging from −1.8 to +1.5, indicating that the Neutral label encompasses a heterogeneous group of communications. Under this framework, approximately 30% of reports cluster tightly around zero (±0.2), representing truly centrist positions, while the remaining 20% lie between −0.5 and −0.2 or between +0.2 and +0.5, reflecting mild leanings without strong commitments. This pattern is consistent with the Federal Reserve’s well-documented tendency toward gradualism and data-dependent guidance, where conditional signaling is often favored over categorical policy declarations [31,32,33].
Second, we conducted a threshold sensitivity analysis by relaxing the cutoffs to +0.4 and −0.4, compared with the baseline values of +0.5 and −0.5. Under these alternative thresholds, the distribution becomes more balanced (Neutral 38.5%, Dovish 30.8%, Hawkish 30.7%). However, this reclassification increases label instability across bootstrap resamples (average confidence falling from 0.90 to 0.76) and reduces semantic consistency ratios (CR for Key Policy Signals declining from 2.358 to 1.821). These results suggest that the baseline thresholds strike a more reliable balance between classification stability and directional discrimination, favoring conservative labeling to avoid false positives.
Third, we validated the economic content of the Neutral category by analyzing thematic profiles. Reports labeled Neutral are characterized by higher levels of policy uncertainty (normalized intensity 0.68 vs. 0.42 for Hawkish and 0.51 for Dovish), more balanced attention to inflation and growth, and frequent use of conditional language (“if incoming data,” “depending on developments”). Such features point to intentional ambiguity, consistent with the literature that highlights central banks’ use of vague communication to maintain flexibility when the macroeconomic outlook is uncertain [2,33,34].
Finally, comparison with the Hawkish Signal Index (HSI, Equation (5)) shows that 85% of Neutral reports have |HSI| < 0.3, reinforcing the view that these texts convey balanced or mixed signals. The remaining 15%, those with higher |HSI| but Neutral labels, typically reflect offsetting signals across themes (e.g., hawkish on inflation but dovish on employment). In these cases, the GPT-4o reasoning agent correctly synthesizes the signals into an overall centrist stance, underscoring the value of contextual, multi-factor assessment over one-dimensional keyword methods.
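The threshold sensitivity exercise above is easy to replicate on any stance-score series; a sketch with illustrative scores (not the paper’s actual series):

```python
def label_distribution(scores, hi=0.5, lo=-0.5):
    """Share of Hawkish / Neutral / Dovish labels under given cutoffs,
    used to probe how sensitive the Neutral share is to the thresholds."""
    n = len(scores)
    hawk = sum(s >= hi for s in scores) / n
    dove = sum(s <= lo for s in scores) / n
    return {"Hawkish": hawk, "Dovish": dove, "Neutral": 1 - hawk - dove}
```

Comparing `label_distribution(scores)` against `label_distribution(scores, hi=0.4, lo=-0.4)` reproduces the kind of Neutral-share shift reported above.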
In this sense, this pattern accords with the communication literature’s emphasis on the informational content of central-bank language, beyond mechanical rate moves, and its stabilizing role for expectations [1,3,4,29]. Two turning points stand out in the figure: (i) deep dovish readings around the 2020 pandemic response and (ii) a subsequent hawkish drift peaking during the inflation-fighting phase, consistent with the institution’s state-contingent posture in adverse shocks and its later resolve to re-anchor expectations. The prevalence of Neutral observations is also consistent with the “gradualism” doctrine in FED communication, keeping guidance steady unless incoming data warrant a change [35], a doctrine that ref. [31] framed as aligning expectations without over-committing to a path.
Next,
Figure 4 introduces the Market Impact Score (MIS) to summarize how much a given communication is likely to matter for markets. The index prioritizes policy surprise, i.e., the magnitude of change in stance relative to the previous report, while recognizing the roles of uncertainty, signal strength, and analytical confidence.
As shown in
Figure 4, MIS spikes align with historically salient communications (e.g., the 2020 emergency guidance and the 2022 anti-inflation pivot). This accords with the idea that surprises, rather than levels, move prices in efficient markets [36], and with evidence that both “monetary” and “non-monetary” news in central-bank text carry incremental information [4]. The positive correlation we observe between absolute stance and MIS suggests that clearer, more decisive narratives garner more attention (a “clarity premium”), echoing survey conclusions that transparent communication improves market predictability [2]. Methodologically, MIS is complementary to, not a replacement for, the stance score: it aggregates how much a report may matter now, given surprise and uncertainty, while the stance score conveys what direction policy risk points toward.
Figure 5 decomposes the narrative into thematic intensities (inflation concern, growth concern, employment strength, and policy uncertainty), which helps interpret stance movements economically. The inflation dimension exhibits a U-shaped path: easing concerns through 2019, followed by a sharp rise in 2021–2022 as broad-based pressures emerged, consistent with the inflation-targeting logic that links tighter stances to persistent inflation deviations [29].
Employment strength trends upward through 2019, collapses in 2020, and then recovers, mirroring the dual-mandate lens emphasized by [32]. Growth concern moves inversely with employment strength, reflecting slack and demand shortfalls in downturns. Policy uncertainty peaks at regime transitions, in line with the uncertainty literature’s prediction that shifts in policy regimes or the macro environment temporarily widen beliefs [37]. Together, these components rationalize the time-series behavior of stance in Figure 3: when inflation concern dominates and employment remains robust, the system reads more hawkish; when growth concern spikes amid employment weakness, it reads more dovish; and when signals are mixed, neutrality and higher uncertainty prevail.
Figure 6 examines transitions, i.e., changes in stance between consecutive reports, and shows that significant moves (large absolute changes in stance) are clustered around crises and inflationary episodes. The asymmetry is instructive: dovish adjustments during acute stress tend to be sudden and large, whereas hawkish shifts often build over successive reports as the Committee gauges persistence in inflation pressures.
This pattern is consistent with asymmetric loss or preference functions in central banking [31] and with evidence of regime shifts in U.S. monetary policy over recent decades [10]. Notably, the presence of several moderate transitions between significant moves suggests that policy smoothing remains relevant on average [38], even if exceptional conditions trigger outsized adjustments. From a monitoring perspective, the transition chart operationalizes “watch points”: sequences of moderate hawkish steps coupled with rising inflation concern typically precede peaks in hawkish stance, whereas sharp dovish swings coincide with stress episodes and heightened uncertainty.
Finally, Figure 7 contrasts multi-dimensional signal profiles across stance categories via radar charts. Hawkish reports concentrate intensity on inflation concern and hawkish lexicon, with lower growth concern and moderate uncertainty, a profile consistent with Taylor-rule prescriptions when inflation is above target [7].
Dovish reports invert this pattern, emphasizing growth concern and dovish phrasing, often with a stronger employment focus where labor-market slack or participation dynamics motivate accommodation. Neutral reports distribute signal mass more evenly but display the highest uncertainty, reflecting subtle trade-offs and conditional guidance (“data dependence”) that stop short of directional commitments. This cross-section validates that the stance labels are not arbitrary tags but summarize coherent bundles of textual signals, which the validation framework (
Section 4) corroborates through semantic separation and theory-concordant correlations.
Two practical takeaways follow. First, the stance series and its components provide a policy dashboard that researchers can join to asset-price data for event studies or forecasting overlays; the MIS offers a ranking of which reports are most likely to matter contemporaneously. Second, the thematic breakdown clarifies why stance moves: for instance, a hawkish drift accompanied by rising inflation concern and stable employment strength is a different policy configuration from one where hawkishness coexists with falling employment strength and spiking uncertainty, the former typically signaling persistence, the latter caution. Overall, these results align with the literature that central-bank words shape expectations [
1,
2], that textual shocks have macro-relevant content [
3], and that nuanced, validated text measures can augment empirical monetary analysis in near real time, at low cost and low latency.
Early efforts to analyze central bank communication relied heavily on dictionary methods and bag-of-words representations [
12,
14]. While useful as a starting point, these approaches face clear constraints in capturing contextual nuance and semantic depth. Dictionary-based scoring assigns fixed sentiment weights to individual terms, but it cannot account for negation, conditional phrasing, or domain-specific usage. This often results in misclassification; for instance, the phrase “higher for longer” conveys guidance on persistence rather than simply a hawkish stance. In addition, manually curated word lists introduce subjective bias and limit the portability of the method across different document types or institutional settings.
By contrast, more recent techniques built on embeddings and transformer architectures [
23], combined with retrieval-augmented generation (RAG) [
27], preserve contextual sensitivity, distinguish between literal and conditional statements, and provide auditable traceability through chunk-level citations. The agent-based RAG framework we propose addresses the limitations of earlier methods by integrating dense semantic representations with explicit reasoning steps. This allows us to produce measurements that capture not only the direction of policy signals but also the confidence behind them, while maintaining full transparency back to the original text.
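The dense-retrieval step that underpins this traceability can be illustrated with a minimal sketch. This is not the paper's implementation: the toy three-dimensional vectors and the pure-Python `cosine` helper stand in for learned embeddings and a FAISS index, purely for exposition.

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, chunk_vecs, k=2):
    # Rank chunk embeddings by similarity to the query embedding and
    # return the indices of the top-k chunks, as a vector index would.
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings: three document chunks; the query is closest to chunk 0.
chunks = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.5, 0.5, 0.5]]
query = [0.9, 0.2, 0.0]
top = retrieve(query, chunks, k=2)  # indices of the two nearest chunks
```

Because retrieval returns chunk indices, every downstream inference can be traced back to specific text spans, which is the auditability property emphasized above.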
6. Analysis Results: Main Findings and Limitations
This section interprets the time-series and cross-sectional evidence produced by the pipeline and situates it within the communication-and-markets literature. The stance series in
Figure 3, with Neutral as the modal category (50.0%), Dovish at 26.9%, Hawkish at 23.1%, a mean near zero (~0.02), and volatility σ ≈ 0.87, conveys a communication regime that generally aims for stability, punctuated by decisive excursions during macro inflection points.
That configuration is consistent with the view that central-bank language is itself an instrument for shaping expectations, working alongside policy rates and balance-sheet tools [
1,
2]. In particular, the deep dovish readings around the 2020 shock and the subsequent hawkish drift, about +0.8 points in the most recent window, mirror the well-documented pivot from emergency accommodation to disinflationary resolve, aligning with evidence that the information content of communications transmits macro shocks [
3] and that both “monetary” and “non-monetary” news components in policy texts are priced by markets [
4].
What is new in this analysis is the document class and the measurement design. Most prior work emphasizes FOMC statements, minutes, press conferences, or high-frequency windows around meetings [
6,
11,
12]. We instead focus on the semiannual MPR, a longer, structured narrative to Congress that has received less stance-centered treatment but offers a relatively stable format for like-for-like measurement across time. The proposed agentic RAG architecture adds confidence/uncertainty to each observation. It is paired with a multi-pronged validation suite (
Section 3), addressing persistent concerns about construct validity, theory concordance, sampling stability, and content quality in text-as-data [
5,
6,
15]. This combination of MPR focus, auditable scoring, and explicit validation provides a stance series that is both reproducible and directly interpretable against monetary theory benchmarks.
The MIS in
Figure 4 further clarifies when communications matter most. By construction, MIS weights policy surprise (the absolute change in stance), uncertainty, signal strength, and analysis reliability. Its spikes line up with historically salient communications (emergency dovish guidance in 2020, the anti-inflation pivot in 2022), consistent with efficient-markets logic that surprises, not levels, move prices [
36] and with research that decomposes monetary announcements into target and information components [
39].
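The MIS construction described above can be sketched as a weighted combination of its four components. The weights, the normalization of the surprise term, and the example inputs below are illustrative assumptions, not the paper's calibration:

```python
def market_impact_score(stance_change, uncertainty, signal_strength,
                        reliability,
                        w_surprise=0.4, w_unc=0.2, w_sig=0.2, w_rel=0.2):
    """Toy MIS: weights policy surprise (|change in stance|), uncertainty,
    signal strength, and analysis reliability. Components other than
    stance_change are assumed pre-scaled to [0, 1]; stance_change lives
    on the [-2, +2] stance scale, so |delta| is normalized by 4."""
    surprise = min(abs(stance_change) / 4.0, 1.0)
    return (w_surprise * surprise + w_unc * uncertainty
            + w_sig * signal_strength + w_rel * reliability)

# A large dovish swing with high uncertainty scores higher than a
# quiet, confident near-neutral report.
crisis = market_impact_score(-1.6, 0.8, 0.9, 0.9)
calm = market_impact_score(0.1, 0.2, 0.4, 0.9)
```

The key design point survives any re-weighting: the surprise term depends on the *change* in stance, not its level, which is what aligns MIS spikes with salient communications rather than with persistently hawkish or dovish periods.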
The empirical co-movement between the stance series and MIS suggests a clarity premium: more decisive narratives draw greater market attention, echoing survey conclusions that higher transparency aids predictability [
2]. While MIS is not a substitute for stance, the pair jointly distinguishes direction (stance) from salience (impact), a separation that is often blurred in dictionary-based tone indices.
The thematic decomposition in
Figure 5 aligns stance fluctuations with the FED’s dual-mandate framework. Inflation concern follows a U-shape, easing through 2019, surging in 2021–2022, consistent with inflation-targeting logic [
8,
20,
29]. Employment strength rises steadily pre-COVID, collapses in 2020, and recovers thereafter; growth concern moves inversely, mirroring slack and demand shortfalls in downturns.
Policy uncertainty peaks near transitions, consistent with the broader uncertainty literature [
35] and with the idea that regime changes temporarily widen belief distributions. These components rationalize the stance dynamics in
Figure 3: hawkish readings co-occur when inflation concern is elevated, and the labor market remains resilient; dovish readings emerge when growth concern rises, and employment weakens; neutral readings dominate when signals are mixed, often alongside higher uncertainty and conditional, data-dependent guidance. Importantly, this internal logic is validated externally by the Numerical Consistency Score (NCS = 0.800) in
Section 4, which checks that observed correlations conform to theory-implied intervals [
7,
8].
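The numerical-consistency check can be sketched as the share of observed stance–theme correlations that fall inside theory-implied intervals. The correlation values and bands below are illustrative assumptions chosen only to show the mechanics (e.g., Taylor-rule logic implies a positive stance–inflation-concern band):

```python
def numerical_consistency_score(observed, implied):
    """observed: dict pair -> correlation; implied: dict pair -> (lo, hi).
    NCS = fraction of observed correlations inside their theory band."""
    hits = sum(1 for pair, r in observed.items()
               if implied[pair][0] <= r <= implied[pair][1])
    return hits / len(observed)

observed = {
    ("stance", "inflation_concern"): 0.62,
    ("stance", "growth_concern"): -0.55,
    ("stance", "employment_strength"): 0.41,
    ("stance", "uncertainty"): 0.15,   # falls outside its assumed band
    ("stance", "dovish_lexicon"): -0.70,
}
implied = {
    ("stance", "inflation_concern"): (0.3, 0.9),
    ("stance", "growth_concern"): (-0.9, -0.2),
    ("stance", "employment_strength"): (0.1, 0.8),
    ("stance", "uncertainty"): (-0.6, 0.0),
    ("stance", "dovish_lexicon"): (-0.95, -0.4),
}
ncs = numerical_consistency_score(observed, implied)  # 4 of 5 in band
```

In this toy configuration four of five correlations land in their bands, giving NCS = 0.8; the paper's reported NCS = 0.800 is computed over its own correlation set and intervals.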
The transition analysis in
Figure 6 adds a dynamic layer. Large swings cluster around crises and inflationary episodes; dovish adjustments appear sudden and sizable during acute stress, whereas hawkish shifts often build cumulatively over successive reports as persistence is assessed. This asymmetry is consistent with models of asymmetric central-bank loss functions [
31] and with evidence of regime variation in U.S. monetary policy [
10]. The presence of several moderate moves between large steps illustrates policy smoothing on average [
38]. For monitoring, such patterns operationalize “watch points”: sequences of modest hawkish steps paired with rising inflation concern often precede hawkish peaks; sharp dovish swings co-occur with stress episodes and heightened uncertainty.
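The transition classification can be sketched as thresholding the change between consecutive stance scores. The 0.5/1.0 cutoffs and the stylized series below are assumptions for illustration, not the paper's calibration:

```python
def classify_transitions(stances, moderate=0.5, significant=1.0):
    """Label the change between consecutive stance scores as
    minor, moderate, or significant by |delta| thresholds."""
    out = []
    for prev, curr in zip(stances, stances[1:]):
        delta = curr - prev
        if abs(delta) >= significant:
            label = "significant"
        elif abs(delta) >= moderate:
            label = "moderate"
        else:
            label = "minor"
        out.append((round(delta, 2), label))
    return out

# Stylized series: an abrupt dovish swing (2020-like) followed by a
# cumulative hawkish build-up over successive reports.
series = [0.2, -1.5, -1.2, -0.6, 0.1, 0.8]
transitions = classify_transitions(series)
```

Applied to the stylized series, the classifier reproduces the asymmetry discussed above: one large dovish jump, then a run of moderate hawkish steps.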
Relative to dictionary or bag-of-words approaches [
12,
14], the agentic RAG system captures contextual nuance, for example, distinguishing “higher for longer” as persistence guidance rather than simply classifying “higher” as hawkish. It also audits each inference to specific text spans, addressing reproducibility critiques in automated content analysis. The validation results (
Section 4), semantic separation (high CR in Key Policy Signals), bootstrap stability for most metrics, and quality scores that are highest precisely where policy language is most explicit, reinforce that the pipeline measures a coherent construct rather than artifacts of prompt phrasing. Moreover, attaching confidence and uncertainty to every observation provides a principled way to down-weight low-reliability readings in downstream empirical work (e.g., weighting in event studies or state-space filters).
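One simple way to use the attached reliability in downstream work is a confidence-weighted average. A minimal sketch, assuming confidence values are scaled to [0, 1] (the scores and weights below are invented for illustration):

```python
def weighted_stance(scores, confidences):
    """Confidence-weighted mean stance: low-reliability readings
    contribute proportionally less to the aggregate."""
    total = sum(confidences)
    return sum(s * c for s, c in zip(scores, confidences)) / total

scores = [1.0, 0.2, -0.4]
confidences = [0.9, 0.3, 0.6]      # the middle reading is least reliable
wm = weighted_stance(scores, confidences)
naive = sum(scores) / len(scores)  # unweighted mean for comparison
```

The same weights can serve as observation variances in a state-space filter or as regression weights in an event study, which is the down-weighting use case suggested above.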
These findings have practical implications. First, the stance index and its components can be joined to yield-curve and equity data to study market responses to narrative variation beyond rate surprises, complementing high-frequency identification of announcement shocks [
11,
31,
37]. Second, the MIS can inform portfolio nowcasting by flagging which report vintages merit closer scrutiny when allocating attention and risk. Third, the uncertainty trace, peaking at transitions, connects naturally to policy-uncertainty measures [
33] and encourages explicit modeling of uncertainty channels in term-premium or macro-finance frameworks. Finally, by structuring outputs at the theme level (inflation, rates, employment, guidance, key signals), the dataset supports interpretability-first applications (e.g., “which arguments drive hawkishness?”), complementing black-box sentiment scores.
Two stylized facts emerge that have not been documented for the MPR corpus specifically. First, stance asymmetry across regimes: dovish responses are lumpy and front-loaded during stress, whereas hawkishness often accumulates as inflation persistence is established. Second, uncertainty as a leading co-indicator of stance change: peaks in the text-derived uncertainty metric frequently coincide with or anticipate categorical stance transitions, reinforcing the view that communication transitions involve “narrative re-anchoring” before policy levels fully adjust. Both facts are consistent with, but not implied by, prior work focused on short statements or press conferences; they arise from sustained, long-form communication in the MPR and the granular theme scoring we implement.
Several caveats temper interpretation. The MPR is published twice a year, which restricts the proposed series to medium-frequency observations (26 reports over roughly 12.5 years) rather than meeting-by-meeting dynamics. Although this frequency is sufficient to capture regime shifts and sustained changes in policy orientation, as shown by the identification of the dovish turn in 2020 and the hawkish drift in 2022, it inevitably leaves out intra-period volatility and short-term tactical adjustments that are communicated through FOMC statements, press conferences, or speeches. For applications that demand near-real-time monitoring or high-frequency event studies, the framework would need to be extended to these more frequent sources.
At the same time, the biannual cadence provides distinct analytical advantages. Each MPR represents the Committee’s consolidated view over a six-month horizon, thereby filtering out temporary noise and highlighting more persistent policy concerns. This makes the reports particularly well suited for tracking medium-term stance evolution and for linking narrative shifts to macroeconomic developments that unfold across quarters rather than days. The trade-off between frequency and comprehensiveness is inherent in document selection. The results obtained show that, even at a biannual frequency, validated text-based measures from MPRs capture economically meaningful variation consistent with theoretical benchmarks and documented policy transitions. If finer temporal resolution is required, the framework can be applied to FOMC statements (eight meetings per year), which represent a natural next step for extending coverage without compromising measurement rigor. The proposed measures in this investigation are descriptive, not causal. Any claims about price effects require linking to high-frequency asset moves and controlling for confounders, as emphasized by the identification literature [
27,
40]. Model output may be sensitive to prompt or model updates, though the proposed logging and validation routines mitigate this risk and make changes transparent. Finally, the U.S. institutional context matters; portability to other central banks is promising [
12,
16] but should not be assumed without re-validation.
In sum, the analysis confirms and extends three pillars of literature. First, it reaffirms that central-bank words matter for expectations and risk pricing [
1,
2]. Second, it shows that validated, auditable text measures can recover theory-concordant structure from long-form policy documents, not just short statements [
3,
4]. Third, it contributes new evidence on regime asymmetries, uncertainty dynamics, and impact salience in the MPR setting, delivered at low cost and latency through an agentic RAG pipeline with explicit reliability metrics. These features, together with the transparent validation of
Section 4, make the resulting stance series a credible input for policy monitoring dashboards, event-study classifications, and macro-finance models that incorporate communication as a state variable.
7. Conclusions
This paper set out to solve a concrete research problem: to measure, consistently and quickly, the hawkish–dovish stance embedded in the Federal Reserve’s semiannual Monetary Policy Reports. The motivation is well established: central-bank communication itself shapes expectations and asset prices beyond the mechanical effect of rate moves [
1,
2], but the MPR has received comparatively less stance-focused, long-form treatment. The question was whether one can build an auditable, low-latency system that turns the MPR’s narrative into quantitative stance indicators with transparent uncertainty and validation.
The objective was to develop a near-real-time NLP pipeline, embedding-based retrieval (FAISS) orchestrated with LangChain and scored by GPT-4o, that (i) produces a continuous stance score in [−2, +2] with a mapped categorical label (Dovish/Neutral/Hawkish), (ii) extracts theme-level rationales (inflation, interest rates, employment, forward guidance, key signals), and (iii) attaches confidence/uncertainty to every observation. We hypothesized that an agentic RAG design would recover stable, theory-consistent measures that track policy transitions and deliver usable decision signals at low cost and low latency. The evidence supports this hypothesis: the stance series exhibits economically sensible dynamics; the correlation structure aligns with canonical monetary theory; and the validation suite indicates good reliability.
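The mapping from the continuous score to the categorical label can be sketched as symmetric thresholding. The ±0.5 cutoffs are an assumption for illustration; the paper does not state its exact thresholds here:

```python
def stance_label(score, cutoff=0.5):
    """Map a stance score in [-2, +2] to Dovish/Neutral/Hawkish
    using symmetric cutoffs around zero (cutoffs are assumed)."""
    if not -2.0 <= score <= 2.0:
        raise ValueError("stance score must lie in [-2, +2]")
    if score <= -cutoff:
        return "Dovish"
    if score >= cutoff:
        return "Hawkish"
    return "Neutral"

labels = [stance_label(s) for s in (-1.3, 0.02, 0.8)]
```

Under these cutoffs a score of 0.02, close to the series mean reported below, maps to Neutral, consistent with Neutral being the modal category.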
The methodology combines layout-aware PDF parsing, chunking with provenance, dense embeddings, FAISS indexing, and two-stage agentic prompting (stance scoring followed by thematic synthesis). The dataset comprises 26 MPRs spanning 26 February 2013 to 20 June 2025, ensuring like-for-like comparisons in a stable document format. A four-dimensional validation framework of semantic consistency, numerical consistency, bootstrap stability, and content quality yields an integrated score of 0.796 (B+/Good), with NCS = 0.800 against theory-implied ranges, strong semantic separation in key themes, and high bootstrap stability for most metrics (
Section 3).
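The bootstrap-stability component of the validation framework can be sketched as resampling a report-level metric and inspecting the dispersion of its mean. The data, seed, and resample count below are toy choices using only the standard library:

```python
import random

def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean:
    resample with replacement, collect resampled means, and take
    the alpha/2 and 1 - alpha/2 percentiles."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy stance series: a narrow interval signals a stable metric.
stances = [0.2, -1.5, -0.6, 0.1, 0.8, 1.0, -0.2, 0.3]
lo, hi = bootstrap_mean_ci(stances)
```

A metric is deemed "stable" in this spirit when its bootstrap interval stays narrow relative to the stance scale; the paper applies the idea across its validation metrics rather than to the raw mean alone.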
The main findings are as follows. First, the stance distribution is Neutral 50.0%, Dovish 26.9%, and Hawkish 23.1% with a mean near zero (~0.019) and volatility σ ≈ 0.866, implying a communication regime that clusters near neutrality but allows decisive excursions at regime shifts. Second, a recent hawkish drift (~+0.8 points) is consistent with the inflation-fighting phase. Third, the thematic decomposition reconciles stance with the dual mandate: inflation concern follows a U-shape (easing pre-2020, rising in 2021–2022), employment strength collapses in 2020 and recovers, growth concern moves inversely to employment, and policy uncertainty peaks at transitions.
Fourth, transition analysis reveals an asymmetry: dovish moves during stress are abrupt and significant, whereas hawkish shifts tend to build cumulatively, consistent with asymmetric central-bank preferences [
31] and regime-variation evidence [
10]. Finally, the Market Impact Score helps to rank report vintages by prospective salience, highlighting that surprises (changes in stance), not levels, are most likely to be consequential for markets [
36,
39].
Relative to the literature, these results both confirm and extend prior findings. They reaffirm that words matter [
1,
2] and that textual signals transmit macro information [
3,
4]. They extend the evidence to the MPR (a long-form, structured report) using a pipeline with confidence/uncertainty and explicit diagnostics. Two stylized facts are the regime asymmetry of stance transitions and the role of uncertainty as a leading co-indicator around stance changes.
There are, however, limitations. The MPR is biannual, so the series has medium frequency and cannot capture meeting-to-meeting dynamics. The proposed measures are descriptive, not causal; price effects require high-frequency identification and careful controls [
33,
38]. Outputs may be sensitive to prompt/model updates, though we mitigate this via strict logging, versioning, and ex-ante parameter choices. External validity beyond the United States is promising but requires re-validation given institutional differences.
Future research should: (i) extend coverage to FOMC statements, minutes, press conferences, and speeches to increase frequency and triangulate signals; (ii) link stance and MIS to high-frequency asset-price moves and expectations (yields, OIS, options) for causal validation; (iii) compare open-source vs. proprietary models and ablate retrieval/scoring choices; (iv) integrate the stance series into macro-finance models (e.g., term-premium decompositions) and forecasting overlays; and (v) formalize guardrails for LLM measurement error and domain drift, following best practices in text-as-data.
In closing, the paper’s principal contribution is a validated, low-cost (around 1 USD for all 26 reports, averaging 80 pages each, including the embedding/vectorization step), near-real-time framework that transforms the FED’s Monetary Policy Reports into quantitative, auditable measures of policy stance, with theme-level explanations and explicit uncertainty, which researchers and practitioners can immediately incorporate into monitoring dashboards, event-study designs, and macro-finance analyses. By focusing on the MPR and uniting agentic RAG with a multi-pronged validation suite, the study bridges modern NLP and monetary economics, providing a reliable, extensible measure of the FED’s narrative stance that complements existing indicators and opens new avenues for empirical work.