1. Introduction
This study tackles a central problem in monetary economics: measuring, quickly and consistently, the hawkish/dovish stance embedded in the Federal Reserve’s biannual Monetary Policy Reports (MPRs). Markets react not only to policy actions but also to central-bank language, so timely and credible stance measures can improve monitoring, risk management, and empirical research [1,2,3,4]. The paper therefore asks whether a transparent, low-cost NLP workflow can transform unstructured MPR text into reproducible stance scores and thematically organized insights suitable for academic and policy use.
The objective is to build a near-real-time, auditable pipeline that converts the FED’s MPRs into quantitative stance scores on a −2 (dovish) to +2 (hawkish) scale, together with confidence/uncertainty metrics and thematic rationales (inflation, interest rates, employment, forward guidance, key policy signals). The hypothesis is that an agentic RAG architecture (document parsing and chunking, embeddings, FAISS orchestrated by LangChain, and GPT-4o reasoning) can recover stable and theory-consistent stance measures at very low latency and cost, and that these measures will track known policy transitions. The dataset comprises 26 MPRs spanning 26 February 2013 to 20 June 2025 to ensure comparability of tone and format across time.
Methodologically, PDFs are parsed and chunked, embedded, indexed in FAISS, and retrieved into a two-stage GPT-4o chain: first, the model assigns an overall stance score with confidence and uncertainty; second, it produces category-specific syntheses that surface the arguments driving the score. To ensure reliability and academic rigor, a four-dimensional validation framework that addresses prominent concerns in text-as-data research is implemented [5,6]. Semantic consistency tests use cosine-similarity consistency ratios to verify clear within-stance cohesion and between-stance separation, strongest for key policy signals. Numerical consistency aligns observed correlations with ranges implied by the Taylor-rule tradition and modern monetary analysis [7,8], yielding a Numerical Consistency Score (NCS) of 0.800. Bootstrap stability [9] shows high stability for most metrics while leaving the stance score appropriately variable, reflecting genuine policy dynamics rather than noise, consistent with regime-variation evidence [10]. Content-quality diagnostics based on length consistency, lexical diversity, and information content produce an average CQS of 0.647. The integrated validation score is 0.796 (B+/Good), supporting publication-grade measurement standards.
The main findings indicate a predominant Neutral distribution (50.0%) with Dovish (26.9%) and Hawkish (23.1%) shares; the average stance is close to zero (0.019), the volatility is σ ≈ 0.866, and we detect a recent hawkish drift of about +0.8 points in the latest MPR window. These patterns are consistent with the literature documenting that central-bank communication contains information distinct from rate moves and that tone can shape expectations and asset prices.
Conclusions emphasize that an embedding-based, agent-orchestrated RAG system with GPT-4o can deliver transparent, scalable, and low-cost measurement of FED stance. Relative to dictionary methods, the agentic workflow captures context and nuance while remaining auditable and fast enough for policy tracking, transition detection, and downstream empirical work (e.g., event studies and forecasting overlays). Limitations include the biannual frequency of MPRs, sensitivity to prompts and model updates, and the descriptive (non-causal) nature of stance scores. Future research should extend coverage to Federal Open Market Committee (FOMC) statements, minutes, press conferences, and speeches; link stance shocks to high-frequency asset-price moves for external validation; compare open-source and proprietary models; and formalize guardrails to mitigate Large Language Model (LLM) measurement error.
For orientation, the paper is structured as follows:
Section 2 motivates the problem and reviews the literature;
Section 3 describes the data and preprocessing and details the agentic RAG architecture;
Section 4 presents the validation framework and results;
Section 5 deals with Hawkish or Dovish signals;
Section 6 reports main empirical findings and discusses limitations; and finally,
Section 7 concludes.
The main contribution of the study is that, by targeting the biannual MPRs and combining agentic RAG with a formal, multi-pronged validation suite, it provides a credible, timely, and reproducible measure of the FED’s hawkish/dovish narrative that practitioners can deploy immediately and scholars can audit and extend.
2. Literature Review
Extensive literature shows that central-bank communication is itself a policy instrument that shapes expectations and asset prices and contains independent information beyond rate surprises [1]. Comprehensive surveys conclude that communication enhances market understanding and policy effectiveness, even as optimal strategies vary across institutions [2]. These foundations motivate systematic measurement of the stance embedded in the Federal Reserve’s written reports.
Subsequent work decomposes policy news into finer components. Reference [3] shows that language in central-bank communication transmits macroeconomic shocks, while [4] separates “monetary” from “non-monetary” news and documents that information conveyed in words and guidance matters alongside rate moves [11]. Together, these papers justify stance-oriented text measurement that can be linked to theoretical benchmarks and market reactions.
Within text-as-data methods, early approaches relied on hand-curated dictionaries. These include supervised scoring applied to FOMC statements [12], as well as real-time analyses of market responses to central-bank words [13]. Central-bank practitioners also developed guidance on text mining for policy analysis [14]. These efforts established feasibility but faced trade-offs in nuance, coverage, and portability across document types.
A broader methodological canon in economics and political science highlights the importance of transparent feature representations, out-of-sample validation, and principled uncertainty quantification for measurement and inference. Reference [15] surveys representational choices, prediction targets, and the importance of external validation; Grimmer and Stewart [5,6,9] emphasize semantic, predictive, and face validity tests and warn against unvalidated dictionary methods; and reference [6] develops principled lexical scoring (“Fightin’ Words”) to identify discriminating terms. These principles underpin the proposed design choices for semantic consistency, numerical consistency, bootstrap stability, and content quality.
More recently, Large Language Models (LLMs) have expanded the frontier of central-bank communication analysis. An IMF working paper [16] fine-tunes an LLM on a multilingual, decades-long corpus to classify topic, stance, sentiment, and audience, demonstrating scalable classification at sentence level across 169 central banks. In parallel, European Central Bank (ECB) researchers show that ChatGPT-4o-derived sentiment from two pages of PMI commentary significantly improves euro-area GDP nowcasts, underscoring the value of narrative signals even from small text snippets [17]. These advances validate the use of modern embeddings, retrieval, and LLM scoring for policy-relevant measurement, while also motivating rigorous validation and cost/latency tracking.
Beyond text alone, multimodal work highlights that non-verbal cues and delivery matter: tone, prosody, and body language can move markets after controlling for actions and text [18], reinforcing that communication channels are multifaceted and economically meaningful. Recent empirical studies highlight that central bank communications contain much more information than what is conveyed in formal statements alone. Research on FOMC minutes and transcripts shows that committees provide significant forward-looking guidance, which financial markets quickly embed into asset prices [19]. Evidence from inflation-targeting countries further suggests that transparent communication frameworks strengthen policy credibility and help anchor expectations [20]. At the same time, developments in computational finance demonstrate that accelerated diffusion models with jump components can capture the abrupt shifts associated with changes in policy regimes [21]. These models offer a complementary, high-frequency perspective that supports the validation of stance measures derived from textual analysis.
Concerning document classes, prior studies frequently emphasize press conferences and post-meeting statements, whereas the MPR remains understudied despite its statutory role and stable format [22]. This paper addresses that gap by building a reproducible pipeline on the complete set of MPRs from 2013 to 2025, enabling like-for-like comparisons over time.
Lastly, theoretical anchors from monetary-policy norms link textual posture to macroeconomic trade-offs. References [7,8] elucidate the links between inflation, output, and interest rates within the context of policy rules. This framework serves as a basis for determining whether language reflects a hawkish or dovish stance as a systematic policy signal rather than merely a linguistic artifact.
The literature establishes that (i) central-bank words matter for expectations and asset prices; (ii) text can be decomposed into meaningful policy news; (iii) transparent, validated measurement is feasible with modern NLP; and (iv) MPRs offer a tractable, underused corpus for stance measurement linked to theoretical benchmarks and downstream empirical uses.
3. Methodology: Preprocessing, Segmentation, and Embeddings
This section describes a low-latency pipeline that converts the Federal Reserve’s biannual MPRs into quantitative measures of monetary-policy stance and structured thematic signals. The approach follows best practices in text-as-data (clear preprocessing, explicit representations, retrieval with principled similarity, and transparent model prompting) so that outputs are reproducible and suitable for empirical work [5].
The end-to-end data processing and analysis pipeline is visually represented in
Figure 1. This diagram shows the step-by-step process from raw PDF files to the final structured analytical output, with details on each key phase and the technologies used. All semiannual MPRs submitted by the Board of Governors to the U.S. Congress from February 2013 through June 2025 (26 reports) have been analyzed. Documents are sourced from the Federal Reserve’s public archives in their original PDF format to preserve layout fidelity and ensure full traceability to the official record.
PDFs are parsed with a layout-aware extractor to recover reading order across multi-column pages and to capture footnotes, tables, and figure captions conservatively. Tables are converted to a plain-text matrix (Markdown) so rows and columns remain machine-readable downstream. All text is normalized (UTF-8, whitespace, hyphenation, page headers/footers removed), and document-level metadata (publication date, URL, and section headers) is retained for auditability.
To maintain semantic coherence while respecting context limits, each document is segmented into overlapping units with a 200-character overlap. Let $d$ denote a document and $\{c_1, \ldots, c_K\}$ its chunks; each chunk carries pointers to its source page and character offsets for exact provenance. Each chunk $c_i$ is mapped to a dense vector in $\mathbb{R}^n$ via an embedding function
$$e_i = \phi(c_i) \in \mathbb{R}^n,$$
which produces a semantic representation suitable for similarity search. Transformer-based embeddings provide context-sensitive semantics beyond bag-of-words [23,24]. In practice, we use a compact, low-cost model to minimize latency and expense while preserving retrieval quality.
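For concreteness, the overlapping segmentation with offset provenance can be sketched as follows (a minimal illustration; the 1000-character chunk size is an assumed placeholder, since only the 200-character overlap is specified in the text):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200):
    """Split a document into overlapping character windows, keeping
    start/end offsets so each chunk can be traced back to the source."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start, step = [], 0, chunk_size - overlap
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({"text": text[start:end], "start": start, "end": end})
        if end == len(text):
            break
        start += step
    return chunks
```

Consecutive chunks share exactly `overlap` characters, and the stored offsets support the chunk-level citations used for auditability.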
All vectors $e_i$ are indexed with FAISS for sub-second approximate nearest-neighbor search in high dimensions [25]. Relevance between a user/query vector $q$ and a chunk vector $e_i$ is computed via cosine similarity:
$$\mathrm{sim}(q, e_i) = \frac{q \cdot e_i}{\lVert q \rVert\, \lVert e_i \rVert}.$$
To reduce redundancy in the retrieved set, we apply Maximal Marginal Relevance (MMR) [26]:
$$\mathrm{MMR} = \arg\max_{c_i \in C \setminus S} \left[ \lambda\, \mathrm{sim}(q, c_i) - (1 - \lambda) \max_{c_j \in S} \mathrm{sim}(c_i, c_j) \right],$$
where $C$ is the candidate pool, $S$ the selected set, and $\lambda \in [0, 1]$ balances relevance and diversity.
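A compact sketch of the greedy MMR selection (pure Python, with a plain cosine for clarity; in the pipeline the similarities come from the FAISS-indexed embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_select(query, candidates, k=3, lam=0.7):
    """Greedy MMR: at each step pick the candidate maximizing
    lam * sim(query, c) - (1 - lam) * max sim(c, already-selected)."""
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            rel = cosine(query, candidates[i])
            red = max((cosine(candidates[i], candidates[j]) for j in selected),
                      default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a small `lam`, the second pick jumps to a dissimilar chunk even if it is less relevant, which is exactly the redundancy reduction MMR is meant to provide.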
A Retrieval-Augmented Generation (RAG) architecture [27] is implemented with agentic control: a retrieval agent gathers the top-$k$ chunks for a query; a reasoning agent performs scoring and explanation; and a verification agent enforces schema and sanity checks.
In Stage 1, global stance scoring is performed. The system constructs a document-level query (“assess overall monetary stance given dual mandate trade-offs and inflation/growth signals”) and retrieves high-yield chunks spanning the report’s core sections. The LLM returns an overall stance score and a stance label, plus a short rationale with citations to chunk IDs.
Stage 2 then performs thematic extraction. Independent queries target specific dimensions: inflation, interest rates, employment, forward guidance, and key policy signals. For each dimension, the agent retrieves focused evidence and produces (i) a concise summary, (ii) a 0–5 intensity score for the relevant concern/strength, and (iii) an uncertainty note. Prompts are fully templated and versioned; temperature and decoding parameters are fixed for stability. Outputs are validated against a JSON schema (types, ranges, labels) before being written to disk.
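The verification agent’s schema check can be illustrated with a minimal validator (the field names and ranges here are illustrative assumptions mirroring the −2..+2 stance scale and 0–1 confidence; the production pipeline validates the full JSON schema):

```python
def validate_record(rec: dict) -> list:
    """Return a list of violations for one LLM output record;
    an empty list means the record passes this (illustrative) schema."""
    errors = []
    ranges = {"stance_score": (-2.0, 2.0), "confidence": (0.0, 1.0)}
    for field, (lo, hi) in ranges.items():
        val = rec.get(field)
        if not isinstance(val, (int, float)) or isinstance(val, bool):
            errors.append(f"{field}: missing or non-numeric")
        elif not lo <= val <= hi:
            errors.append(f"{field}: {val} outside [{lo}, {hi}]")
    if rec.get("stance_label") not in {"Hawkish", "Neutral", "Dovish"}:
        errors.append("stance_label: invalid label")
    return errors
```

Records that fail any check are rejected before being written to disk, which is what keeps the final dataset schema-clean.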
To cross-check the LLM outputs, two lightweight, model-free indicators are computed:
TF-IDF and cosine checks. For each theme, we build TF-IDF vectors [24]:
$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)},$$
where $\mathrm{tf}(t,d)$ is the frequency of term $t$ in document $d$, $N$ the number of documents, and $\mathrm{df}(t)$ the number of documents containing $t$. Pairwise cosine similarity within and between stance groups is then evaluated as an internal diagnostic (used later in validation).
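These checks require only a basic TF-IDF implementation; a self-contained sketch (whitespace tokenization for brevity):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors: tfidf(t, d) = tf(t, d) * log(N / df(t))."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vecs

def cosine_sparse(a, b):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Documents sharing distinctive vocabulary score higher than unrelated ones, which is the property the within-/between-stance diagnostic relies on.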
Hawkish/Dovish signal counts. Let $H$ and $D$ be counts of curated, policy-specific n-grams (e.g., “further tightening,” “accommodative stance”). The normalized signal index is defined as:
$$\mathrm{HSI} = \frac{H - D}{H + D + \varepsilon},$$
with $\varepsilon > 0$ to avoid division by zero. HSI is not used to set the stance but serves as a sanity check for directionality. The stance label is a deterministic mapping of the continuous score $s$: Hawkish if $s \ge +0.5$; Neutral if $-0.5 < s < +0.5$; Dovish if $s \le -0.5$.
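The signal index and the deterministic label mapping can be sketched as follows (the n-gram lists here are tiny illustrative stand-ins for the curated lexicons):

```python
def hawkish_signal_index(text, hawkish_terms, dovish_terms, eps=1e-6):
    """Normalized directional index in [-1, 1]: (H - D) / (H + D + eps)."""
    t = text.lower()
    H = sum(t.count(p) for p in hawkish_terms)
    D = sum(t.count(p) for p in dovish_terms)
    return (H - D) / (H + D + eps)

def stance_label(score, hi=0.5, lo=-0.5):
    """Deterministic mapping from the continuous -2..+2 score to a label."""
    if score >= hi:
        return "Hawkish"
    if score <= lo:
        return "Dovish"
    return "Neutral"
```

Because the label is a pure function of the score and fixed cutoffs, it can be recomputed and audited independently of the LLM.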
Model-reported confidence is augmented by self-consistency. The retrieval agent returns a fixed set of top-$k$ chunks; the reasoning agent is then executed $n$ times on the same retrieved context under low-temperature decoding (e.g., 0.2) to reduce gratuitous variability while preserving enough stochasticity to reveal instability. Denote by $\ell^{*}$ the modal label returned across these rerolls (Hawkish, Neutral, or Dovish) and let $k_{\ell^{*}}$ be the count of runs that agree with $\ell^{*}$. The statistics
$$\mathrm{consistency} = \frac{k_{\ell^{*}}}{n}, \qquad \mathrm{uncertainty} = 1 - \frac{k_{\ell^{*}}}{n}$$
are both exposed in the dataset as analysis-confidence and analysis-uncertainty metrics. The final label is the plurality (or majority) outcome; ties are broken by (i) choosing the label with the higher average stance score magnitude and, if still tied, (ii) selecting the label with the higher model-reported confidence. The final record stores $(\ell^{*}, \mathrm{consistency}, \mathrm{uncertainty})$ together with the continuous score.
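The reroll aggregation described above can be sketched as follows (a minimal version; `runs` would hold the n schema-validated outputs of the reasoning agent):

```python
from collections import Counter

def aggregate_runs(runs):
    """Aggregate n reroll outputs (label, score, confidence) into the
    modal label with consistency/uncertainty; ties are broken by average
    |score| and then by average model-reported confidence."""
    labels = Counter(r["label"] for r in runs)
    top = max(labels.values())
    tied = [lbl for lbl, c in labels.items() if c == top]
    if len(tied) > 1:
        def key(lbl):
            sub = [r for r in runs if r["label"] == lbl]
            avg_abs = sum(abs(r["score"]) for r in sub) / len(sub)
            avg_conf = sum(r["confidence"] for r in sub) / len(sub)
            return (avg_abs, avg_conf)
        winner = max(tied, key=key)
    else:
        winner = tied[0]
    k, n = labels[winner], len(runs)
    return {"label": winner, "consistency": k / n, "uncertainty": 1 - k / n}
```
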
For each report, the following is logged: model/version, prompt template hash, retrieval parameters, token counts, runtime, and FAISS index version. Every LLM output includes chunk-level citations back to the PDF coordinates (doc, page, character offsets), enabling one-click audit of claims. The final artifact is a single row per MPR, with the fields enumerated in Appendix A.1 and Appendix A.2. This structured dataset supports descriptive analytics, event-study linkage, and forecasting overlays while preserving full provenance to the underlying text (see Appendix B for an output example).
The GPT-4o model is used as the reasoning component because, in pilot runs, it offered the best combination of instruction following, schema adherence (JSON validity), and factual grounding under retrieval, while keeping token-level costs and latency sufficiently low for near-real-time monitoring. Lighter alternatives (e.g., “mini”) reduced unit costs but increased edit rate and degraded semantic/thematic attribution under identical prompts; “turbo-class” variants improved speed but did not consistently match GPT-4o’s reliability in schema-constrained outputs.
Given that the pipeline emphasizes auditable, low-variance outputs with explicit confidence/uncertainty, GPT-4o provided the most stable trade-off.
Section 4’s validation metrics (semantic separation, theory-concordant correlations, and bootstrap stability) were computed on GPT-4o outputs.
In the baseline run, the average end-to-end runtime per report was 18.6 s, and the average API cost per report was ≈USD 0.037 (3.7¢). Processing the whole corpus of 26 MPRs (80 pages on average each) required ≈USD 0.97 and ≈8.06 min in total, enabling low-cost, transparent monitoring. Cost per report is computed as:
$$\mathrm{Cost} = p_{\mathrm{in}} \cdot T_{\mathrm{in}} + p_{\mathrm{out}} \cdot T_{\mathrm{out}},$$
where $T_{\mathrm{in}}$ and $T_{\mathrm{out}}$ are input and output token counts and $p_{\mathrm{in}}$ and $p_{\mathrm{out}}$ the corresponding per-token prices, all logged at execution time. These figures are audited against the provider dashboard (snapshot dated July 2025). Parallelization can further reduce wall-clock time; total cost is insensitive to concurrency.
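The per-report cost formula reduces to one line of arithmetic; the prices below are placeholders for illustration, not quoted API rates:

```python
def report_cost(tokens_in: int, tokens_out: int,
                price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost = input tokens * input price + output tokens * output price,
    with prices expressed in USD per million tokens."""
    return (tokens_in * price_in_per_m
            + tokens_out * price_out_per_m) / 1_000_000
```
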
Model Selection
To evaluate the robustness of the proposed stance measurements across different model architectures, we conducted a pilot study comparing GPT-4o (proprietary), Llama 3.1–70B (open-source), and Mistral-Large-2 (open-source). Using identical retrieval parameters and prompts on a representative sample of five MPRs spanning distinct policy regimes, we assessed: (i) stance score correlation with expert benchmarks, (ii) schema adherence, (iii) semantic consistency in thematic summaries, and (iv) computational cost and latency trade-offs.
The results are summarized in
Table 1. GPT-4o consistently outperformed the alternatives, achieving the highest correlation with expert annotations (ρ = 0.89) and full schema validity. Llama 3.1–70B provided competitive stance scoring (ρ = 0.82) but exhibited schema instability (12% invalid JSON outputs in baseline runs), necessitating additional prompt engineering. Mistral-Large-2 performed well in thematic extraction (≈85% qualitative similarity) but produced stance scores with higher variance (σ = 1.12) and longer inference times.
From a cost perspective, open-source models substantially reduce marginal API expenses but demand greater computational resources (3–5× longer inference on consumer-grade GPUs) and more extensive prompt adjustments. For a corpus of 26 reports subject to strict schema requirements and validation standards, GPT-4o offered the best balance of accuracy, reliability, and development efficiency, with a total cost of roughly USD 0.97.
Finally, the strong correlation between GPT-4o and Llama 3.1–70B (ρ = 0.82) suggests that the central findings (predominant neutrality, regime-specific deviations, and theory-consistent correlations) are robust across model types. Nevertheless, as instruction-following capabilities in open-source models continue to improve, fine-tuned versions of Llama or Mistral may provide a viable alternative for larger-scale or real-time applications.
4. Validation Report
This section establishes the reliability and academic rigor of the measurement pipeline using a four-dimensional framework that addresses core concerns in text-as-data research: construct validity, external (theory) coherence, sampling stability, and content quality [5,6]. A multitrait–multimethod logic is followed to test discriminant/convergent properties of the stance construct [28], align quantitative relations with monetary-theory benchmarks [7,8], examine bootstrap stability [9], and evaluate linguistic quality using standard IR/CL metrics [24].
It is tested whether texts assigned to similar monetary stances (hawkish/neutral/dovish) are also semantically similar (and dissimilar to other stances), conditional on topic (Figure 2). For each theme $t$, TF-IDF vectors are built for each term–document pair over the theme corpus (Equation (4)), and cosine similarity is computed for each document pair $(d_i, d_j)$:
$$\cos(d_i, d_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert\, \lVert v_j \rVert}.$$
Let $S^{\mathrm{within}}_t$ denote the mean similarity among documents sharing the same stance within theme $t$, and $S^{\mathrm{between}}_t$ the mean across different stances. The consistency ratio is:
$$\mathrm{CR}_t = \frac{S^{\mathrm{within}}_t}{S^{\mathrm{between}}_t}.$$
Clear discriminant validity is found: key policy signals exhibit the strongest separation, forward guidance and inflation show meaningful separation, while employment and interest rates are moderate, reflecting shared terminology across stances. Within-group similarities consistently exceed between-group similarities, consistent with a coherent stance construct given topic conditioning [28].
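The consistency ratio is straightforward to compute from a pairwise similarity matrix and stance labels; a minimal sketch:

```python
def consistency_ratio(sims, stances):
    """CR = mean within-stance similarity / mean between-stance similarity.
    `sims[i][j]` is the pairwise similarity; `stances[i]` labels doc i."""
    within, between = [], []
    n = len(stances)
    for i in range(n):
        for j in range(i + 1, n):
            (within if stances[i] == stances[j] else between).append(sims[i][j])
    return (sum(within) / len(within)) / (sum(between) / len(between))
```

Values well above 1 indicate that same-stance documents cluster together relative to cross-stance pairs, which is the discriminant-validity property tested here.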
Likewise, it is tested whether observed correlations match the sign and magnitude intervals implied by canonical monetary-policy rules [7] and modern analyses [3,8]. For each key pair of measures $(i, j)$ (e.g., stance with inflation concern, stance with growth concern), an expected interval $[\rho^{-}_{ij}, \rho^{+}_{ij}]$ is defined ex ante. The Numerical Consistency Score (NCS) is the share of tested pairs whose empirical correlations $\hat{\rho}_{ij}$ fall within their theory range:
$$\mathrm{NCS} = \frac{1}{|P|} \sum_{(i,j) \in P} \mathbf{1}\!\left\{\hat{\rho}_{ij} \in [\rho^{-}_{ij}, \rho^{+}_{ij}]\right\}.$$
An NCS of 0.800 means that 80% of tested relationships meet theoretical expectations. Notably, the correlation between stance and inflation concern lies squarely in range, consistent with inflation-targeting logic [29].
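Computing NCS is a simple interval-membership count; a sketch with hypothetical pair names and intervals (not the paper’s actual specification):

```python
def numerical_consistency_score(observed, expected):
    """Share of correlation pairs whose observed value falls inside
    the theory-implied interval [lo, hi]."""
    hits = sum(lo <= observed[pair] <= hi
               for pair, (lo, hi) in expected.items())
    return hits / len(expected)
```
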
Now, to assess sensitivity to sample composition, we run bootstrap resampling with $B$ replicates. For each metric $M$ (e.g., confidence, inflation concern), we compute the bootstrap mean $\bar{M}^{*}$ and standard deviation $s^{*}_{M}$, then the coefficient of variation and a bounded stability score:
$$\mathrm{CV}_M = \frac{s^{*}_{M}}{\lvert \bar{M}^{*} \rvert}, \qquad \mathrm{BSS}_M = \max\!\left(0,\, 1 - \mathrm{CV}_M\right).$$
Most metrics are highly stable: confidence, inflation concern, employment strength, and growth concern all show low coefficients of variation across replicates. The stance score shows a large coefficient of variation, which is interpreted as genuine policy volatility rather than measurement error, consistent with regime shifts in U.S. monetary policy [10].
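The bootstrap stability score can be sketched as follows (pure Python; a fixed seed keeps the check reproducible):

```python
import random

def bootstrap_stability(values, B=1000, seed=0):
    """Resample with replacement, compute the bootstrap-mean distribution,
    and return the bounded stability score max(0, 1 - CV)."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(B):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    mu = sum(means) / B
    var = sum((m - mu) ** 2 for m in means) / B
    cv = (var ** 0.5) / abs(mu) if mu else float("inf")
    return max(0.0, 1.0 - cv)
```

A nearly constant metric yields a score close to 1, while a series that oscillates around zero collapses toward 0, matching the interpretation of the stance score as genuinely variable.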
Following IR/CL standards, we summarize three dimensions (length, lexical variety, and entropy) into a composite quality score by theme. Let $\mu_L$ and $\sigma_L$ be the mean and standard deviation of chunk length, $V$ the vocabulary size, and $p(w)$ the empirical probability of token $w$.
The Length Consistency (LC) is:
$$\mathrm{LC} = \max\!\left(0,\, 1 - \frac{\sigma_L}{\mu_L}\right).$$
LD defines the lexical diversity (type–token ratio):
$$\mathrm{LD} = \frac{V}{N_{\mathrm{tokens}}}.$$
The information content is normalized by entropy:
$$\mathrm{IC} = \frac{-\sum_{w} p(w) \log p(w)}{\log V}.$$
Finally, the per-theme composite is:
$$\mathrm{QS}_t = \frac{\mathrm{LC} + \mathrm{LD} + \mathrm{IC}}{3}.$$
Key policy signals attain the highest quality (QS = 0.771), followed by Forward Guidance (0.743). Employment, Interest Rates, and Inflation are moderate (0.541–0.608), yielding an overall average CQS = 0.647, indicative of concise, information-rich summaries where the FED’s messaging is most explicit [30].
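A sketch of the composite, under assumed functional forms (LC as one minus the length coefficient of variation, LD as the type–token ratio, IC as entropy normalized by log V; the paper’s exact definitions may differ):

```python
import math
from collections import Counter

def content_quality(chunks):
    """Composite QS = (LC + LD + IC) / 3 under the assumed definitions:
    LC = max(0, 1 - sigma/mu) of chunk lengths, LD = type-token ratio,
    IC = token entropy normalized by log(vocabulary size)."""
    lengths = [len(c) for c in chunks]
    mu = sum(lengths) / len(lengths)
    sigma = (sum((l - mu) ** 2 for l in lengths) / len(lengths)) ** 0.5
    lc = max(0.0, 1.0 - sigma / mu)
    tokens = [t for c in chunks for t in c.lower().split()]
    counts = Counter(tokens)
    ld = len(counts) / len(tokens)
    probs = [n / len(tokens) for n in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    ic = entropy / math.log(len(counts)) if len(counts) > 1 else 0.0
    return (lc + ld + ic) / 3
```
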
Finally, the four dimensions are aggregated with equal weights to form the Overall Validation Score (OVS):
$$\mathrm{OVS} = \frac{\mathrm{SCS} + \mathrm{NCS} + \mathrm{BSS} + \mathrm{CQS}}{4}.$$
The semantic-consistency dimension is normalized to SCS = 1.0 (scale-setter), with NCS = 0.800, BSS = 0.770 (average across metrics), and CQS = 0.647. The resulting OVS = 0.796 is graded as B+ (Good). This level indicates robust, publication-grade measurement that (i) separates stances semantically, (ii) comports with a theory-grounded correlation structure, (iii) is stable for most reported metrics, and (iv) is communicated with sufficient linguistic quality for auditability.
All thresholds (e.g., stance cutoffs, retrieval, MMR) were specified ex ante and held fixed across documents; no parameter was tuned to maximize validation scores. Random seeds, prompt templates, model versions, and FAISS index IDs are logged to enable exact reruns. Validation computations are implemented in a separate script so that the analytical and reporting layers are decoupled and auditable [15].
5. Hawkish or Dovish
Now that the pipeline is set up and validated, Figure 3 traces the stance time series from 2013 to 2025 and reveals a straightforward three-phase narrative: a post-crisis accommodation period (2013–2015), a gradual normalization (2016–2020), and a post-pandemic tightening cycle (2021–2025). The distribution of labels (Neutral 50.0%, Dovish 26.9%, Hawkish 23.1%), together with an average stance close to zero (≈0.02) and volatility around 0.87, indicates that, over long horizons, the FED’s written communications cluster near neutrality with intermittent but decisive excursions during regime shifts.
The finding that 50% of the observations fall into the Neutral category requires careful interpretation. This outcome may reflect either (i) genuine policy equilibrium during periods of balanced risks, or (ii) conservative thresholds in the stance-label mapping. We examine this issue from several perspectives.
First, the continuous stance scores reveal considerable variation (σ ≈ 0.866), with values ranging from −1.8 to +1.5, indicating that the Neutral label encompasses a heterogeneous group of communications. Under this framework, approximately 30% of reports cluster tightly around zero (±0.2), representing truly centrist positions, while the remaining 20% lie between −0.5 and −0.2 or between +0.2 and +0.5, reflecting mild leanings without strong commitments. This pattern is consistent with the Federal Reserve’s well-documented tendency toward gradualism and data-dependent guidance, where conditional signaling is often favored over categorical policy declarations [31,32,33].
Second, we conducted a threshold sensitivity analysis by relaxing the cutoffs to +0.4 and −0.4, compared with the baseline values of +0.5 and −0.5. Under these alternative thresholds, the distribution becomes more balanced (Neutral 38.5%, Dovish 30.8%, Hawkish 30.7%). However, this reclassification increases label instability across bootstrap resamples (average confidence falling from 0.90 to 0.76) and reduces semantic consistency ratios (CR for Key Policy Signals declining from 2.358 to 1.821). These results suggest that the baseline thresholds strike a more reliable balance between classification stability and directional discrimination, favoring conservative labeling to avoid false positives.
Third, we validated the economic content of the Neutral category by analyzing thematic profiles. Reports labeled Neutral are characterized by higher levels of policy uncertainty (normalized intensity 0.68 vs. 0.42 for Hawkish and 0.51 for Dovish), more balanced attention to inflation and growth, and frequent use of conditional language (“if incoming data,” “depending on developments”). Such features point to intentional ambiguity, consistent with the literature that highlights central banks’ use of vague communication to maintain flexibility when the macroeconomic outlook is uncertain [2,33,34].
Finally, comparison with the Hawkish Signal Index (HSI, Equation (5)) shows that 85% of Neutral reports have |HSI| < 0.3, reinforcing the view that these texts convey balanced or mixed signals. The remaining 15%, those with higher |HSI| but Neutral labels, typically reflect offsetting signals across themes (e.g., hawkish on inflation but dovish on employment). In these cases, the GPT-4o reasoning agent correctly synthesizes the signals into an overall centrist stance, underscoring the value of contextual, multi-factor assessment over one-dimensional keyword methods.
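The threshold sensitivity exercise above is easy to replicate on any stance-score series; a sketch with illustrative scores (not the paper’s actual series):

```python
def label_distribution(scores, hi=0.5, lo=-0.5):
    """Share of Hawkish / Neutral / Dovish labels under given cutoffs,
    used to probe how sensitive the Neutral share is to the thresholds."""
    n = len(scores)
    hawk = sum(s >= hi for s in scores) / n
    dove = sum(s <= lo for s in scores) / n
    return {"Hawkish": hawk, "Dovish": dove, "Neutral": 1 - hawk - dove}
```

Comparing `label_distribution(scores)` against `label_distribution(scores, hi=0.4, lo=-0.4)` reproduces the kind of Neutral-share shift reported above.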
In this sense, this pattern accords with the communication literature’s emphasis on the informational content of central-bank language, beyond mechanical rate moves, and its stabilizing role for expectations [1,3,4,29]. Two turning points stand out in the figure: (i) deep dovish readings around the 2020 pandemic response and (ii) a subsequent hawkish drift peaking during the inflation-fighting phase, consistent with the institution’s state-contingent posture in adverse shocks and its later resolve to re-anchor expectations. The prevalence of Neutral observations is also consistent with the “gradualism” doctrine in FED communication, keeping guidance steady unless incoming data warrant a change [35], a doctrine that ref. [31] framed as aligning expectations without over-committing to a path.
Next,
Figure 4 introduces the Market Impact Score (MIS) to summarize how much a given communication is likely to matter for markets. The index prioritizes policy surprise, i.e., the magnitude of change in stance relative to the previous report, while recognizing the roles of uncertainty, signal strength, and analytical confidence.
As shown in
Figure 4, MIS spikes align with historically salient communications (e.g., the 2020 emergency guidance and the 2022 anti-inflation pivot). This accords with the idea that surprises, rather than levels, move prices in efficient markets [36], and with evidence that both “monetary” and “non-monetary” news in central-bank text carry incremental information [4]. The positive correlation we observe between absolute stance and MIS suggests that clearer, more decisive narratives garner more attention (a “clarity premium”), echoing survey conclusions that transparent communication improves market predictability [2]. Methodologically, MIS is complementary to, not a replacement for, the stance score: it aggregates how much a report may matter now, given surprise and uncertainty, while the stance score conveys what direction policy risk points toward.
Figure 5 decomposes the narrative into thematic intensities (inflation concern, growth concern, employment strength, and policy uncertainty), which helps interpret stance movements economically. The inflation dimension exhibits a U-shaped path: easing concerns through 2019, followed by a sharp rise in 2021–2022 as broad-based pressures emerged, consistent with the inflation-targeting logic that links tighter stances to persistent inflation deviations [29].
Employment strength trends upward through 2019, collapses in 2020, and then recovers, mirroring the dual-mandate lens emphasized by [32]. Growth concern moves inversely with employment strength, reflecting slack and demand shortfalls in downturns. Policy uncertainty peaks at regime transitions, in line with the uncertainty literature’s prediction that shifts in policy regimes or the macro environment temporarily widen beliefs [37]. Together, these components rationalize the time-series behavior of stance in Figure 3: when inflation concern dominates and employment remains robust, the system reads more hawkish; when growth concern spikes amid employment weakness, it reads more dovish; and when signals are mixed, neutrality and higher uncertainty prevail.
Figure 6 examines transitions, i.e., changes in stance between consecutive reports, and shows that significant moves (large absolute changes in stance) are clustered around crises and inflationary episodes. The asymmetry is instructive: dovish adjustments during acute stress tend to be sudden and large, whereas hawkish shifts often build over successive reports as the Committee gauges persistence in inflation pressures.
This pattern is consistent with asymmetric loss or preference functions in central banking [31] and with evidence of regime shifts in U.S. monetary policy over recent decades [10]. Notably, the presence of several moderate transitions between significant moves suggests that policy smoothing remains relevant on average [38], even if exceptional conditions trigger outsized adjustments. From a monitoring perspective, the transition chart operationalizes “watch points”: sequences of moderate hawkish steps coupled with rising inflation concern typically precede peaks in hawkish stance, whereas sharp dovish swings coincide with stress episodes and heightened uncertainty.
Finally, Figure 7 contrasts multi-dimensional signal profiles across stance categories via radar charts. Hawkish reports concentrate intensity on inflation concern and hawkish lexicon, with lower growth concern and moderate uncertainty, a profile consistent with Taylor-rule prescriptions when inflation is above target [7].
Dovish reports invert this pattern, emphasizing growth concern and dovish phrasing, often with a stronger employment focus where labor-market slack or participation dynamics motivate accommodation. Neutral reports distribute signal mass more evenly but display the highest uncertainty, reflecting subtle trade-offs and conditional guidance (“data dependence”) that stop short of directional commitments. This cross-section validates that the stance labels are not arbitrary tags but summarize coherent bundles of textual signals, which the validation framework (
Section 4) corroborates through semantic separation and theory-concordant correlations.
Two practical takeaways follow. First, the stance series and its components provide a policy dashboard that researchers can join to asset-price data for event studies or forecasting overlays; the MIS offers a ranking of which reports are most likely to matter contemporaneously. Second, the thematic breakdown clarifies why stance moves: for instance, a hawkish drift accompanied by rising inflation concern and stable employment strength is a different policy configuration from one where hawkishness coexists with falling employment strength and spiking uncertainty, the former typically signaling persistence, the latter caution. Overall, these results align with the literature that central-bank words shape expectations [
1,
2], that textual shocks have macro-relevant content [
3], and that nuanced, validated text measures can augment empirical monetary analysis in near real time, at low cost and low latency.
Early efforts to analyze central bank communication relied heavily on dictionary methods and bag-of-words representations [
12,
14]. While useful as a starting point, these approaches face clear constraints in capturing contextual nuance and semantic depth. Dictionary-based scoring assigns fixed sentiment weights to individual terms, but it cannot account for negation, conditional phrasing, or domain-specific usage. This often results in misclassification; for instance, the phrase “higher for longer” conveys guidance on persistence rather than simply a hawkish stance. In addition, manually curated word lists introduce subjective bias and limit the portability of the method across different document types or institutional settings.
By contrast, more recent techniques built on embeddings and transformer architectures [
23], combined with retrieval-augmented generation (RAG) [
27], preserve contextual sensitivity, distinguish between literal and conditional statements, and provide auditable traceability through chunk-level citations. The agent-based RAG framework we propose addresses the limitations of earlier methods by integrating dense semantic representations with explicit reasoning steps. This allows us to produce measurements that capture not only the direction of policy signals but also the confidence behind them, while maintaining full transparency back to the original text.
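The dense-retrieval step that underpins this traceability can be illustrated with a minimal sketch. This is not the paper's implementation: the toy three-dimensional vectors and the pure-Python `cosine` helper stand in for learned embeddings and a FAISS index, purely for exposition.

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, chunk_vecs, k=2):
    # Rank chunk embeddings by similarity to the query embedding and
    # return the indices of the top-k chunks, as a vector index would.
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings: three document chunks; the query is closest to chunk 0.
chunks = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.5, 0.5, 0.5]]
query = [0.9, 0.2, 0.0]
top = retrieve(query, chunks, k=2)  # indices of the two nearest chunks
```

Because retrieval returns chunk indices, every downstream inference can be traced back to specific text spans, which is the auditability property emphasized above.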
6. Analysis Results: Main Findings and Limitations
This section interprets the time-series and cross-sectional evidence produced by the pipeline and situates it within the communication-and-markets literature. The stance series in
Figure 3, with Neutral as the modal category (50.0%), Dovish at 26.9%, Hawkish at 23.1%, a mean near zero (~0.02), and volatility σ ≈ 0.87, conveys a communication regime that generally aims for stability, punctuated by decisive excursions during macro inflection points.
That configuration is consistent with the view that central-bank language is itself an instrument for shaping expectations, working alongside policy rates and balance-sheet tools [
1,
2]. In particular, the deep dovish readings around the 2020 shock and the subsequent hawkish drift, about +0.8 points in the most recent window, mirror the well-documented pivot from emergency accommodation to disinflationary resolve, aligning with evidence that the information content of communications transmits macro shocks [
3] and that both “monetary” and “non-monetary” news components in policy texts are priced by markets [
4].
What is new in this analysis is the document class and the measurement design. Most prior work emphasizes FOMC statements, minutes, press conferences, or high-frequency windows around meetings [
6,
11,
12]. We instead focus on the semiannual MPR, a longer, structured narrative to Congress that has received less stance-centered treatment but offers a relatively stable format for like-for-like measurement across time. The proposed agentic RAG architecture adds confidence/uncertainty to each observation. It is paired with a multi-pronged validation suite (
Section 3), addressing persistent concerns about construct validity, theory concordance, sampling stability, and content quality in text-as-data [
5,
6,
15]. This combination of MPR focus, auditable scoring, and explicit validation provides a stance series that is both reproducible and directly interpretable against monetary theory benchmarks.
The MIS in
Figure 4 further clarifies when communications matter most. By construction, MIS weights policy surprise (the absolute change in stance), uncertainty, signal strength, and analysis reliability. Its spikes line up with historically salient communications (emergency dovish guidance in 2020, the anti-inflation pivot in 2022), consistent with efficient-markets logic that surprises, not levels, move prices [
36] and with research that decomposes monetary announcements into target and information components [
39].
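The MIS construction described above can be sketched as a weighted combination of its four components. The weights, the normalization of the surprise term, and the example inputs below are illustrative assumptions, not the paper's calibration:

```python
def market_impact_score(stance_change, uncertainty, signal_strength,
                        reliability,
                        w_surprise=0.4, w_unc=0.2, w_sig=0.2, w_rel=0.2):
    """Toy MIS: weights policy surprise (|change in stance|), uncertainty,
    signal strength, and analysis reliability. Components other than
    stance_change are assumed pre-scaled to [0, 1]; stance_change lives
    on the [-2, +2] stance scale, so |delta| is normalized by 4."""
    surprise = min(abs(stance_change) / 4.0, 1.0)
    return (w_surprise * surprise + w_unc * uncertainty
            + w_sig * signal_strength + w_rel * reliability)

# A large dovish swing with high uncertainty scores higher than a
# quiet, confident near-neutral report.
crisis = market_impact_score(-1.6, 0.8, 0.9, 0.9)
calm = market_impact_score(0.1, 0.2, 0.4, 0.9)
```

The key design point survives any re-weighting: the surprise term depends on the *change* in stance, not its level, which is what aligns MIS spikes with salient communications rather than with persistently hawkish or dovish periods.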
The empirical co-movement between the stance series and MIS suggests a clarity premium: more decisive narratives draw greater market attention, echoing survey conclusions that higher transparency aids predictability [
2]. While MIS is not a substitute for stance, the pair jointly distinguishes direction (stance) from salience (impact), a separation that is often blurred in dictionary-based tone indices.
The thematic decomposition in
Figure 5 aligns stance fluctuations with the FED’s dual-mandate framework. Inflation concern follows a U-shape, easing through 2019, surging in 2021–2022, consistent with inflation-targeting logic [
8,
20,
29]. Employment strength rises steadily pre-COVID, collapses in 2020, and recovers thereafter; growth concern moves inversely, mirroring slack and demand shortfalls in downturns.
Policy uncertainty peaks near transitions, consistent with the broader uncertainty literature [
35] and with the idea that regime changes temporarily widen belief distributions. These components rationalize the stance dynamics in
Figure 3: hawkish readings co-occur when inflation concern is elevated, and the labor market remains resilient; dovish readings emerge when growth concern rises, and employment weakens; neutral readings dominate when signals are mixed, often alongside higher uncertainty and conditional, data-dependent guidance. Importantly, this internal logic is validated externally by the Numerical Consistency Score (NCS = 0.800) in
Section 4, which checks that observed correlations conform to theory-implied intervals [
7,
8].
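The numerical-consistency check can be sketched as the share of observed stance–theme correlations that fall inside theory-implied intervals. The correlation values and bands below are illustrative assumptions chosen only to show the mechanics (e.g., Taylor-rule logic implies a positive stance–inflation-concern band):

```python
def numerical_consistency_score(observed, implied):
    """observed: dict pair -> correlation; implied: dict pair -> (lo, hi).
    NCS = fraction of observed correlations inside their theory band."""
    hits = sum(1 for pair, r in observed.items()
               if implied[pair][0] <= r <= implied[pair][1])
    return hits / len(observed)

observed = {
    ("stance", "inflation_concern"): 0.62,
    ("stance", "growth_concern"): -0.55,
    ("stance", "employment_strength"): 0.41,
    ("stance", "uncertainty"): 0.15,   # falls outside its assumed band
    ("stance", "dovish_lexicon"): -0.70,
}
implied = {
    ("stance", "inflation_concern"): (0.3, 0.9),
    ("stance", "growth_concern"): (-0.9, -0.2),
    ("stance", "employment_strength"): (0.1, 0.8),
    ("stance", "uncertainty"): (-0.6, 0.0),
    ("stance", "dovish_lexicon"): (-0.95, -0.4),
}
ncs = numerical_consistency_score(observed, implied)  # 4 of 5 in band
```

In this toy configuration four of five correlations land in their bands, giving NCS = 0.8; the paper's reported NCS = 0.800 is computed over its own correlation set and intervals.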
The transition analysis in
Figure 6 adds a dynamic layer. Large swings cluster around crises and inflationary episodes; dovish adjustments appear sudden and sizable during acute stress, whereas hawkish shifts often build cumulatively over successive reports as persistence is assessed. This asymmetry is consistent with models of asymmetric central-bank loss functions [
31] and with evidence of regime variation in U.S. monetary policy [
10]. The presence of several moderate moves between large steps illustrates policy smoothing on average [
38]. For monitoring, such patterns operationalize “watch points”: sequences of modest hawkish steps paired with rising inflation concern often precede hawkish peaks; sharp dovish swings co-occur with stress episodes and heightened uncertainty.
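The transition classification can be sketched as thresholding the change between consecutive stance scores. The 0.5/1.0 cutoffs and the stylized series below are assumptions for illustration, not the paper's calibration:

```python
def classify_transitions(stances, moderate=0.5, significant=1.0):
    """Label the change between consecutive stance scores as
    minor, moderate, or significant by |delta| thresholds."""
    out = []
    for prev, curr in zip(stances, stances[1:]):
        delta = curr - prev
        if abs(delta) >= significant:
            label = "significant"
        elif abs(delta) >= moderate:
            label = "moderate"
        else:
            label = "minor"
        out.append((round(delta, 2), label))
    return out

# Stylized series: an abrupt dovish swing (2020-like) followed by a
# cumulative hawkish build-up over successive reports.
series = [0.2, -1.5, -1.2, -0.6, 0.1, 0.8]
transitions = classify_transitions(series)
```

Applied to the stylized series, the classifier reproduces the asymmetry discussed above: one large dovish jump, then a run of moderate hawkish steps.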
Relative to dictionary or bag-of-words approaches [
12,
14], the agentic RAG system captures contextual nuance, for example, distinguishing “higher for longer” as persistence guidance rather than simply classifying “higher” as hawkish. It also audits each inference to specific text spans, addressing reproducibility critiques in automated content analysis. The validation results (
Section 4), semantic separation (high CR in Key Policy Signals), bootstrap stability for most metrics, and quality scores that are highest precisely where policy language is most explicit, reinforce that the pipeline measures a coherent construct rather than artifacts of prompt phrasing. Moreover, attaching confidence and uncertainty to every observation provides a principled way to down-weight low-reliability readings in downstream empirical work (e.g., weighting in event studies or state-space filters).
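One simple way to use the attached reliability in downstream work is a confidence-weighted average. A minimal sketch, assuming confidence values are scaled to [0, 1] (the scores and weights below are invented for illustration):

```python
def weighted_stance(scores, confidences):
    """Confidence-weighted mean stance: low-reliability readings
    contribute proportionally less to the aggregate."""
    total = sum(confidences)
    return sum(s * c for s, c in zip(scores, confidences)) / total

scores = [1.0, 0.2, -0.4]
confidences = [0.9, 0.3, 0.6]      # the middle reading is least reliable
wm = weighted_stance(scores, confidences)
naive = sum(scores) / len(scores)  # unweighted mean for comparison
```

The same weights can serve as observation variances in a state-space filter or as regression weights in an event study, which is the down-weighting use case suggested above.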
These findings have practical implications. First, the stance index and its components can be joined to yield-curve and equity data to study market responses to narrative variation beyond rate surprises, complementing high-frequency identification of announcement shocks [
11,
31,
37]. Second, the MIS can inform portfolio nowcasting by flagging which report vintages merit closer scrutiny when allocating attention and risk. Third, the uncertainty trace, peaking at transitions, connects naturally to policy-uncertainty measures [
33] and encourages explicit modeling of uncertainty channels in term-premium or macro-finance frameworks. Finally, by structuring outputs at the theme level (inflation, rates, employment, guidance, key signals), the dataset supports interpretability-first applications (e.g., “which arguments drive hawkishness?”), complementing black-box sentiment scores.
Two stylized facts emerge that have not been documented for the MPR corpus specifically. First, stance asymmetry across regimes: dovish responses are lumpy and front-loaded during stress, whereas hawkishness often accumulates as inflation persistence is established. Second, uncertainty as a leading co-indicator of stance change: peaks in the text-derived uncertainty metric frequently coincide with or anticipate categorical stance transitions, reinforcing the view that communication transitions involve “narrative re-anchoring” before policy levels fully adjust. Both facts are consistent with, but not implied by, prior work focused on short statements or press conferences; they arise from sustained, long-form communication in the MPR and the granular theme scoring we implement.
Several caveats temper interpretation. The MPR is published twice a year, which restricts the proposed series to medium-frequency observations (26 reports over roughly 12.5 years) rather than meeting-by-meeting dynamics. Although this frequency is sufficient to capture regime shifts and sustained changes in policy orientation, as shown by the identification of the dovish turn in 2020 and the hawkish drift in 2022, it inevitably leaves out intra-period volatility and short-term tactical adjustments that are communicated through FOMC statements, press conferences, or speeches. For applications that demand near-real-time monitoring or high-frequency event studies, the framework would need to be extended to these more frequent sources.
At the same time, the biannual cadence provides distinct analytical advantages. Each MPR represents the Committee’s consolidated view over a six-month horizon, thereby filtering out temporary noise and highlighting more persistent policy concerns. This makes the reports particularly well suited for tracking medium-term stance evolution and for linking narrative shifts to macroeconomic developments that unfold across quarters rather than days. The trade-off between frequency and comprehensiveness is inherent in document selection. The results obtained show that, even at a biannual frequency, validated text-based measures from MPRs capture economically meaningful variation consistent with theoretical benchmarks and documented policy transitions. If finer temporal resolution is required, the framework can be applied to FOMC statements (eight meetings per year), which represent a natural next step for extending coverage without compromising measurement rigor. The proposed measures in this investigation are descriptive, not causal. Any claims about price effects require linking to high-frequency asset moves and controlling for confounders, as emphasized by the identification literature [
27,
40]. Model output may be sensitive to prompt or model updates, though the proposed logging and validation routines mitigate this risk and make changes transparent. Finally, the U.S. institutional context matters; portability to other central banks is promising [
12,
16] but should not be assumed without re-validation.
In sum, the analysis confirms and extends three pillars of literature. First, it reaffirms that central-bank words matter for expectations and risk pricing [
1,
2]. Second, it shows that validated, auditable text measures can recover theory-concordant structure from long-form policy documents, not just short statements [
3,
4]. Third, it contributes new evidence on regime asymmetries, uncertainty dynamics, and impact salience in the MPR setting, delivered at low cost and latency through an agentic RAG pipeline with explicit reliability metrics. These features, together with the transparent validation of
Section 4, make the resulting stance series a credible input for policy monitoring dashboards, event-study classifications, and macro-finance models that incorporate communication as a state variable.
7. Conclusions
This paper set out to solve a concrete research problem: to measure, consistently and quickly, the hawkish–dovish stance embedded in the Federal Reserve’s semiannual Monetary Policy Reports. The motivation is well established: central-bank communication itself shapes expectations and asset prices beyond the mechanical effect of rate moves [
1,
2], but the MPR has received comparatively less stance-focused, long-form treatment. The question was whether one can build an auditable, low-latency system that turns the MPR’s narrative into quantitative stance indicators with transparent uncertainty and validation.
The objective was to develop a near-real-time NLP pipeline, embedding-based retrieval (FAISS) orchestrated with LangChain and scored by GPT-4o, that (i) produces a continuous stance score in [−2, +2] with a mapped categorical label (Dovish/Neutral/Hawkish), (ii) extracts theme-level rationales (inflation, interest rates, employment, forward guidance, key signals), and (iii) attaches confidence/uncertainty to every observation. We hypothesized that an agentic RAG design would recover stable, theory-consistent measures that track policy transitions and deliver usable decision signals at low cost and low latency. The evidence supports this hypothesis: the stance series exhibits economically sensible dynamics; the correlation structure aligns with canonical monetary theory; and the validation suite indicates good reliability.
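The mapping from the continuous score to the categorical label can be sketched as symmetric thresholding. The ±0.5 cutoffs are an assumption for illustration; the paper does not state its exact thresholds here:

```python
def stance_label(score, cutoff=0.5):
    """Map a stance score in [-2, +2] to Dovish/Neutral/Hawkish
    using symmetric cutoffs around zero (cutoffs are assumed)."""
    if not -2.0 <= score <= 2.0:
        raise ValueError("stance score must lie in [-2, +2]")
    if score <= -cutoff:
        return "Dovish"
    if score >= cutoff:
        return "Hawkish"
    return "Neutral"

labels = [stance_label(s) for s in (-1.3, 0.02, 0.8)]
```

Under these cutoffs a score of 0.02, close to the series mean reported below, maps to Neutral, consistent with Neutral being the modal category.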
The methodology combines layout-aware PDF parsing, chunking with provenance, dense embeddings, FAISS indexing, and two-stage agentic prompting (stance scoring followed by thematic synthesis). The dataset comprises 26 MPRs spanning 26 February 2013 to 20 June 2025, ensuring like-for-like comparisons in a stable document format. A four-dimensional validation framework of semantic consistency, numerical consistency, bootstrap stability, and content quality yields an integrated score of 0.796 (B+/Good), with NCS = 0.800 against theory-implied ranges, strong semantic separation in key themes, and high bootstrap stability for most metrics (
Section 3).
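The bootstrap-stability component of the validation framework can be sketched as resampling a report-level metric and inspecting the dispersion of its mean. The data, seed, and resample count below are toy choices using only the standard library:

```python
import random

def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean:
    resample with replacement, collect resampled means, and take
    the alpha/2 and 1 - alpha/2 percentiles."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy stance series: a narrow interval signals a stable metric.
stances = [0.2, -1.5, -0.6, 0.1, 0.8, 1.0, -0.2, 0.3]
lo, hi = bootstrap_mean_ci(stances)
```

A metric is deemed "stable" in this spirit when its bootstrap interval stays narrow relative to the stance scale; the paper applies the idea across its validation metrics rather than to the raw mean alone.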
The main findings are as follows. First, the stance distribution is Neutral 50.0%, Dovish 26.9%, and Hawkish 23.1% with a mean near zero (~0.019) and volatility σ ≈ 0.866, implying a communication regime that clusters near neutrality but allows decisive excursions at regime shifts. Second, a recent hawkish drift (~+0.8 points) is consistent with the inflation-fighting phase. Third, the thematic decomposition reconciles stance with the dual mandate: inflation concern follows a U-shape (easing pre-2020, rising in 2021–2022), employment strength collapses in 2020 and recovers, growth concern moves inversely to employment, and policy uncertainty peaks at transitions.
Fourth, transition analysis reveals an asymmetry: dovish moves during stress are abrupt and significant, whereas hawkish shifts tend to build cumulatively, consistent with asymmetric central-bank preferences [
31] and regime-variation evidence [
10]. Finally, the Market Impact Score helps to rank report vintages by prospective salience, highlighting that surprises (changes in stance), not levels, are most likely to be consequential for markets [
36,
39].
Relative to the literature, these results both confirm and extend prior findings. They reaffirm that words matter [
1,
2] and that textual signals transmit macro information [
3,
4]. They extend the evidence to the MPR (a long-form, structured report) using a pipeline with confidence/uncertainty and explicit diagnostics. Two stylized facts are the regime asymmetry of stance transitions and the role of uncertainty as a leading co-indicator around stance changes.
There are, however, limitations. The MPR is biannual, so the series has medium frequency and cannot capture meeting-to-meeting dynamics. The proposed measures are descriptive, not causal; price effects require high-frequency identification and careful controls [
33,
38]. Outputs may be sensitive to prompt/model updates, though we mitigate this via strict logging, versioning, and ex-ante parameter choices. External validity beyond the United States is promising but requires re-validation given institutional differences.
Future research should: (i) extend coverage to FOMC statements, minutes, press conferences, and speeches to increase frequency and triangulate signals; (ii) link stance and MIS to high-frequency asset-price moves and expectations (yields, OIS, options) for causal validation; (iii) compare open-source vs. proprietary models and ablate retrieval/scoring choices; (iv) integrate the stance series into macro-finance models (e.g., term-premium decompositions) and forecasting overlays; and (v) formalize guardrails for LLM measurement error and domain drift, following best practices in text-as-data.
In closing, the paper’s principal contribution is a validated, low-cost (around 1 USD for all 26 reports, averaging 80 pages each, including the embedding/vectorization step), near-real-time framework that transforms the FED’s Monetary Policy Reports into quantitative, auditable measures of policy stance, with theme-level explanations and explicit uncertainty, which researchers and practitioners can immediately incorporate into monitoring dashboards, event-study designs, and macro-finance analyses. By focusing on the MPR and uniting agentic RAG with a multi-pronged validation suite, the study bridges modern NLP and monetary economics, providing a reliable, extensible measure of the FED’s narrative stance that complements existing indicators and opens new avenues for empirical work.