GPT Deciphering Fedspeak: Quantifying Dissent Among Hawks and Doves

Markets and policymakers around the world hang on the consequential monetary policy decisions made by the Federal Open Market Committee (FOMC). Publicly available textual documentation of their meetings provides insight into members’ attitudes about the economy. We use GPT-4 to quantify dissent among members on the topic of inflation. We find that transcripts and minutes reflect the diversity of member views about the macroeconomic outlook in a way that is lost or omitted from the public statements. In fact, diverging opinions that shed light upon the committee’s “true” attitudes are almost entirely omitted from the final statements. Hence, we argue that forecasting FOMC sentiment based solely on statements will not sufficiently reflect dissent among the hawks and doves.


The Road to FOMC Transparency
The Federal Open Market Committee (FOMC) is responsible for controlling inflation in the United States, using instruments which dramatically affect the housing and financial markets, among others. For most of the 20th century, conventional wisdom held that monetary policy was most effective when decision-making was shrouded in secrecy; the tight-lipped Alan Greenspan, a past chairman of the Fed, quipped about "learning to mumble with great incoherence." But times change. Blinder et al. (2008) show how the emergence of greater transparency and strategic communication became an important feature of 21st century central banking. Fed communication is now an integral component of monetary policy, and "Fed watchers" dote on every word. The FOMC first started releasing public statements following their meetings in February 1994. This meager documentation grew and now consists of three types for each official meeting: carefully produced and highly stylized one-page statements are released immediately after each FOMC meeting, followed about three weeks later by lengthier minutes, and finally five years later by full, verbatim transcripts. Hereafter, we refer to this triplet as documents. We find that minutes closely reflect the content of transcripts, so to avoid redundancy, we focus our analysis on transcripts and statements.
Increased FOMC communication has prompted social science research spanning the disciplines of economics, sociology, finance and political science (Section 2.1). Financial market participants are also keenly interested. Billions, if not trillions, of dollars are traded on the Fed's words. The interpretations, right or wrong, of what the FOMC "really means" move markets and affect the economy. However, relying upon documents as data in the social sciences is a challenge due to the lack of structure and the cost of annotation (Grimmer and Stewart, 2013; Gentzkow et al., 2019; Ash and Hansen, 2023). Hansen and Kazinnik (2023) show that Generative Pre-trained Transformer (GPT) models outperform a suite of commonly used NLP methods on text quantification. Motivated by these results, we set out to quantify the language of the FOMC using GPT-4 (OpenAI, 2023) by preparing a combined data set of FOMC documents from 1994 to 2016 (Section 2). We conclude that transcripts contain more dissent than statements (Section 3).

FOMC Data: Transcript to Statement
The FOMC normally meets eight times per year in order to assess current economic conditions, ultimately deciding upon the path for monetary policy. We aggregate and release the official publicly available text documenting these deliberations by the Fed as an aligned corpus of documents from 1994 to 2016. These text documents are similar in content, but transcripts and statements are dramatically different in style and detail (Figure 1). For our purposes, the statements required no pre-processing. For the transcripts, we use regular expressions to partition and then re-aggregate the text by each unique speaker. See Appendix D for an example of each document type.
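The speaker-level partitioning described above might look like the following minimal sketch. The speaker-turn pattern (an all-caps title and surname followed by a period at the start of a line, such as "CHAIRMAN GREENSPAN." or "MR. TARULLO.") and the function name `split_by_speaker` are assumptions made for illustration; the regular expressions actually used are not given in the text.

```python
import re
from collections import defaultdict
from typing import Dict

# Assumed turn marker: an all-caps title plus surname, then a period, at line start.
# This pattern is hypothetical and would need adjusting to the real transcript layout.
SPEAKER_RE = re.compile(
    r"^(?P<speaker>(?:MR|MS|MRS)\.\s+[A-Z][A-Z'\-]+"
    r"|(?:VICE\s+)?(?:CHAIRMAN|CHAIR)\s+[A-Z][A-Z'\-]+)\.\s*",
    re.MULTILINE,
)


def split_by_speaker(transcript: str) -> Dict[str, str]:
    """Partition a transcript into turns, then re-aggregate the text per speaker."""
    remarks = defaultdict(list)
    matches = list(SPEAKER_RE.finditer(transcript))
    for i, match in enumerate(matches):
        # A turn runs from the end of its marker to the start of the next marker.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(transcript)
        text = " ".join(transcript[match.end():end].split())
        if text:
            remarks[match.group("speaker")].append(text)
    # Join all turns by the same speaker into one block of text.
    return {speaker: " ".join(parts) for speaker, parts in remarks.items()}


sample = (
    "CHAIRMAN GREENSPAN. Good morning.\n"
    "MR. TARULLO. I have doubts.\n"
    "CHAIRMAN GREENSPAN. Thank you.\n"
)
by_speaker = split_by_speaker(sample)
```

Each value in `by_speaker` is the concatenated text of one speaker for one meeting, which is the unit that would then be scored at the speaker level.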

Lessons from Social Science
Past work has typically used a single form of FOMC meeting documentation, and rarely multiple forms in conjunction. For example, in the finance literature, Mazis and Tsekrekos (2017) apply Latent Semantic Analysis to FOMC statements to identify the main "themes" used by the committee and how well they explain variation in Treasury yields. Gu et al. (2022) use minutes to investigate how the tonality of committee deliberations impacts subsequent stock market valuations. Political scientists use transcripts to estimate committee members' preferences on inflation and unemployment (Baerg and Lowe, 2020). Economists have assessed the role of communication in achieving monetary policy objectives by looking at similar documents (Romer and Romer, 2004; Handlan, 2020). Hüpper and Kempa (2023) investigate the extent to which shifting inflation focus is reflected in full transcripts. Edison and Carcel (2021) apply Latent Dirichlet Allocation (LDA) to transcripts to detect the evolution of prominent topics. Hansen et al. (2017) use LDA to quantify transcripts and identify how transparency affects the committee's deliberations.

Hawks and Doves
We take the transcripts to best represent FOMC members' underlying attitudes and think of the statements as stylized representations of what they wish to communicate publicly. To identify disagreement, we need to go beyond the statements and look closely at the language employed by members in their remarks throughout meeting transcripts. Dissenting votes are rare because of a strong historical norm: members dissent only if they feel very strongly that the committee's decision is wrong. Modest disagreements do not merit dissent. That said, members of the committee do frequently voice detectable disagreements with one another at meetings. Disagreements frequently concern the state of the economy, the outlook for inflation, and many other things, including where the range for the federal funds rate should be set that day. Such debate and deliberation among members is a routine and productive element of the meetings. These disagreements are more clearly expressed in the transcripts. Daniel Tarullo's comments in a 2016 transcript illustrate the point: "it is institutionally important for us to project an ability to agree, even if only at a fairly high level, and that is why I abstained rather than dissented over each of the past several years [...] I have gone out of my way in the past four years not to highlight publicly my points of difference with the statement."

Manual Analysis to Create a Gold Label
Dissent amongst speakers is normally concentrated on the discussion of the economic and financial situation of the U.S., specifically inflation targeting. For example, in the January 2016 meeting transcript, the committee discusses their 2 percent inflation projection in the context of factors such as oil prices, the job market, and the Chinese economy. Many of the members argue that the inflation projection of 2 percent will not be accurate, while others who do support the 2 percent projection qualify their support with varying degrees of uncertainty. In this meeting, President Mester concludes, "My reasonable confidence that inflation will gradually return to our objective over time recognizes there is and has always been large uncertainty regarding inflation forecasts." Mr. Tarullo argues in opposition, "... I didn't have reasonable confidence that inflation would rise to 2 percent. Nothing since then has increased my confidence. To the contrary, a few more doubts have crept in." If the meeting statement were an accurate representation of what transpired at the meeting, it would follow that the uncertainty of the committee regarding their inflation forecast would be communicated. Instead, the diverging individual opinions are omitted in the final statement, which reads: "The Committee currently expects that, with gradual adjustments in the stance of monetary policy, economic activity will expand at a moderate pace and labor market indicators will continue to strengthen. Inflation is expected to remain low in the near term, in part because of the further declines in energy prices, but to rise to 2 percent over the medium term as the transitory effects of declines in energy and import prices dissipate and the labor market strengthens."
We apply this level of granular analysis to create a gold label for statements, classifying each one according to the Hawk and Dove definitions proposed by Hansen and Kazinnik (2023): Dovish (-1.0), Mostly Dovish (-0.5), Neutral (0), Mostly Hawkish (0.5), Hawkish (1.0). A trained undergraduate does a first pass and escalates any borderline cases to an FOMC expert for adjudication. We manually review the January 2016 transcript and pair it with a simple computational analysis, which finds that most dissent at that time surrounded topics relating to inflation (Appendix C).

GPT-4 "Reads" Terse Documents
GPT-4, and Large Language Models more broadly, are a suitable tool for rapid linguistic processing at scale. We produce three different measurements of hawk/dove sentiment using statements and two using transcripts. Hansen and Kazinnik (2023) use GPT-3 to quantify 500 sentences selected uniformly at random from FOMC statements between 2010 and 2020. We extend this by using GPT-4 to quantify all 3728 sentences in statements from 1994 to 2016 (Appendix B). A limitation of this approach is that the holistic sentiment of the meeting is not captured because each sentence is scored independently, without context.
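To turn a model reply into a number, the five labels can be mapped onto the numeric scale used for the gold labels (Dovish = -1.0 through Hawkish = 1.0). The sketch below is illustrative only: the function name `label_to_score` and the assumption that each reply contains exactly one label are not from the paper, and no API call is shown.

```python
from typing import Optional

# Numeric scale taken from the gold-label definitions in the text.
LABEL_SCORES = {
    "dovish": -1.0,
    "mostly dovish": -0.5,
    "neutral": 0.0,
    "mostly hawkish": 0.5,
    "hawkish": 1.0,
}


def label_to_score(reply: str) -> Optional[float]:
    """Map a free-text model reply to a score, or None if no label is found.

    Assumes the reply mentions exactly one label (a hypothetical simplification).
    """
    text = reply.strip().lower()
    # Check longer labels first so "mostly hawkish" is not read as "hawkish".
    for label in sorted(LABEL_SCORES, key=len, reverse=True):
        if label in text:
            return LABEL_SCORES[label]
    return None


example_scores = [label_to_score(r) for r in ["Mostly Hawkish", "Label: Dovish.", "Neutral"]]
```

Returning `None` for unparseable replies, rather than defaulting to Neutral, keeps parsing failures visible instead of silently inflating the Neutral share.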
The first statement measurement we propose is to simply take an unweighted mean of all individual sentence scores for each meeting. Because most sentences have nothing to do with inflation (62% of sentences scored as Neutral), we hypothesize that this method has further limitations. To address this, the second measurement ingests each entire statement into our GPT-4 prompt. Both of these measurements, along with manual gold labels, can be seen in Figure 3.
For the final statement measurement, we construct a logit-scaled score (Lowe et al., 2011),
$$\theta^{(L)} = \log \frac{\text{Hawk} + 0.5}{\text{Dove} + 0.5},$$
where Hawk and Dove are the sums of the hawkish and dovish scores, respectively. In this approach $\theta^{(L)}$ ignores sentences scored as Neutral, placing more emphasis on the relative rather than absolute differences between hawkish and dovish sentiment. Furthermore, since $\theta^{(L)}$ has no predefined end points, this allows us to generate positions at any level of extremity, which more appropriately reflects the outlier meetings.
When measuring hawk/dove sentiment using transcripts, the vast amount of text adds an additional challenge. Rather than evaluating a 100+ page transcript at the sentence level, we instead evaluate transcripts at the speaker level. That is to say, we use GPT-4 to quantify hawk/dove sentiment for each distinct speaker within each transcript, and then aggregate those by meeting date. In contrast to the 3728 individual sentence observations across all statements, we evaluate 5691 speaker observations across all transcripts. We then follow the same steps outlined above to calculate unweighted average and logit-scaled scores.
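As a worked illustration of the two aggregate measures, the sketch below computes the unweighted mean and a logit-scaled score for one meeting, reading the logit scale of Lowe et al. (2011) as log((Hawk + 0.5) / (Dove + 0.5)). Treating Hawk and Dove as sums of score magnitudes (so that Dove is non-negative) is an assumption; the text says only that they are "the sums of the hawkish and dovish scores". The function names are hypothetical.

```python
import math
from typing import List


def unweighted_mean(scores: List[float]) -> float:
    """Plain average of all scores for a meeting, Neutral (0) included."""
    return sum(scores) / len(scores)


def logit_scaled_score(scores: List[float]) -> float:
    """Logit-scaled position: log((Hawk + 0.5) / (Dove + 0.5)).

    Assumption: Hawk sums the positive scores and Dove sums the magnitudes
    of the negative scores. Neutral scores (0) contribute to neither sum.
    """
    hawk = sum(s for s in scores if s > 0)
    dove = sum(-s for s in scores if s < 0)
    return math.log((hawk + 0.5) / (dove + 0.5))


# Toy meeting: one hawkish, one mostly hawkish, two neutral, one mostly dovish score.
meeting = [1.0, 0.5, 0.0, 0.0, -0.5]
mean_score = unweighted_mean(meeting)       # (1.0 + 0.5 - 0.5) / 5 = 0.2
logit_score = logit_scaled_score(meeting)   # log(2.0 / 1.0), about 0.693
```

Adding further Neutral scores to `meeting` pulls the mean toward zero but leaves the logit-scaled score unchanged, which is the property the text highlights.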
The logit-scaled measurements for transcripts and statements track one another quite closely over most of the years, demonstrating that GPT-4 is effective at identifying similar content across document types of dramatically different length and style. It is worth noting, however, that the logit-scaled transcript scores have larger extremes than statements, especially the upper bound: (-3.66, 3.37) and (-3.40, 2.40), respectively. This underscores how the FOMC devotes considerable attention to curating their communication strategy to convey confidence and unity despite dissent among the hawks and doves. The logit-scaled scores begin to diverge in 2012, with transcripts trending more neutral/hawkish while statements remain mostly dovish. Hence, we propose a direct comparison.

Measuring Dissent
We can use the sentence-level statement and speaker-level transcript scores from GPT-4 to compute a measure of dissent for each meeting using the following algorithm:
1. From the list of scores for each meeting, count the number of hawkish/mostly hawkish and the number of dovish/mostly dovish scores.
2. If there is at least one hawkish score and at least one dovish score within the same meeting, assign Dissent = 1. Else, Dissent = 0.
We find that 47% of statements and 82% of transcripts contain dissent. We also compute the conditional probability of a transcript containing dissent given the associated statement's binary dissent indicator, $P(T = 1 \mid S = 1)$ and $P(T = 1 \mid S = 0)$. We find that when a statement contains dissent, $P(T = 1 \mid S = 1)$, the transcript also contains dissent more than 97% of the time. However, for statements scored as having no dissent, $P(T = 1 \mid S = 0)$, we find that more than 69% of associated transcripts are scored as containing dissent. This means that for the 53% of statements that don't show signs of dissenting opinions, there is likely dissent in the transcript as evidenced by the speaker-level hawk/dove scores.
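The two-step dissent rule and the conditional probabilities can be sketched as follows. The function names and the toy flags are illustrative assumptions, not the paper's data; scores above zero are treated as hawkish/mostly hawkish and scores below zero as dovish/mostly dovish.

```python
from typing import List, Optional


def has_dissent(scores: List[float]) -> int:
    """Return 1 if a meeting has at least one hawkish and one dovish score, else 0."""
    n_hawk = sum(1 for s in scores if s > 0)   # hawkish or mostly hawkish
    n_dove = sum(1 for s in scores if s < 0)   # dovish or mostly dovish
    return int(n_hawk >= 1 and n_dove >= 1)


def conditional_rate(transcript_flags: List[int],
                     statement_flags: List[int],
                     given: int) -> Optional[float]:
    """Estimate P(T = 1 | S = given) from paired per-meeting dissent flags."""
    paired = [t for t, s in zip(transcript_flags, statement_flags) if s == given]
    if not paired:
        return None  # no meetings satisfy the condition
    return sum(paired) / len(paired)


# Toy data for four hypothetical meetings (not the paper's results).
statement_flags = [has_dissent(s) for s in ([0.5, -0.5], [0.0, 0.5], [0.0], [1.0, -1.0])]
transcript_flags = [has_dissent(s) for s in ([1.0, -0.5], [0.5, -1.0], [0.5, 0.0], [-0.5, 0.5])]

p_t_given_s1 = conditional_rate(transcript_flags, statement_flags, given=1)
p_t_given_s0 = conditional_rate(transcript_flags, statement_flags, given=0)
```

Because the rule only requires one score on each side, it is a coarse indicator: it registers the presence of opposing views but not how many members hold them or how strongly.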

Conclusion and Next Steps
Our method of ingesting the entire statement for an aggregate prediction better captures the extremes and more closely mirrors the gold-label human annotation, suggesting that Large Language Models can avoid the noise in this nuanced context. The F1 score for this comparison is 0.57. While this is rather low as a measure of model "fit", it is important to note that the results rarely flip sentiment (from hawkish to dovish, or mostly hawkish to mostly dovish); rather, they mostly disagree on adjacent categories. See Figure 3 for a visual comparison of the sentence-level, entire text, and manual scores. Of note, the inconsistent provision of statements and relatively high volatility in hawk/dove sentiment before 2000 is consistent with Meade and Stasavage (2008) and Hansen et al. (2017), who have also studied the 1993 change in FOMC communication strategy. We demonstrate that GPT-4 can identify the extremes in dissenting hawk and dove perspectives despite the indications of a clear consensus in the statements. This empirical finding supports our manual analysis.
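The distinction between a low F1 and outright sentiment flips can be illustrated with a small sketch. The text does not say how the five-class F1 was averaged, so the macro average used here is an assumption, as are the function names and the toy labels.

```python
from typing import List


def macro_f1(gold: List[float], pred: List[float]) -> float:
    """Macro-averaged F1 over all labels seen in gold or pred (assumed averaging)."""
    labels = sorted(set(gold) | set(pred))
    f1_values = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)


def count_sign_flips(gold: List[float], pred: List[float]) -> int:
    """Count predictions on the opposite side of Neutral from the gold label."""
    return sum(1 for g, p in zip(gold, pred) if g * p < 0)


# Toy labels: two adjacent-category misses, no hawkish/dovish flips.
gold = [1.0, 0.5, 0.0, -0.5, -1.0]
pred = [0.5, 0.5, 0.0, 0.0, -1.0]

f1 = macro_f1(gold, pred)             # 7/15, roughly 0.47
flips = count_sign_flips(gold, pred)  # 0
```

In this toy case F1 is below 0.5 even though every error is an adjacent-category miss, which is the pattern the text describes.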
While we focus on transcripts and statements, future work may consider an even more fine-grained analysis, incorporating minutes as well. We found the content of the minutes to more closely resemble the transcripts than the statements, but differences do exist and remain underexplored.
Additionally, we note that GPT-4 made more neutral predictions than the gold standard manual labels. To improve upon this, we created a balanced few-shot example using sentences from FOMC statements not included in our sample (meetings since 2020). This marginally improved the prediction "fit" (F1 from 57% to 58%), but we expect that this could be improved much further with additional prompt engineering.
GPT-4 is able to quickly quantify stylized economic text. Our results from quantifying dissent support the hypothesis that dissenting opinions on the topic of inflation omitted from FOMC statements can be found in the associated transcripts. As LLMs continue to improve, we expect that it will be possible to study even more nuanced questions than the ones we answer here.

Limitations
Substantively, strategic signaling in the FOMC is a challenging topic and this is only an initial investigation. Dissent does not have clear ground truth labels and thus we are reliant on human judgment and our team's substantive expertise on monetary policy. Finally, as with much current research, our work relies on OpenAI's GPT API, which poses challenges to computational reproducibility, as it relies on the stability of an external system that we cannot control.

C Computational Analysis
We paired our manual review of the January 26-27, 2016 transcript with a computational analysis of dissent in the meeting. We stratified the meeting into nine topics, each corresponding to a portion of the transcript content. As a baseline, we counted the number of speakers in each section to see if this metric could reflect dissent. This technique, however, seemed to reflect the length of the conversation as opposed to the degree to which members disagreed with one another.
Our next approach was to do a sentiment analysis of each topic to see if the prevalence of negativity could indicate dissent. We supposed that negative sentiment would be high if the speakers opposed the stance of either other individuals or the committee as a whole. Using the VADER lexicon (Hutto and Gilbert, 2014), we calculated the sentiment of each sentence within the nine topics. Since VADER is trained on web-based social media content, which is typically more abrupt than the formal language appearing in the FOMC transcript, we conducted the sentiment analysis by sentence to optimize the method's performance.
To analyze dissent more specifically, we computed the fraction of negative sentences in each topic. For this analysis, we set the threshold negativity score to be 0.1. That is, sentences with a negativity score of 0.1 or higher were classified as negative while all others were not. This number was determined by manually reviewing what sentences were captured by varying thresholds and evaluating whether or not they conveyed dissent. When the threshold was set too low (0.05), four out of ten randomly selected sentences conveyed dissent. When set too high (0.15), seven out of ten randomly selected sentences conveyed dissent, but many sentences that indicated dissent were omitted. At the threshold of 0.1, still seven out of ten randomly selected sentences conveyed dissent, and more sentences that conveyed dissent were captured.
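The thresholding step can be sketched as below, assuming each sentence already has a negativity score. The function name `fraction_negative` and the toy scores are hypothetical; the comment shows one way such scores might be obtained from the `vaderSentiment` package, which is an assumption about tooling rather than a description of the authors' code.

```python
from typing import List

# One possible source of per-sentence negativity scores (assumed tooling):
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   neg = SentimentIntensityAnalyzer().polarity_scores(sentence)["neg"]


def fraction_negative(neg_scores: List[float], threshold: float = 0.1) -> float:
    """Share of sentences whose negativity score is at or above the threshold."""
    if not neg_scores:
        return 0.0
    flagged = sum(1 for score in neg_scores if score >= threshold)
    return flagged / len(neg_scores)


# Toy negativity scores for the sentences of one hypothetical topic.
topic_scores = [0.0, 0.12, 0.1, 0.05]

share_at_010 = fraction_negative(topic_scores)          # 2 of 4 sentences
share_at_005 = fraction_negative(topic_scores, 0.05)    # 3 of 4 sentences
share_at_015 = fraction_negative(topic_scores, 0.15)    # 0 of 4 sentences
```

Sweeping the threshold this way mirrors the manual calibration described above: a lower threshold flags more sentences, and a higher one flags fewer.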

D Document Examples
See Figures 6, 7, and 8 for examples of the documents.

Figure 1: The transcripts (left) contain opinions and disagreements but statements (right) are concise. We analyze both datasets using GPT-4 prompting.

Figure 2: Text taken from an FOMC meeting on December 11, 2001. A dovish statement does not reflect the hawkish sentiment of Speaker 1. GPT-4 can quantify dissent lost from transcripts to statements.

Figure 3: Our analysis of statements finds that averaging at the sentence-level (gray) loses information since the average sentence is Neutral. Ingesting the overall statement (blue) better mirrors the manual gold label (gold).

Figure 4: Our analysis of transcripts with GPT-4 at the speaker level shows that meetings consistently have dissenting opinions. As in the previous figure, 2001 to 2004 is dovish (blue), while 2004 to 2006 is hawkish (orange).
Dovish: Strongly expresses a belief that the economy may be growing too slowly and may need stimulus through monetary policy.
Mostly dovish: Overall message expresses a belief that the economy may be growing too slowly and may need stimulus through monetary policy.
Neutral: Expresses neither a hawkish nor dovish view and is mostly objective.
Mostly hawkish: Overall message expresses a belief that the economy is growing too quickly and may need to be slowed down through monetary policy.
Hawkish: Strongly expresses a belief that the economy is growing too quickly and may need to be slowed down through monetary policy.
</labels>
Which label best applies applies to the statement (Dovish, Mostly Dovish, Neutral, Mostly Hawkish, Hawkish)? """

Figure 5: Discussion of inflation (Red and Blue topics) is more contentious than other topics and the average of all data (dotted line).