Guideline Learning for In-context Information Extraction

Large language models (LLMs) can perform a new task by merely conditioning on task instructions and a few input-output examples, without optimizing any parameters. This is called In-Context Learning (ICL). In-context Information Extraction (IE) has recently garnered attention in the research community. However, the performance of in-context IE generally lags behind state-of-the-art supervised expert models. We highlight a key reason for this shortfall: underspecified task descriptions. The limited-length context struggles to thoroughly express the intricate IE task instructions and various edge cases, leading to misalignment with humans in task comprehension. In this paper, we propose a Guideline Learning (GL) framework for in-context IE that reflectively learns and follows guidelines. During the learning phase, GL automatically synthesizes a set of guidelines from a few error cases, and during inference, GL retrieves helpful guidelines for better ICL. Moreover, we propose a self-consistency-based active learning method to enhance the efficiency of GL. Experiments on event extraction and relation extraction show that GL can significantly improve the performance of in-context IE.


Introduction
Information extraction (IE), whose primary goal is to extract structured information from unstructured plain text, serves as a critical foundation for numerous downstream tasks such as question answering and knowledge base construction (Wang et al., 2022a; Fei et al., 2022). IE tasks typically have complex task settings due to their requirement of translating diverse real-world facts into a few predefined classes. This often necessitates a large number of rules and examples to thoroughly and accurately define the target concept of the task. For example, the guidelines for ACE relation extraction extend over 33 pages (Consortium, 2008). In the past, the supervised learning paradigm has been applied to fine-tune numerous parameters on massive data to accurately learn the concept (Li et al., 2020; Zheng et al., 2019). This approach, while effective, is data-intensive, hard to train, and difficult to update.
Recently, however, the NLP community has witnessed the rapid rise of large language models (LLMs) such as PaLM (Chowdhery et al., 2022), ChatGPT (OpenAI, 2023a), and LLaMA (Touvron et al., 2023). These LLMs have achieved great performance on a wide range of NLP tasks with their superior language understanding power, but fine-tuning them faces closed-source and high-training-cost issues. In-Context Learning (ICL) (Brown et al., 2020), a characteristic feature of LLMs, offers a solution that harnesses the power of LLMs while sidestepping these issues. ICL enables LLMs to perform new tasks without tuning any parameters; instead, they are given only the task instruction and a few input-output examples as the prompt. It achieves promising performance on many tasks like natural language inference and sentiment classification (Brown et al., 2020), demonstrating a new paradigm in the NLP community.
Several recent studies have explored the ICL paradigm for IE (Han et al., 2023; Wei et al., 2023). Impressively, by merely providing task instructions and a handful of in-context examples, LLMs can achieve strong performance on many IE tasks. However, they still lag behind supervised SOTA models (Han et al., 2023).
We underline one primary reason for the suboptimal performance: the underspecified task description. As discussed earlier, the target concept of IE is inherently complex, but the input context used to elucidate the target concept to LLMs is constrained by its limited length. Consequently, the concept comprehended by LLMs might deviate from the target concept. An example of this is illustrated in Figure 1. For the sentence "The shipments have arrived into the stock", the pre-defined relation types Content-Container and Entity-Destination present a grey area concerning the relation between the entities "shipments" and "stock". The target concept is embodied in a rule in the annotation guidelines, "motion verbs prevailing over stative relations", which is misaligned with the concept comprehended by the LLM.
This paper attempts to mitigate this problem by introducing a Guideline Learning (GL) framework. The framework replicates the human annotation process, which first gathers annotation guidelines and then annotates accordingly. Specifically, it has two phases. In the learning phase, a set of guidelines is iteratively learned from scratch based on a few labeled instances. A guideline here is a natural language rule derived by integrating an appropriate extrapolation of an error instance with its true label. This differs from previous supervised learning methods, which learn a set of model parameters. In the inference phase, given a new instance, the framework retrieves relevant rules from the guidelines to compose a prompt, which includes the task instruction, the retrieved rules, a few examples, and the input instance. It then asks an LLM agent to finish the task given the prompt. This failure-driven reminding mechanism, similar to Madaan et al. (2022), is inspired by the theory of recursive reminding in psychology (Jacoby and Wahlheim, 2013), which suggests that humans learn from error cases and recall the most helpful experiences when encountering a new case.
Furthermore, we incorporate a self-consistency-based active learning method to enhance the efficiency of label utilization. We also propose a "generalizer" to assist in the generation and retrieval of guidelines. Finally, we conduct in-depth experiments on two representative IE tasks: (1) event extraction on financial documents, and (2) relation extraction on general-domain resources, both of which feature relatively complex target concepts. Experimental results indicate that using 50 labeled samples per class can greatly boost the performance of ICL in both tasks.

Overview
Figure 2 presents an overview of the Guideline Learning (GL) framework. For the inference phase, assume we have collected a set of guidelines for a task. Given an input instance x, the GL framework first retrieves a set of relevant rules from the guidelines. A query is constructed by assembling the task instruction, a few in-context examples, the instance, and the retrieved rules. The query is then forwarded to an LLM agent, which generates both the answer and the references (rules that the agent deems beneficial for this particular instance). During the learning phase, the framework iterates over a few training instances to generate and learn guidelines from scratch. For each instance, if the predicted answer from the LLM agent differs from the annotation, another LLM agent generates a new rule and updates the existing guidelines.
In the following sections, we detail the inference phase (Sec 2.2), the learning algorithm (Sec 2.3), and an active instance selection method for effective guideline learning (Sec 2.4).

Inference
In this section, we introduce how to predict the answer for an instance x in the GL framework. Suppose we have collected the guidelines G = {r_i}, i = 1, ..., |G|, a set of rules that supports read, write, and retrieve operations. Each rule, expressed as a natural language description, explicates one aspect of the task, while the guidelines as a whole implicitly reflect the target concept of the task. The inference process unfolds as follows.
Retrieve. We retrieve the top-k rules R from G that are most relevant to x: R = Retrieve(x, G), where R ⊂ G. We can also retrieve some input-output examples N from the training dataset D.
Reason. The task instruction T, the instance x, the few-shot examples N, and the retrieved rules R are integrated to create a query q, which is used to ask the reasoner which class the instance belongs to: ŷ, R* = Reason(q), where reasoning is performed by an LLM agent with ICL capability, ŷ is the predicted answer, and R* ⊂ R is a returned subset of retrieved rules that the agent deems helpful during reasoning. R* is used to evaluate the quality of the rules in Sec 2.3.

Learning Algorithm
In this section, we introduce the learning algorithm, which reflectively learns guidelines from a collection of instance-label pairs. The pseudocode of this algorithm is presented in Algorithm 1. In each epoch, we first predict on all instances to get the response comprising the answer ŷ and the references R*. If the answer is wrong, an LLM agent generates a new guideline and appends it to a cache. We do not update the guidelines immediately, to ensure stable reasoning within one epoch; after the iteration, we merge the rules in the cache into the guidelines. Besides, we keep a score for each rule based on whether it leads to correct answers. At the end of an epoch, rules with a score below a threshold are regarded as harmful and removed from the guidelines.
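As a sketch, the per-epoch loop described above might look like the following. Here `predict`, `write_rule`, and `score` are hypothetical stand-ins for the LLM-agent calls and the rule-scoring step, not the paper's actual implementation.

```python
def learn_guidelines(train_set, guidelines, predict, write_rule,
                     score, threshold=0.0, epochs=3):
    """Sketch of the guideline-learning loop (cf. Algorithm 1).

    predict(x, guidelines) -> (y_hat, refs): the answer and the rules the
    LLM agent says it used; write_rule(x, y_hat, y) -> a new natural-
    language rule; score(rule) -> helpfulness score. All three are
    hypothetical stand-ins for prompted LLM calls.
    """
    for _ in range(epochs):
        cache = []                       # new rules; merged only after the epoch
        for x, y in train_set:
            y_hat, refs = predict(x, guidelines)
            if y_hat != y:               # error case -> reflect into a new rule
                cache.append(write_rule(x, y_hat, y))
        guidelines.extend(cache)         # update guidelines between epochs
        # forgetting mechanism: drop rules whose score falls below threshold
        guidelines[:] = [r for r in guidelines if score(r) >= threshold]
    return guidelines
```

Keeping new rules in a cache until the end of the epoch matches the paper's choice of stable within-epoch reasoning.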
Specifically, the rules are generated as follows.
If the predicted answer ŷ is wrong, the instance x, the prediction ŷ, and the true label y are given to an LLM agent to write a rule: r = Reflect(x, ŷ, y). The score of a rule is computed as follows. For a rule r ∈ G, we compute its prior score from its statistics: score(r) = (N_hit − N_wrong) / N_retr, where N_retr, N_hit, and N_wrong are the number of instances in which the model retrieves r (r ∈ R), refers to r (r ∈ R*) and predicts correctly, and refers to r and predicts wrongly, respectively. The prior score indicates the helpfulness of a rule based on the historical responses.
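One plausible reading of this prior score, counting helpful references (N_hit) against harmful ones (N_wrong) normalized by the number of retrievals (N_retr), can be sketched as follows; the exact functional form in the original system may differ.

```python
def prior_score(n_retr, n_hit, n_wrong):
    """Helpfulness of a rule based on its historical statistics.

    n_retr: times the rule was retrieved; n_hit: times it was referred to
    in a correct prediction; n_wrong: times it was referred to in a wrong
    one. This is one plausible reading of the prior score, not its
    confirmed exact form.
    """
    if n_retr == 0:
        return 0.0                      # never-retrieved rule: neutral score
    return (n_hit - n_wrong) / n_retr
```

Under this reading, a rule frequently retrieved but rarely referred to drifts toward zero, while one that misleads the reasoner goes negative and is eventually discarded.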

Active Instance Selection
In this section, we investigate how to select instances for annotation to construct the training dataset for effective guideline learning (Sec 2.3). Random sampling can be inefficient, as the model may already predict a large portion of instances correctly. To alleviate this problem, we propose an active learning approach that prioritizes instances on which the model is most uncertain. Assume we have a collection of instances I = {x_m}, m = 1, ..., |I|. Following self-consistency chain-of-thought (Wang et al., 2022b), for each instance x we first sample T reasoning paths and answers {(r_t, ŷ_t)}, t = 1, ..., T, with a relatively high temperature. We then obtain the model's probability for each class c by marginalizing out the sampled reasoning paths: p(c | x) = (1/T) Σ_{t=1}^{T} 1(ŷ_t = c). The consistency of the sampled answers indicates the model's confidence: a sharp probability distribution indicates high confidence on this instance, whereas a flat distribution indicates low confidence. We compute the negative entropy of the probability distribution to measure the model's confidence on this instance: conf(x) = Σ_c p(c | x) log p(c | x). We select the top-k instances with the lowest confidence scores. The underlying assumption is that the model is more prone to errors on instances with lower confidence.
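The selection procedure above can be sketched as follows. `sample_answers` is a hypothetical stand-in for sampling T reasoning paths from the LLM at a high temperature; only the entropy-based ranking is shown.

```python
import math
from collections import Counter

def confidence(sampled_answers):
    """Self-consistency confidence: the negative entropy of the empirical
    class distribution over the sampled answers. Values closer to 0 mean
    the reasoning paths agree, i.e. the model is confident."""
    total = len(sampled_answers)
    probs = [n / total for n in Counter(sampled_answers).values()]
    return sum(p * math.log(p) for p in probs)   # negative entropy <= 0

def select_uncertain(instances, sample_answers, k):
    """Pick the k instances the model is least confident about.
    sample_answers(x) -> list of T sampled answers for instance x."""
    scored = [(confidence(sample_answers(x)), i)
              for i, x in enumerate(instances)]
    scored.sort()                                # lowest confidence first
    return [instances[i] for _, i in scored[:k]]
```

An instance on which all T sampled answers agree gets confidence 0 (maximal), so disagreement-heavy instances are prioritized for annotation.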

Task and Implementation
We implement the guideline learning framework for two information extraction tasks: event extraction (Sec 3.1) and relation extraction (Sec 3.2). We choose these tasks because their target concepts are relatively complex.

Event Extraction
Event extraction (EE) aims to extract structured events from unstructured texts. Figure 3 gives an example of EE. The event structure is predefined by an event schema consisting of event classes and corresponding event roles. For example, the equity repurchase event has roles such as company name, repurchased shares, and closing date. In this paper, we decompose EE into three sequential sub-tasks:
1. event trigger identification (ETI), which identifies all candidate event triggers in the text;
2. event trigger classification (ETC), which classifies candidate event triggers into event classes;
3. event argument extraction (EAE), which identifies the event arguments of a given trigger and recognizes the specific roles they play.
For this task, we apply guideline learning to ETC. Specifically, given an event schema and a set of candidate triggers in a text, one instance here is the text and one candidate trigger. Note that it is also feasible to apply guideline learning to EAE; we leave this as future work.

Relation Extraction
Relation extraction (RE) aims to predict the semantic relation between a pair of entities in text. Figure 1 presents an example of RE. According to a recent report (Han et al., 2023), even when equipped with chain-of-thought prompting, ChatGPT can only reach a maximum of 43% of the performance of state-of-the-art RE methods.
For RE, we directly apply guideline learning to assist in distinguishing relation concepts. Specifically, given a set of relation types and one entity pair from a text, one instance here is the text and one entity pair.

Implementation

For all LLM agents, we use the official API of ChatGPT (OpenAI, 2023a) to generate outputs. To prevent the influence of dialogue history, we generate the response separately for each testing sample.

Generalizer. We introduce an important LLM agent, the generalizer, to narrow the shallow semantic gap between instances and rules. The generalizer is an LLM agent that extrapolates the instance x to a more general form x̃ by abstracting common properties such as company names and dates. We use x̃ instead of x to retrieve and generate rules. Figure 3 presents an example of the generalizer in EE. We provide some intuition for the generalizer in Appendix A.3.

Retrieval. For an input instance, we use its general form x̃ to sort the rules in the guidelines by the semantic similarity between x̃ and each rule. Specifically, we use the embedding API (text-embedding-ada-002) from OpenAI (2023b) to obtain the embeddings of x̃ and r, and use cosine similarity as the semantic similarity score. The few-shot demonstrations are randomly chosen from the training data, and are fixed for all instances and methods in each task.

Figure 3: An example (translated) of event extraction from the ChFinAnn dataset (Zheng et al., 2019). We decompose EE into three sub-tasks: event trigger identification, event trigger classification, and event argument extraction, and present the output of each sub-task. The trigger classification panel shows:
- Instance (summarized): 23,000,000 shares are the total shares that Guanao Group Co., LTD. has repurchased as of June 30, 2011.
- General form: The total shares that a company has repurchased as of a specific date.
- Label: Equity_Repurchase
- General form + Rule: The total shares that a company has repurchased as of a specific date triggers an equity repurchase event.
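The embedding-based retrieval step can be sketched as follows. `embed` is a stand-in for an embedding call (e.g. text-embedding-ada-002); in practice one would cache rule embeddings rather than recompute them per query.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(general_form, rules, embed, k=5):
    """Rank rules by cosine similarity between the embedding of the
    instance's general form and each rule's embedding, returning the
    top-k. `embed(text) -> vector` is a hypothetical stand-in for the
    embedding API."""
    q = embed(general_form)
    ranked = sorted(rules, key=lambda r: cosine(q, embed(r)), reverse=True)
    return ranked[:k]
```

Retrieving against the general form rather than the raw instance is what lets literally dissimilar but conceptually related rules surface.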
Reflect. In this paper, we simply concatenate the general form of the instance and its gold label to generate a rule. Figure 3 presents an example of this process in EE. Note that our implementation only requires the official APIs, without any parameter updating.

Experiments
We conduct experiments to demonstrate the effectiveness of the GL framework on event extraction (Sec 4.1) and relation extraction (Sec 4.2). In the last section, we analyze the quality of the learned guidelines and conduct case studies (Sec 4.3).

Setup
Dataset. We use the ChFinAnn dataset (Zheng et al., 2019), a distantly supervised document-level event extraction dataset of Chinese financial documents, to conduct our experiments. Zheng et al. (2019) highlight that one challenge is detecting multiple event instances in one document. We focus on four event types: Equity Freeze (EF), Equity Repurchase (ER), Equity Underweight (EU), and Equity Overweight (EO). For the test set, we randomly sample at most 200 documents of proper token length for each event type from the original test set, due to the token length limit of OpenAI's API. More details are presented in Appendix A.1.1.

Metrics
We use role-level micro precision, recall, and F1 for evaluation, following previous work (Zheng et al., 2019).
Method. Though guideline learning only targets ETC, we also provide simple solutions for the other two sub-tasks to enable comparison with other methods. Specifically, for ETI, as all event types relate to equity transactions, we identify text spans of the form "{number} shares" as candidate triggers via string matching. For ETC, we apply the guideline learning framework and conduct binary classification for each event type. As the documents in this dataset are long, we apply an extra LLM agent to generate a description of what each trigger means according to the document, and use the generated description as the input instance for classification. For EAE, we apply an LLM agent to generate an event table in markdown format given the predicted event triggers.
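The string-matching trigger identifier can be sketched as a one-line regular expression. The pattern below is an English-style stand-in; the dataset is Chinese, where the actual surface form would be the Chinese equivalent of "{number} shares".

```python
import re

# Candidate triggers are spans of the form "{number} shares" (an
# English-style stand-in for the Chinese pattern used on ChFinAnn).
TRIGGER_PAT = re.compile(r"\b[\d,]+\s+shares\b")

def identify_triggers(text):
    """Return all candidate trigger spans matched in the text."""
    return [m.group(0) for m in TRIGGER_PAT.finditer(text)]
```

As the appendix notes, this simple pattern trades precision for high recall: every event record contains such a span, but many matched spans are negative triggers that ETC must filter out.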
Compared Models. (1) ReDEE (Liang et al., 2022) and DE-PPN (Yang et al., 2021): two supervised methods. We reproduce DE-PPN on the entire dataset strictly following the official code; ReDEE runs out of memory on a 12GB GPU, so we do not reproduce it. (2) EE-ICL: prompt the LLM to directly output the event table without predicting event triggers. (3) EE-GL-b: the baseline version of our guideline learning method with empty guidelines. (4) EE-GL-r: our guideline learning method; we randomly sample 50 documents from the training set and annotate event triggers. (5) EE-GL-a: our guideline learning method with active instance selection (Sec 2.4); we actively select documents sampled from the training set and annotate event triggers.
We use the same human-annotated demonstrations for all EE methods.

Results and Analysis
Main Results. We show our main experimental results in Table 1. We can observe that: (1) ICL achieves promising results (-7.7, +0.6, -4.1, -11.1 micro-F1 compared with DE-PPN) on the four event types. Note that previous studies (Han et al., 2023; Wei et al., 2023) have shown that in-context learning performs poorly on other event extraction datasets. We suppose the performance is better on this dataset because financial disclosure documents are required to be organized in a highly homogeneous format. This result indicates the power of in-context learning. (2) Both GL-r and GL-a outperform ICL on the four event types by up to +2.9, +1.2, +3.3, and +4.1 micro-F1. Note that we only use extra trigger labels for 50 documents per class. (3) Though our three-step method and the summary agent can slightly improve the performance (GL-b vs. ICL), the main performance gain comes from the learned guidelines (GL-r vs. GL-b). (4) GL-a consistently outperforms GL-r by a small margin, which verifies the effectiveness of our active learning method. Note that DE-PPN is trained on 25,631 fully annotated examples, while our methods are trained on 200 examples in total with only trigger annotations.
Results on Human-Annotated Test Set. As labels constructed by distant supervision are noisy, we manually annotate the test set of Equity Underweight. The results on this test set are shown in Table 2. It shows that: (1) GL-r and GL-a improve over ICL by 1.7 and 2.5 F1 points, respectively. (2) ICL and GL-r/a outperform DE-PPN by over 10 micro-F1 points. This implies that, though provided with only a few manual labels, LLMs are more capable of aligning with human annotation than supervised methods trained on a large-scale weakly supervised dataset. (3) The supervised method DE-PPN performs much worse on multi-event documents than on single-event documents (53.4 vs. 79.3), while ICL-based methods are more robust (more discussion in Appendix A.1.4).

Results and Analysis
The results are shown in Table 3. We can observe that: (1) GL-r and GL-a outperform ICL by 3.1 and 4.2 F1 points, respectively. This verifies the effectiveness of applying our guideline learning framework to relation extraction. (2) The performance of ICL-based RE is still far behind SOTA methods (66.9 vs.

Quality Evaluation of Guidelines
We manually evaluate the quality of the learned guidelines. Specifically, for each task, we randomly sample guidelines from the best epoch and compute the accuracy, where we count a hit if the guideline is precise and unambiguous. The results are shown in Figure 4. For both GL-r and GL-a, which are provided with manual labels, the accuracy is above 90%. This indicates that LLMs can perform the generalizing task well when appropriately prompted. To investigate how label quality affects the quality of the generated guidelines, we conduct experiments (GL-r-ds) with the same setting as GL-r but with distantly supervised labels. The accuracy drops dramatically, by 17.2 points. The forgetting mechanism (w/ discard, detailed in Sec 2.3) helps to discard harmful guidelines, boosting the accuracy by 3.3 points, but it is still significantly lower than GL-r. This indicates the necessity of high label quality for generating high-quality guidelines.

Case Study of Guidelines
Note that we generate guidelines by first generalizing the input instance to its general form and then combining it with its gold label. This implementation can successfully generate helpful guidelines, while inevitably making some mistakes. We show some cases in Figure 5. We find that some helpful guidelines imply annotation rules from the annotation guidelines (e.g., He-4). Harmful guidelines mainly stem from inadequate generalization (e.g., Ha-1, Ha-3) and annotation errors (e.g., Ha-2). Besides, in extreme cases, the relation between two entities rests only on the literal meaning of the entities (e.g., Ha-4), which makes it hard to generate a general guideline.

Figure 5: Examples (translated) of helpful (He) and harmful (Ha) guidelines:
He-1. The shares sold through the trading system trigger an equity underweight event.
He-2. The freely tradable shares held before the reduction don't trigger an equity underweight event.
Ha-1. The shares bought and sold mistakenly triggers an equity underweight event.
Ha-2. The outstanding shares sold through the centralized bidding trading system don't trigger an equity underweight event.
Ha-3. "early in the Y (Product)" indicates that the relation between X and Y is MESSAGE AND TOPIC.
(NOT CONTAIN AND CONTAINER because of the motion verbs prevailing over "stative" relations criteria.)
Ha-4. "X (Food) Y (Food)" indicates that the relation between X and Y is ENTITY AND ORIGIN. (The original sentence: Homemade tomato soup is so much better than the shop bought versions.)

Comparison with DE-PPN in Data Scarcity Settings

We compare DE-PPN with our guideline learning approach (EE-GL) on the same test set. The F1 score of each event type is shown in Figure 6. We find that DE-PPN fails when provided with only 192 labeled documents, with very low F1 scores on all event types. The problem is alleviated when providing 5k labeled documents; DE-PPN relies on a large amount of annotated data to work well. This indicates the superiority of ICL approaches over data-hungry supervised approaches. Our guideline learning approach further improves the few-shot ICL approach (EE-ICL) on all event types.
Related Work

In-Context Information Extraction
Information extraction (IE) extracts structured knowledge of interest from unstructured text, including entities, relations between entities, event arguments, etc. Previous studies mainly focus on fine-tuning a task-specific model under supervision from large-scale datasets (Zhao et al., 2021; Zheng et al., 2019; Yang et al., 2021; Liang et al., 2022). Though achieving remarkable performance, these models heavily rely on high-quality manually annotated datasets and may fail in new scenarios.
On the other hand, Brown et al. (2020) show that in-context learning (ICL) with large language models (LLMs) can perform numerous tasks when provided a few examples in a natural language prompt. ICL is a highly promising new learning paradigm because it is tuning-free, user-friendly, and data-efficient. Many studies apply in-context learning to IE tasks. Wan et al. (2023) propose GPT-RE to bridge the gap between ICL and fine-tuning baselines for RE via two strategies: entity-aware demonstration retrieval and gold-label-induced reasoning. Chen et al. (2023) propose an in-context-learning-based NER approach that models PLMs as a meta-function, which can inject in-context NER ability into PLMs and recognize entities of new types on the fly using only a few demonstrative instances. However, though focusing on ICL, these methods still require training over large-scale datasets.
Recently, ChatGPT (OpenAI, 2023a) has stimulated a research boom in the field of LLMs. ChatGPT is among the most well-known and powerful LLMs so far, with impressive ICL and instruction-following abilities. Many studies (Han et al., 2023; Wei et al., 2023; Gao et al., 2023) evaluate ChatGPT's capability on IE tasks by direct prompting and find a huge performance gap between ChatGPT and SOTA results. They mainly focus on performance evaluation without in-depth investigation into boosting ICL ability for IE tasks.

Retrieval-augmented ICL
Many studies propose to retrieve relevant evidence from extra knowledge sources to enhance the performance of ICL. Demonstration retrieval aims to design more effective strategies for judiciously selecting in-context examples from a large training set. For example, Liu et al. (2022) apply kNN retrieval based on sentence-level representations. GPT-RE (Wan et al., 2023) further fine-tunes an entity-aware representation on the training set for better retrieval. However, similar to the supervised paradigm, these methods still rely on a large-scale annotated dataset. Some studies retrieve relevant information from an extra memory to assist ICL. Madaan et al. (2022) propose a memory-assisted framework that corrects errors via user interactions. They pair GPT-3 (Brown et al., 2020) with a growing memory of recorded cases and user feedback, which allows the system to produce enhanced prompts for any new query. However, their method heavily relies on the quality of user interaction. As they use simulated user feedback in experiments, its effectiveness and stability have not been verified in real-world cases.
Our approach utilizes a similar memory and retrieval mechanism. With a focus on IE, our framework can automatically learn high-quality guidelines from a few error cases, obviating the need for user feedback, which is more efficient and stable.

Guideline Learning differs from two main branches of previous work on instruction learning:
Instruction induction via ICL. Honovich et al. (2023) predict the task instruction by prompting instruction-tuned LLMs. They conduct exploratory experiments, focusing on tasks that have "clear and simple instructions". In contrast, our GL framework targets more complex instructions, with a highlight on IE tasks: the extraction of complex concepts. We propose the guideline as a bridge to automatically learn and utilize more specific instructions from error cases, which can be viewed as an in-depth extension of previous work.
Instruction learning for meta-training. Ye et al. (2023) propose to utilize instruction learning to better fine-tune LLMs and boost zero-shot performance. Our GL framework aims at boosting model performance in the tuning-free setting, which is orthogonal to their work.

Conclusion
This paper explores the underspecified task description problem in in-context information extraction. We propose a guideline learning framework to alleviate this problem: it automatically learns guidelines from a few labeled instances during the learning phase, and retrieves helpful guidelines to assist reasoning during inference. Our experiments on event and relation extraction show that a straightforward implementation of guideline learning can enhance vanilla in-context learning by approximately 4%.

Limitations
The guideline learning (GL) framework establishes a powerful and reproducible starting point for in-context learning research. However, our work still lacks depth in certain aspects, and many potential research directions within this framework warrant further investigation.

Broader applications. In this paper, we only apply GL to IE tasks to alleviate the underspecified task description problem. It would be encouraging to transfer GL to other tasks with complicated task specifications.

More specialized retriever. We implement an elementary retriever by utilizing OpenAI's embedding API. Though sufficient to verify the effectiveness of our framework, its performance is suboptimal. It is promising to build a more powerful retriever that specializes in retrieving relevant guidelines based on input cases.

More sophisticated generalizer. We generate guidelines by prompting an LLM agent to properly extrapolate each error case. The guidelines are mostly precise but still lack generality. It is possible to design a more sophisticated generalizer that summarizes a guideline from multiple similar error cases.

Enhance the rule-following capability of LLMs. One key capability required of the reasoner is to generate responses while faithfully following input rules. We observe that gpt-3.5-turbo, the backbone LLM agent in our experiments, still struggles to truly refer to relevant rules. We present a preliminary discussion in Appendix A.4. It would be intriguing to evaluate and enhance the rule-following ability of LLMs.

A.1.1 ChFinAnn Dataset
The ChFinAnn dataset (Zheng et al., 2019) is constructed from real-world Chinese financial documents via event-level distant supervision. It contains 32,040 documents in total, focusing on five event types: Equity Freeze (EF), Equity Repurchase (ER), Equity Underweight (EU), Equity Overweight (EO), and Equity Pledge (EP). We do not conduct experiments on EP events, as we suppose there exists event confusion: both equity pledge and release of pledge are labeled as EP events. As the official API (gpt-3.5-turbo-0301) has a maximum length of 4096 tokens, we only keep documents shorter than 1000 Chinese characters. We sample at most 200 documents for each event type from these documents. Table 4 presents the data statistics.
We calculate the ratio of negative triggers (i.e., candidate shares that refer to non-events) for each event type on our test set. The results are shown in Table 5. The ratio of negative triggers varies across event types, ranging from 23.1% to 63.9%. The simple trigger expression "{number} shares" we use for this dataset ensures high recall (every event record in this dataset involves such an expression); however, it also introduces unnecessary negative triggers, resulting in additional event trigger classification cost. This indicates that identifying and classifying triggers on this dataset is non-trivial. Note that our experiments are designed to validate the GL framework, with a focus on trigger classification; consequently, we do not place much emphasis on trigger identification. In practice, it is more efficient to design a powerful event trigger identifier beyond the simple pattern; for example, it is promising to prompt the LLM to identify candidate triggers with a few in-context demonstrations. We leave this as future work.

A.1.2 Prompts
For guideline learning, we conduct binary classification for each event type. We present the prompt for equity underweight events; only the demonstrations in the prompt differ across event types. We use 6-8 demonstrations for each LLM agent. We introduce our method in Section 4.1.1. Here we briefly recap the input and output of each LLM agent:
1. The summarizer takes a document and one share in it as input and outputs a summary of this share, which we call the share description. The prompt is presented in Figure 7.
2. The generalizer takes the instance (share description) as input and outputs its general form by abstracting common properties. The prompt is presented in Figure 8.
3. The reasoner takes the instance (share description) and the retrieved guidelines as input and outputs the reasoning process (CoT), the predicted answer, and the index of the used guideline. The prompt is presented in Figure 9.
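The three agents above can be chained as in the following sketch. All four callables are hypothetical stand-ins for prompted LLM (and retrieval) calls; only the data flow between agents is shown.

```python
def classify_trigger(document, trigger, guidelines,
                     summarize, generalize, retrieve, reason):
    """Chain the LLM agents used for event trigger classification.

    summarize(document, trigger) -> share description;
    generalize(description)      -> general form;
    retrieve(general, guidelines)-> relevant rules;
    reason(description, rules)   -> (predicted answer, used rule indices).
    All are hypothetical stand-ins for prompted calls.
    """
    description = summarize(document, trigger)   # 1. summarizer
    general = generalize(description)            # 2. generalizer
    rules = retrieve(general, guidelines)        # retrieval by general form
    return reason(description, rules)            # 3. reasoner (SC-CoT)
```

Note the asymmetry: retrieval operates on the general form, while the reasoner sees the concrete share description plus the retrieved rules.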
For EAE, we prompt the LLM to output the event table in markdown format. As the documents in this dataset are long, we only use 2 demonstrations in each prompt. EE-ICL and EE-GL use the same task instruction and demonstrations; the only difference is that EE-GL provides the candidate trigger shares identified by the ETC method. The prompts are presented in Figure 10 and Figure 11.

A.1.3 Hyper-parameters
Note that for the reasoner, we apply self-consistency chain-of-thought (SC-CoT) prompting (Wang et al., 2022b). We show the hyper-parameter settings in Table 6. For EAE, we use a very low temperature (0) to generate stable outputs.

A.2.1 Prompts
For guideline learning, we directly conduct multi-class relation classification. There are two main components: 1. The generalizer takes the instance (a sentence and one entity pair) as input and outputs its general form. This is decomposed into two steps: extracting the relevant text span and abstracting the entity types. The prompt is presented in Figure 12. The generalizer combines the two responses (the text span and the entity types) to obtain the final general form.
2. The reasoner takes the instance (a sentence and one entity pair) and the retrieved guidelines as input and outputs the reasoning process, the predicted answer, and the index of the used guideline. The prompt is presented in Figure 13.
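The combination step of the generalizer can be sketched as replacing each entity mention in the extracted span with its predicted type. The paper does not show this string-assembly step explicitly, so the function below is an assumption about how the two responses could be merged.

```python
def combine_general_form(span, entity_types):
    """Merge the extracted text span with the predicted entity types.

    span: relevant text span returned by the first generalizer step.
    entity_types: mapping from entity mention to its abstracted type,
        returned by the second step.
    """
    general = span
    for mention, etype in entity_types.items():
        general = general.replace(mention, etype)
    return general

combine_general_form(
    "the cup was moved into the kitchen",
    {"cup": "Product", "kitchen": "Location"},
)
# -> 'the Product was moved into the Location'
```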

EE - Generalizer
Please generate a general description of the given share description. This description focuses on the trading behavior involved in the shares. Please be careful to retain all descriptions related to the trading behavior and be as concise and general as possible.
To ensure generality, do not include specific company names, personal names, or dates.

Demonstration:
- Share description: 48574700 shares are the number of company shares held by Mr. Wang Yun before this reduction.
  Answer: The shares held before the reduction.
- Share description: 3670000 shares are the total number of shares that Daguan Investment has sold through the Shanghai Stock Exchange's bulk trading system during the implementation period of the reduction plan.
  Answer: The total shares reduced through the bulk trading system.
For RE-ICL, we apply chain-of-thought prompting. The prompt is presented in Figure 14. We use 10 demonstrations for the reasoner and for RE-ICL.

A.2.2 Hyper-parameters
Note that for the reasoner and RE-ICL, we apply Self-Consistency Chain-of-Thought (SC-CoT) prompting (Wang et al., 2022b). We show the hyper-parameter settings in Table 7.

A.3 Discussion on Generalizer
The intuition behind the generalizer is twofold. First, a guideline should generalize so that it covers similar cases. Second, in practice, the generalizer helps the guideline retrieval task, which is based on the input case. If the guidelines were composed of corrected error cases directly, retrieval would be case-to-case, which is very sensitive. For example, in the following quote block, the input case is literally more similar to G1. However, G2 is more relevant, as both describe an active underweight event. If we generate their general forms by abstracting away specific properties (company name, number of shares, date), the input case becomes more similar to G2.
Input case: Xinguang Investment actively reduced its shareholdings of this company by 300,000 shares.
G1: Xinguang Investment passively reduced its shareholding in the company by 300,000 shares. This does not trigger an EU event.
G2: Jinying Technology actively reduced its shareholdings of this company by 200,000 shares today. This triggers an EU event.
General form of input case: One company actively reduced its shareholdings of another company.
General form of G1: One company passively reduced its shareholdings of another company. This does not trigger an EU event.
General form of G2: One company actively reduced its shareholdings of another company. This triggers an EU event.
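The effect of retrieving over general forms can be illustrated with a simple token-overlap (Jaccard) similarity: the general form of the input case matches G2, the guideline describing the same trading behavior. This is only an illustration; the paper's actual retriever may use embedding similarity instead.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    tok = lambda s: set(s.lower().replace(",", "").replace(".", "").split())
    ta, tb = tok(a), tok(b)
    return len(ta & tb) / len(ta | tb)

def retrieve(query, guidelines):
    """Return the guideline most similar to the query."""
    return max(guidelines, key=lambda g: jaccard(query, g))

query = "One company actively reduced its shareholdings of another company."
g1 = ("One company passively reduced its shareholdings of another company. "
      "This does not trigger an EU event.")
g2 = ("One company actively reduced its shareholdings of another company. "
      "This triggers an EU event.")
retrieve(query, [g1, g2])  # selects g2, the active-underweight guideline
```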
In our experiments, we implement the generalizer by few-shot prompting an LLM agent. Although the generalizer is critical to the GL framework, we do not include the implementation details in Section 3, as there may be other implementations (for example, fine-tuning a more effective generalizer), and we want to highlight our contribution of the guideline learning framework itself.

A.4 Discussion on Rule-following Capabilities
We manually evaluate the following aspects of the responses from the reasoner: 1. relevant: whether the rules the model refers to are truly relevant to the instance; 2. well-referred: whether the model genuinely follows the rules, i.e., the response is consistent with the rules it refers to. We analyze 50 responses generated by gpt-3.5-turbo and gpt-4.

RE - Generalizer - 1
Find the text span in the quoted sentence that may indicate the relation between the two entities. Remove irrelevant words from the text span and make sure your answer contains only the text span.

Figure 1: An example of conceptual bias in the relation classification task (SemEval 2010 Task 8).

Algorithm 1: Guideline Learning
Input: number of epochs N_e, task description T, training set D = {(x_m, y_m)}_{m=1}^{N_d}
Output: guidelines G
1: Initialize G = ∅, cache = ∅
2: for e = 1 ... N_e do
3:   for (x, y) in D do
4:     ...
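The epoch loop of Algorithm 1 can be sketched in Python as follows: predict with the current guidelines, cache error cases, and turn the cache into new guidelines at the end of each epoch. `predict` and `reflect` stand in for the reasoner and the guideline-synthesis step; the paper's exact update rule is not reproduced here, so this is a hedged approximation.

```python
def guideline_learning(train_set, predict, reflect, n_epochs=1):
    """Sketch of the guideline-learning loop in Algorithm 1."""
    guidelines = []                      # G <- empty set
    for _ in range(n_epochs):
        cache = []                       # cache <- empty set
        for x, y in train_set:
            y_hat = predict(x, guidelines)
            if y_hat != y:               # collect error cases
                cache.append((x, y, y_hat))
        # After each epoch, synthesize guidelines from the cached errors.
        guidelines.extend(reflect(case) for case in cache)
    return guidelines
```

Used with stub functions, the loop collects one guideline per misclassified training instance per epoch.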

Figure 4: The manual evaluation results of the learned guidelines on the ChFinAnn-EU (EE) and SemEval (RE) datasets (50 randomly selected for each evaluation).

Figure 5: Case study of guidelines learned in the EE and RE tasks. We use colors for better illustration.
Figure 6:

Figure 7: The prompt (translated) of the summarizer for the underweight event (ChFinAnn dataset). {document} and {share} denote the input document and share, respectively.

Figure 8: The prompt (translated) of the generalizer for the equity underweight event (ChFinAnn dataset). {description} denotes the input share description.
You are an NLP expert. You are knowledgeable in taxonomy. Please tell me the category of an entity. Note that the category should be general and precise. For example, the following categories are good: Person, Location, Organization, Event, Product, Action, Time. Your answer should only contain one word or phrase. Sentence: {sentence} The category of {entities} is:

Figure 12: The prompt of the generalizer in RE. {sentence} and {entities} denote the input sentence and the entity pair, respectively.

Figure 13: The prompt of the reasoner in RE. {sentence}, {entities}, and {retrieved_guidelines} denote the input sentence, the entity pair, and the retrieved guidelines, respectively.

Table 2: Results for the Equity Underweight type on the single-event and multi-event sets (human-annotated labels).

Table 4: Dataset statistics: the number of documents in the original test set (# Test) and in the test set used in our experiments (# Our Test).

Table 5: Dataset statistics: the number of candidate triggers and negative triggers in our test set.