Boosting Event Extraction with Denoised Structure-to-Text Augmentation

Event extraction aims to recognize pre-defined event triggers and arguments from text, a task that suffers from the lack of high-quality annotations. In most NLP applications, incorporating large-scale synthetic training data is a practical and effective way to alleviate data scarcity. However, when applied to event extraction, recent data augmentation methods often neglect the problems of grammatical incorrectness, structure misalignment, and semantic drifting, leading to unsatisfactory performance. To solve these problems, we propose DAEE, a denoised structure-to-text augmentation framework for event extraction, which generates additional training data through a knowledge-based structure-to-text generation model and iteratively selects an effective subset of the generated data with a deep reinforcement learning agent. Experimental results on several datasets demonstrate that the proposed method generates more diverse text representations for event extraction and achieves results comparable with the state of the art.


Introduction
Event extraction is an essential yet challenging task for natural language understanding. Given a piece of text, event extraction systems discover the event mentions and then recognize event triggers and their event arguments according to a pre-defined event schema (Doddington et al., 2004; Ahn, 2006). As shown in Figure 1, the sentence "Capture of the airport by American and British troops in a facility that has been airlifting American troops to Baghdad." contains two events, a Movement:Transport event triggered by "airlifting" and a Transaction:Transfer-Ownership event triggered by "Capture".
In the Movement:Transport event, three event roles are involved, i.e., Artifact, Destination, and Origin, and their arguments are troops, airport, and Baghdad, respectively. As to the Transaction:Transfer-Ownership event, the event roles are Beneficiary, Origin, and Artifact. Accordingly, the arguments are troops, Baghdad, and airport.

[Figure 1: Augmented versions of the example sentence. Rule-based Aug.: "Capture of the American British troops airport by and in a facility that airlifting has American troops to Baghdad." Generative Aug.: "Troops in a facility that has been airlifting American military supplies to Baghdad. Capture of the airport by Iraqi forces." Text-aware Aug.: "Capture of the airport where American troops have a facility that has been airlifting Baghdad troops to Iraq." Event-aware Aug.: "Capture of the airport would give American and British troops a facility for airlifting equipment and troops to Baghdad."]

* Corresponding author.
Traditional event extraction methods regard the task as a trigger classification sub-task and several argument classification sub-tasks (Du and Cardie, 2020; Lin et al., 2020; Zhang and Ji, 2021; Nguyen et al., 2021, 2022a), while some recent research casts the task as a sequence generation problem (Paolini et al., 2021; Li et al., 2021; Hsu et al., 2022; Huang et al., 2023). Compared with classification-based methods, the latter line is more data-efficient and flexible. However, data containing event records are scarce, and performance is influenced by the amount of data, as shown by the results in Hsu et al. (2022).
As constructing large-scale labeled data is of great challenge, data augmentation plays an important role here to alleviate the data-deficiency problem. There are three main augmentation methods, i.e., rule-based methods (Wei and Zou, 2019b; Dai and Adel, 2020), generative methods (Wu et al., 2019; Kumar et al., 2020; Anaby-Tavor et al., 2020; Wei and Zou, 2019a; Ng et al., 2020), and text-aware methods (Ding et al., 2020). However, they have different drawbacks. 1) Grammatical Incorrectness. Rule-based methods expand the original training data using automatic heuristic rules, such as random synonym replacement, which effectively creates new training instances. As the Rule-based Aug. example in Figure 1 illustrates, these processes may distort the text, making the generated synthetic data grammatically incorrect. 2) Structure Misalignment. Triggers and arguments are key components of event records, for both the original data and the augmented data. Nonetheless, triggers and arguments may not always survive previous augmentation methods. As the Generative Aug. example in Figure 1 illustrates, even though the meaning of the generated sentence is quite similar to the original one, the important argument "airport" is missing. This may mislead the model and weaken its recognition of the Destination role.
3) Semantic Drifting. Another important aspect of data augmentation is semantic alignment: the generated text needs to express the original event content without semantic drifting. However, this problem is common in text-aware methods. As the example in Figure 1 illustrates, the generated sentence contains all the triggers and arguments, but Iraq, instead of Baghdad, is expressed as the Origin, which may confuse the model when recognizing the correct Origin role.
To solve the aforementioned problems when applying data augmentation to event extraction, we propose a denoised structure-to-text augmentation framework for event extraction (DAEE). For the structure misalignment problem, a knowledge-based structure-to-text generation model is proposed. It is equipped with an additional argument-aware loss to generate augmentation samples that exhibit the features of the target event. For the semantic drifting problem, we design a deep reinforcement learning (RL) agent. It distinguishes whether the generated text expresses the corresponding event based on the performance variation of the event extraction model. At the same time, the agent further guides the generation model to pay more attention to samples with the structure misalignment and grammatical incorrectness problems, and thus yields Event-aware Aug. text that both contains the important elements and conveys the appropriate semantics. Intuitively, our agent selects effective samples from the combination of generated text and its event information to maximize the reward based on the event extraction model.
The key contributions of this paper are threefold:
• We propose a denoised structure-to-text augmentation framework. It utilizes an RL agent to select the most effective subset from the augmented data, enhancing the quality of the generated data.
• Under the proposed framework, a knowledge-based structure-to-text generation model is proposed for the event extraction task, which generates high-quality training data containing the corresponding triggers and arguments.
• Experimental results on widely used benchmark datasets prove that the proposed method achieves superior performance over state-of-the-art event extraction methods on one dataset and comparable results on the others.
Related Work

Event Extraction
Many existing methods use classification-based models to extract events (Nguyen et al., 2016; Wang et al., 2019; Yang et al., 2019; Wadden et al., 2019; Liu et al., 2018), and some introduce global features to enhance joint inference (Lin et al., 2020; Li et al., 2013; Yang and Mitchell, 2016). With the large-scale use of PLMs, some researchers have focused on exploiting the generative capabilities of PLMs for event extraction, i.e., transforming it into a translation task (Paolini et al., 2021), generating with constrained decoding methods (Lu et al., 2021), and template-based conditional generation (Li et al., 2021; Hsu et al., 2022; Liu et al., 2022; Du et al., 2022). Compared with the above methods, which directly use the limited training set, we use a denoised structure-to-text augmentation method to alleviate the problem of insufficient data.

Data Augmentation
Rather than starting from an existing example and modifying it, some model-based data augmentation approaches directly estimate a generative process, e.g., by masking randomly chosen words from the training set, and sample new synthetic data from it (Anaby-Tavor et al., 2020; Hou et al., 2018; Xia et al., 2019; Wu et al., 2019; Kumar et al., 2020). Other research designs prompts (Wang et al., 2021, 2022) or uses conditional generation (Ding et al., 2020) for data augmentation. However, the above methods are mainly applied to generation tasks or comprehension tasks with simpler goals, such as text classification. When faced with complex structured extraction tasks, post-processing screening becomes a cumbersome problem. Inspired by RL, we use a policy model to automatically sift the generated data for valid and semantically consistent samples.

Method
In this paper, we focus on generating an additional training set from structured event records for augmentation. Previous augmentation methods usually suffer from the structure misalignment, grammatical incorrectness, and semantic drifting problems mentioned in the introduction. Instead, we introduce a policy-based RL strategy to select intact augmentation sentences.

Task Definition
In the generation-based event extraction task, the extraction process is divided into several sub-tasks according to the event types E. For each event type e ∈ E, the purpose of the event extraction model is to generate Y_e according to the predefined prompt P_e and context C, where Y_e is the answered prompt containing the extracted event records. In addition to the original data T_o, we use a policy model as the RL agent to select the effective subset P_i from the generated data G_i in the i-th epoch, thus improving data efficiency by filtering the generated samples.

Framework
Our proposed denoised structure-to-text augmentation framework is mainly composed of the event extraction model, the structure-to-text generation model, and the policy model. As the policy-based RL process in Figure 2 shows, the event records are first fed into the structure-to-text generation model to obtain additional training data. The generated data are then filtered according to the actions selected by the policy-based agent, yielding the denoised augmentation training data for the event extraction model. We use the filtered training data to retrain the event extraction model, and the improvement of the F1 score is regarded as the reward for retraining the policy model. The guidance of the event extraction model further helps the policy model select efficient samples. Finally, the generation model is retrained on the weighted training data, where the weight is the removal-action probability calculated by the retrained policy model. The retrained generation model produces sentences of superior quality and consequently helps the other components. The components of our proposed method are described in the following.

Reinforcement Learning Components
The definitions of the fundamental components are introduced in the following. The States include the information from the current sentence and the corresponding gold event records. These two parts are both converted into sentence vectors through PLMs for the decision of the action. We update the states after regenerating the text guided by the previous action probabilities. At each iteration, the Actions decided by the policy model are whether to remove or retain each generated instance, according to whether the generated sentence expresses the corresponding event records. We use the improvement of the F1 score as the Reward for the actions decided by the policy model. Specifically, the F1 score of argument classification F_i at the i-th epoch on the development set is adopted as the performance evaluation criterion. Thus, the reward R_i can be formulated as the scaled difference between adjacent epochs:

R_i = α (F_i − F_{i−1}),

where α is a scaling factor that converts the reward into a numeric result for the RL agent.
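As a minimal sketch (the function name and signature are our own illustration), the reward is just the scaled F1 difference between adjacent retraining epochs:

```python
def compute_reward(f1_curr, f1_prev, alpha=10.0):
    """Scaled difference of argument-classification F1 between adjacent
    retraining epochs; alpha is the scaling factor (set to 10 in the
    appendix of this paper)."""
    return alpha * (f1_curr - f1_prev)
```

A negative reward thus signals that the selected subset hurt the extraction model, pushing the policy toward different removal decisions.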

[Figure 3: Example of structured information representations and structure-to-text generation, showing the entity & relation description, the event description, the masked background, and the generated sentence "Capture of the airport would give American and British troops a facility for airlifting equipment and troops to Baghdad."]

Event Extraction Model
Following Li et al. (2021), we reuse the predefined argument templates: the prompt P_e contains the type instruction and the template, and the event records are parsed by template matching and slot mapping according to their own event description templates.
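The template matching and slot mapping step can be sketched as follows; the helper name, the placeholder scheme, and the regex-based alignment are our own illustration, not the paper's implementation:

```python
import re

def parse_filled_template(template, filled, placeholders):
    """Turn a role template into a regex whose named groups capture each
    argument slot, then match it against the generated (filled) prompt."""
    pattern = re.escape(template)
    for ph in placeholders:
        # replace each placeholder with a lazy capture group of the same name
        pattern = pattern.replace(re.escape(ph), "(?P<%s>.+?)" % ph)
    m = re.fullmatch(pattern, filled)
    return {ph: m.group(ph) for ph in placeholders} if m else None
```

For example, matching the filled prompt "troops transported supplies to Baghdad" against the template "AGENT transported ARTIFACT to DEST" recovers each role's argument span.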

Structure-to-text Generation Model
As the structure-to-text generation model, T5 (Raffel et al., 2020) is used because of its outstanding generation performance. Similar to its original setting, we define the task as a sequence transformation task by adding the prefix "translate knowledge into sentence" at the beginning as P_g to guide the generation model. It is difficult to directly generate text from structured event records with limited training data, so we randomly mask the original sentence with the special token [M] to produce the masked sentence C′, where λ is the mask rate. C′ is used as the background in the input X_g of the generation model. As shown in Figure 3, the structured information annotated in the training set is transformed into the event description D_g and the relation description R_g, respectively. They are further used as background knowledge to assist the structure-to-text generation, and the original sentence C is regarded as the generation target Y_g. Given the previously generated tokens y_{<s} and the input X_g, the entire probability p(Y_g | X_g) is calculated as:

p(Y_g | X_g) = ∏_{s=1}^{|Y_g|} p(y_s | y_{<s}, X_g).

In addition, an argument-aware loss L_a is added to make the model pay more attention to the event arguments during the generation process. For all event arguments that have not been generated, we search for the text spans in the generated text most similar to the remaining event arguments. In detail, we aggregate the triggers and arguments that are not included in the generated text. These triggers and arguments are transformed into a one-hot embedding set A, with each element denoted as a_m ∈ A. The probability of selecting the token at each position in the generation model is extracted for matching the optimal related position. By setting the window size to the number of words in a_m, we divide the probability sequence into pieces using a sliding window and obtain the candidate set K_m for each a_m in A.
We first calculate the L1 distance between a_m and each element in K_m as the distance score between them. Then, all distance scores are pooled together after completely traversing A. To avoid conflicts between matching positions, a greedy search is finally utilized to match each element in A to the position with the lowest distance score. Together with the original language model loss function L_lm, the loss function of the generation model L_g is defined as:

L_g = L_lm + γ L_a,   L_a = − (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T} Σ_{k=k_t}^{k′_t} log p(y_k | y_{<k}, X_g),

where N is the number of instances, T is the number of elements contained in the current unmatched set, k_t and k′_t denote the start and end positions of the t-th unmatched element in the original sentence, and y_k is the k-th corresponding trigger or argument word.
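The sliding-window matching and greedy assignment can be sketched as follows; this is a simplified stand-in (one-hot encodings over a toy vocabulary, plain L1 distance), not the paper's exact implementation:

```python
import numpy as np

def greedy_argument_match(prob_seq, missing_args, vocab):
    """For each missing trigger/argument (a list of words), slide a window of
    its length over the token-probability sequence, score each candidate by
    L1 distance to the one-hot encoding, pool all (distance, arg, position)
    triples, and greedily assign non-conflicting positions, lowest distance
    first. Returns {arg_index: start_position}."""
    scored = []
    for a_idx, words in enumerate(missing_args):
        onehot = np.zeros((len(words), len(vocab)))
        for i, w in enumerate(words):
            onehot[i, vocab[w]] = 1.0
        win = len(words)
        for start in range(len(prob_seq) - win + 1):
            piece = prob_seq[start:start + win]
            dist = np.abs(piece - onehot).sum()  # L1 distance
            scored.append((dist, a_idx, start))
    scored.sort()  # lowest distance first
    assigned, used = {}, set()
    for dist, a_idx, start in scored:
        span = set(range(start, start + len(missing_args[a_idx])))
        if a_idx not in assigned and not (span & used):
            assigned[a_idx] = start
            used |= span
    return assigned
```

Each argument thus claims the position whose predicted token distribution is closest to its words, and the greedy pass prevents two arguments from claiming overlapping spans.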

Policy Model
For each input sentence, our policy model is required to determine whether it expresses the target event records. Thus, the policy model takes a removal action if the sentence is irrelevant to the target event records, which makes it analogous to a binary classifier. For each generated sentence G ∈ G_i, the input of the policy model X_p consists of G and the corresponding event description D_g, formulated as X_p = [CLS] G [SEP] D_g [SEP]. We fine-tune the BERT model by feeding the [CLS] vector into an MLP layer, and a softmax function is then utilized to calculate the decision probability of retaining the sample G. A binary cross-entropy loss function is introduced for this classifier:

L_p = − (1/N) Σ_{n=1}^{N} [ y_n log p(y_n | X_p) + (1 − y_n) log (1 − p(y_n | X_p)) ],

where y_n is the gold action for the n-th sample and N is the number of instances.
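The classifier loss can be sketched as plain binary cross-entropy over the retain-probabilities (the BERT [CLS] → MLP → softmax encoder is omitted here; the function name is our own):

```python
import math

def policy_bce_loss(retain_probs, gold_actions):
    """Binary cross-entropy for the retain/remove decision, where
    gold_actions[n] is 1 if the n-th generated sample should be retained
    and retain_probs[n] is the policy model's probability of retaining it."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(retain_probs, gold_actions):
        p = min(max(p, eps), 1 - eps)  # clamp for numerical stability
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(retain_probs)
```
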

Pre-training
The three components, i.e., the event extraction model, the structure-to-text generation model, and the policy model, are pre-trained with different strategies. Since the policy model has no task-specific information at the very beginning, the generation model is trained for several epochs first to establish the training set for the policy model. We train the generation model until more than 70% of the triggers and arguments can be generated. The generated sentences containing their corresponding triggers and arguments are considered positive samples for the policy model, while the others are treated as negative samples. To balance the positive and negative samples, we also randomly select some event descriptions and sentences irrelevant to those event descriptions as negative samples. We early-stop training the policy model when its precision reaches 80%∼90%. This preserves the information entropy of the results predicted by the policy model and extends the exploration space. Then we continue to pre-train the generation model and the event extraction model with the original training set for a fixed number of epochs. These two pre-trained models are used as the initialized generation model and extraction model in the retraining process, respectively.
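The 70% stopping criterion for generator pre-training can be sketched as a simple coverage check (verbatim containment here is our simplifying assumption):

```python
def generation_coverage(examples):
    """Fraction of gold triggers/arguments that appear verbatim in the
    generated sentences. `examples` is a list of
    (generated_sentence, [gold trigger/argument strings]) pairs; generator
    pre-training continues until this exceeds 0.7."""
    hit = total = 0
    for generated, gold_spans in examples:
        for span in gold_spans:
            total += 1
            hit += span in generated
    return hit / total if total else 0.0
```

The same check also labels the policy model's pre-training data: sentences covering all their triggers and arguments become positive samples, the rest negative.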

Retraining with Rewards
For the i-th epoch of retraining the agent, the policy model selects an action for each element in the generated dataset G_i. According to the actions, G_i is divided into a negative sample set N_i and a positive sample set P_i. Then we sample a subset from the original training data T_o, which is mixed with P_i to form the reconstructed training set T_i used to retrain the event extraction model. Besides the improvement of the argument F1 score, growth in the trigger F1 score is also beneficial for the model. Therefore, we update the checkpoint when either the trigger or the argument F1 score improves, to avoid falling into a local optimum. Following Qin et al. (2018), we employ two sample sets for training the policy model. Since we cannot explore all directions to obtain the maximum reward in a single step, we select a constant number of samples from D_{i−1} and D_i for training, denoted D′_{i−1} and D′_i, respectively. Referring to Equation (6), the retraining loss function of our policy model L′_p is the reward-weighted cross-entropy over these two sets. The probability of a sample being considered invalid is taken as the weight for retraining the corresponding instance in the generation model. So we use the removal probability w_n = 1 − log p(y_n | X_p) as the sample weight and retrain the generation model with the following retraining loss function L′_g, referring to Equation (3):

L′_g = (1/N) Σ_{n=1}^{N} w_n (L^n_lm + γ L^n_a),

where L^n_lm and L^n_a are the language model loss and the argument-aware loss for the n-th sample, respectively. The details of the retraining algorithm are shown in Appendix A.

[Table 1: Results on ACE05-E+. We report the average result of eight runs with different random seeds as "a ±b", where "a" and "b" represent the mean and the variance, respectively. We bold the highest scores and underline the second highest scores.]
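The generator's retraining objective can be sketched as a per-sample weighted sum; γ = 0.1 follows the appendix, while the mean reduction and function name are our assumptions:

```python
import math

def weighted_generation_loss(lm_losses, arg_losses, keep_probs, gamma=0.1):
    """Each sample's loss (L_lm + gamma * L_a) is scaled by the weight
    w_n = 1 - log p(y_n | X_p), so samples the policy model is less confident
    about keeping receive a larger generator-retraining weight."""
    n = len(lm_losses)
    return sum((1.0 - math.log(p)) * (lm + gamma * a)
               for lm, a, p in zip(lm_losses, arg_losses, keep_probs)) / n
```
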
Following previous work (Zhang et al., 2019; Wadden et al., 2019), we use precision (P), recall (R), and F1 scores to evaluate performance. More specifically, we report performance on both trigger classification (Trig-C) and argument classification (Arg-C). In trigger classification, a sample is counted as correct if both the event type and the offset of the trigger are correctly identified. Similarly, correct argument classification means correctly identifying the event type, the role type, and the offset of the argument. Following Lu et al. (2021) and Liu et al. (2022), the offsets of extracted triggers are decoded by string matching in the input context one by one. For a predicted argument, the nearest matched string is used as the predicted trigger for offset comparison.

[Table 3: Results on ACE05-E. The first group contains the classification-based methods and the second group the generation-based methods.]
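The offset decoding just described (string matching plus a nearest-match rule) can be sketched as follows; the helper names and the character-level distance are our simplification:

```python
def find_occurrences(text, span):
    """All character offsets at which `span` occurs in `text`."""
    offsets, i = [], text.find(span)
    while i != -1:
        offsets.append(i)
        i = text.find(span, i + 1)
    return offsets

def nearest_offset(text, span, anchor):
    """Offset of the occurrence of `span` closest to character position
    `anchor`, or -1 if the span never occurs. This mimics the nearest-match
    rule used for argument offset comparison."""
    occs = find_occurrences(text, span)
    return min(occs, key=lambda o: abs(o - anchor)) if occs else -1
```
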

Baselines
We compare the event extraction results of our proposed DAEE with baselines in two categories, i.e., classification-based models and generation-based models. The first category is classification-based models, DYGIE++ (

Main Results
The performance comparison on the dataset ACE05-E+ is shown in Table 1. It can be observed that DAEE achieves the SOTA F1 scores on ACE05-E+, obtaining 1.1% and 0.7% gains of the F1 scores for Trg-C and Arg-C, respectively. The improvement indicates that DAEE is able to guide the generation model to generate text containing events and to select suitable samples, improving the effectiveness of the event extraction model. Table 2 presents the performance of the baselines and DAEE on ERE-EN. The performance of DAEE decreases compared with GTEE-DYNPREF but is still higher than that of the other methods, which may be because ERE-EN contains more pronoun arguments. Pronoun roles offer less information to the generation model, reducing the role of structured text in guiding generation.

[Table 4: Ablation study on ACE05-E+ for event extraction. AL denotes the argument-aware loss L_a, RG denotes the process of retraining the generation model, and RL denotes the reinforcement learning strategy.]
Comparing the results on ACE05-E shown in Table 3, we gain an improvement of 1.1% on Trg-C and a competitive F1 score on Arg-C with respect to the SOTA classification-based method ONEIE, outperforming the others. This observation supports that the structured information used in the knowledge-based generation model makes up for the information gap in multi-task extraction.

Ablation Study
We further conduct an ablation study by removing each module in turn. The experimental results on ACE05-E+ are presented in Table 4. We can see that the F1 score of Arg-C decreases by 0.4% and 0.8% when removing the argument-aware loss L_a and when stopping the retraining of the generation model, respectively. The results indicate that the argument-aware loss and the retraining strategy are conducive to the generation module in our framework. Then we remove the RL strategy, which means the generated samples are directly mixed with the original training samples to train the event extraction model from scratch. The F1 scores of Trg-C and Arg-C decrease by 1.6% and 1.0%, respectively. This demonstrates that the RL strategy ensures the generated data are more suitable for the downstream event extraction task and guides the improvement on both Trg-C and Arg-C.

Iterative Generation Discussion
To illustrate that our framework is able to enhance the quality of the generated sentences, we calculate the masked-language-model-based pseudo-log-likelihood scores (PLLs) following Salazar et al. (2020) for each training epoch. Each token w_s in a sentence is masked and predicted using all past and future tokens W_{\s} := (w_1, …, w_{s−1}, w_{s+1}, …, w_{|W|}), and the PLL of a sentence W is calculated as

PLL(W) = Σ_{s=1}^{|W|} log P(w_s | W_{\s}).

The result for each epoch is the average of the sentence scores over the entire training set, as shown in Figure 4. The PLL declines with the iterative process, which demonstrates that DAEE enhances the fluency of the generated data and improves the effect of event extraction under the guidance of the RL agent. Furthermore, we compare DAEE with a rule-based sequence labeling data augmentation method, SDANER (Dai and Adel, 2020). SDANER contains four rule-based augmentation methods; synonym replacement is selected for its lowest average PLL. DAEE generates sentences with lower PLLs than the rule-based method, demonstrating that DAEE generates more fluent and grammatically correct data.
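PLL computation can be sketched with a stand-in conditional model (the paper scores with a pretrained masked LM; the callable interface below is our own simplification, not BERT):

```python
import math

def pseudo_log_likelihood(tokens, cond_prob):
    """Mask each position s in turn and sum the log-probability of the
    held-out token given all remaining tokens. `cond_prob(context, s, token)`
    stands in for the masked LM's conditional distribution."""
    total = 0.0
    for s, tok in enumerate(tokens):
        context = tokens[:s] + ["[MASK]"] + tokens[s + 1:]
        total += math.log(cond_prob(context, s, tok))
    return total
```

With a real masked LM, `cond_prob` would run a forward pass per masked position and read off the softmax probability of the gold token.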

Argument Loss Analysis
To verify the effectiveness of the argument-aware loss L_a in reducing mismatched triggers and arguments, we alter the hyperparameter γ and explore the change in the number of unmatched arguments. The number of unmatched arguments converges to around 30 after adding L_a, while it converges to around 120 without L_a.

Diversity Analysis
Intuitively, diverse sentence descriptions in the training set can enhance model performance. We therefore verify the diversity of the generated text. The degree of diversity is measured by the number of distinct bigrams and trigrams in the generated text that do not appear in the original text, and the results are shown in Table 6. In the following, we use GENERATION MODEL to denote the directly trained structure-to-text generation model. Referring to the indicators proposed in Li et al. (2016), the argument-aware loss L_a helps the GENERATION MODEL produce more diverse synthetic data, because the argument-aware loss makes the model focus more on retaining the triggers and arguments rather than on generating content similar to the original text. The diversity is reduced by the RL strategy due to its concentration on the effect of event extraction. Compared horizontally with Table 4, the experimental results demonstrate that diversified text enables the model to obtain more information based on similar event records.

Table 5 shows representative examples generated by our proposed DAEE and the other methods, from which we observe the following comparative phenomena. When comparing whether to add the argument-aware loss, the GENERATION MODEL generates all the triggers and arguments in only three examples, which demonstrates that the generation model without L_a suffers from the text leaking problem. There is a misalignment in the first example of the text generated by the GENERATION MODEL: the original sentence contains two roles, i.e., Artifact and Buyer, with the arguments we and partner, but the two arguments are swapped in the synthetic text. In the second example, according to the output of the GENERATION MODEL, the government plays the role of Agent in the Life:Die event, which does not appear in the gold event record and results in redundancy.
Neither of the above errors occurs in the output of DAEE shown in the table, which proves that the RL strategy can also guide the generation model toward improved effectiveness.
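The distinct-n-gram diversity measure reported in Table 6 can be sketched as a set difference over n-grams (whitespace tokenization is our simplification):

```python
def novel_ngrams(generated, original, n):
    """Number of distinct n-grams in `generated` that never occur in
    `original` -- the diversity measure of the analysis above."""
    def ngram_set(text):
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return len(ngram_set(generated) - ngram_set(original))
```
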

Conclusion
In this paper, we presented DAEE, a denoised structure-to-text augmentation framework for event extraction. The structure-to-text generation model with argument-aware loss is guided by the reinforcement learning agent to learn task-specific information. Meanwhile, the reinforcement learning agent selects effective samples from the generated training data to reinforce event extraction performance. Experimental results show that our model achieves results competitive with the SOTA on ACE 2005 and that the proposed approach is an effective generative data augmentation method for complex structure extraction.

Limitation
This paper proposes a denoised structure-to-text augmentation framework for event extraction (DAEE), which generates and selects additional training data iteratively through an RL framework. However, it still has the following limitations.
• The framework uses reinforcement learning to select effective samples, which involves iteratively training and running the generation model, the policy model, and the event extraction model. This iterative framework is complicated and time-consuming compared with a standalone event extraction model.
• Even though the argument-aware loss decreases the number of unmatched arguments in a generated sentence, the generation model generates more fluent sentences at the expense of the ability to ensure that all the event arguments are included completely.

Acknowledgement
This work was supported by the Joint Funds of the National Natural Science Foundation of China (Grant No. U19B2020). We would like to thank the anonymous reviewers for their thoughtful and constructive comments.

A Details of Methods
The detail of the retraining algorithm is shown in Algorithm 1.

B.1 Data Statistics
In this paper, we use three datasets to verify the proposed method; the statistics of the datasets are shown in Table 7.

B.2 Implementation Details
All experiments were conducted on an NVIDIA A100 Tensor Core GPU (40GB). For the pre-trained language models, we reuse the three English models released by Huggingface². Specifically, γ and β are set to 0.1 and 0.9 in Equation (2), respectively; the RL training epoch is set to 80; the reward scale α is set to 10; the sample ratio from the original event extraction training set is set to 0.5; the negative sample ratio for GTEE-BASE in training is set to 12% for event extraction; and the other hyperparameters are shown in Table 8.

B.3 Generation Reliability Discussion
To verify the reliability of the generated data, we train GTEE-BASE on the samples with event records, because only the samples with event records are used for data augmentation. The results are shown in Table 9. The F1 score of the model trained on DD increases by 1.1% and 2.5% compared with the results trained on OD and GD, respectively. The data generated by DAEE achieve an effect closer to that of the original data and thus can be utilized to train competitive event extraction models.

² https://huggingface.co/t5-base, https://huggingface.co/bert-base-uncased, https://huggingface.co/facebook/bart-large
Algorithm 1 The process of retraining the reinforcement learning framework.
Parameters: the original event extraction training set T_o; parameters of the policy model θ_p, the event extraction model θ_e, and the generation model θ_g; the generated sentence set, with G_n the n-th generated sentence; the positive sample set P_i; the negative sample set N_i.
1: Initialize the trigger F1 score F_t^max and the role F1 score F_a^max through θ_e
2: for epoch i in 1 → K do
   …
   Retrain the policy model through D_i and D_{i−1} according to Equation (6)
   Update the training weight w_n = 1 − log p(y_n | X_p) for each sample in Y_g
   Retrain the generation model through the weighted Y_g according to Equation (3)
   Update θ_g and generate G_i
end for

ACL 2023 Responsible NLP Checklist

A For every submission:

A1. Did you describe the limitations of your work?
We report it in Section 6.
A2. Did you discuss any potential risks of your work?
Event extraction is a standard task in NLP and we do not see any significant ethical concerns. We evaluate the event extraction task with conventional metrics. As the extraction evaluation is our main focus, we do not anticipate the production of harmful outputs on our proposed task.
A3. Do the abstract and introduction summarize the paper's main claims?
We report it in Section 1.
A4. Have you used AI writing assistants when working on this paper?
Left blank.

B Did you use or create scientific artifacts?
We report it in Section 4.

B1. Did you cite the creators of artifacts you used?
We report it in Section 4.
B2. Did you discuss the license or terms for use and / or distribution of any artifacts?
The first institution of our paper is the Beijing Institute of Technology, which has obtained the authorization of the LDC User Agreement for ACE 2005 and Rich-ERE data. The code (https://github.com/huggingface/trans we used is licensed under Apache License 2.0.

B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?
Our work is free for public research purposes.
B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it?
The dataset used in this paper was annotated by its authors from publicly available text.

B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.?
No response.

B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created? Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be.
We report it in Appendix A.
The Responsible NLP Checklist used at ACL 2023 is adopted from NAACL 2022, with the addition of a question on AI writing assistance.