A Match Made in Heaven: A Multi-task Framework for Hyperbole and Metaphor Detection

Hyperbole and metaphor are common in day-to-day communication (e.g., "I am in deep trouble": how does trouble have depth?), which makes their detection important, especially in a conversational AI setting. Existing approaches to automatically detect metaphor and hyperbole have studied these language phenomena independently, but their relationship has hardly, if ever, been explored computationally. In this paper, we propose a multi-task deep learning framework to detect hyperbole and metaphor simultaneously. We hypothesize that metaphors help in hyperbole detection, and vice versa. To test this hypothesis, we annotate two hyperbole datasets, HYPO and HYPO-L, with metaphor labels, and two metaphor datasets, TroFi and LCC, with hyperbole labels. Experiments on these datasets improve the state of the art in hyperbole detection by 12%. Additionally, our multi-task learning (MTL) approach shows an improvement of up to 17% over single-task learning (STL) for both hyperbole and metaphor detection, supporting our hypothesis. To the best of our knowledge, ours is the first work to computationally leverage the linguistic intimacy between metaphor and hyperbole, demonstrating the superiority of MTL over STL for hyperbole and metaphor detection.


Introduction
The use of figurative language is very common in natural discourse, and it is reflected in the content generated on social media networks (Abulaish et al., 2020). Figurative language serves communicative goals such as expressing a negative emotion, drawing attention to a part of the text, or adding interest to a subject (Roberts and Kreuz, 1994). Understanding figurative language such as sarcasm, metaphor, simile, irony, and hyperbole is crucial for many NLP tasks, such as building accurate sentiment analysis systems or developing conversational AI systems that can hold meaningful conversations (Figure 1). This has led to great interest in understanding these figurative devices. Figurative languages like metaphor (Rai and Chakraverty, 2020) and sarcasm (Joshi et al., 2017) have been studied extensively, while hyperbole remains less explored.

* Equal contribution.
Code and data are available at: https://github.com/abisekrk/multitask_hyperbole_metaphor_detection
Metaphor is the most common choice of figurative language, while hyperbole is the second most adopted rhetorical device in communication (Roger J., 1996); hence, it is important to study and process both automatically. Hyperbole is a figurative device that uses exaggeration to emphasize a point, while metaphor makes a comparison between two things to indicate a resemblance.

Motivation
Relevance theorists have long treated metaphor and hyperbole as not genuinely distinct categories, as they are very closely related to each other (Sperber and Wilson, 2008). Recent research has highlighted the distinctive features of hyperboles over metaphors (Carston and Wearing). Both metaphors and hyperboles use figurative elements to express an idea rather than presenting it literally, but this linguistic insight has not been exploited computationally in previous work. We hypothesize that this shared characteristic can be captured at the embedding level by training transformer models to learn these representations jointly using multi-task learning. Existing metaphor detection systems focus on identifying metaphoricity at the token level, whereas hyperbole detection systems focus on sentence-level classification. In our work, we highlight the effectiveness of performing sentence-level classification for both hyperboles and metaphors in a multi-task setting.

Contributions
Our contributions are:
1. Extensions to the existing datasets amounting to 16,024 sentences, which include:
(a) the HYPO and HYPO-L datasets annotated with metaphor labels;
(b) the TroFi and LCC datasets annotated with hyperbole labels.
2. Demonstration of the superiority of multitasking over single-tasking for hyperbole and metaphor detection.

Background and Definitions
Metaphor Metaphor is a literary device that uses an implicit comparison to drive home a new meaning. Metaphors consist of a source and target domain in which the features from the source domain are related to the features in the target domain through comparable properties (Lakoff, 1993). For instance, "Life is a journey," implies a comparison between life and journey through the idea of having a beginning and an end. In this work, we do not consider similes as metaphors as they make an explicit comparison.
Hyperbole Hyperbole is a figurative language in which the literal meaning is exaggerated intentionally. It exaggerates expressions and blows them up beyond the point they are perceived naturally with the objective of emphasizing them (Claridge, 2010). For example, "I'm tired, I can't lift my hand," exaggerates the speaker's exhaustion. Figure 3 shows examples of metaphor and hyperbole.

Related Work
Metaphors and hyperboles are the most used figures of speech in everyday utterances (Roger J., 1996). In recent years, significant efforts have been made to understand metaphors and hyperboles, giving rise to interesting techniques to automatically detect and generate them. Troiano et al. (2018) introduced hyperbole detection as a binary classification task, using traditional machine learning algorithms. They also released a dataset named 'HYPO' for hyperbole detection. Kong et al. (2020) introduced 'HYPO-cn', a Chinese dataset for hyperbole detection, and showed that deep learning models can perform better at hyperbole detection with increased data. Biddle et al. (2021) used a BERT (Devlin et al., 2018) based detection system that used the literal counterparts of hyperbolic sentences to identify the hyperbolic and non-hyperbolic use of words and phrases. They also released a test suite for evaluating models. Tian et al. (2021) proposed a hyperbole generation task. Zhang and Wan (2022) introduced an unsupervised approach for generating hyperbolic sentences from literal sentences and introduced two new datasets, 'HYPO-XL' and 'HYPO-L', for their experiments. Metaphors have been extensively studied even before hyperbole detection was introduced. Tsvetkov et al. (2014) introduced the TSV dataset with 884 metaphorical and non-metaphorical adjective-noun (AN) phrases. They showed that conceptual mapping learnt between literal and metaphorical words is transferable across languages. Mohler et al. (2016) introduced the LCC dataset, which contains sentence-level annotations for metaphors in four languages totaling 188,741 instances. Steen (2010) studied metaphor at the word level and was the first to include function words for metaphor detection with the new VUA dataset. Birke and Sarkar (2006) introduced the TroFi dataset that consists of verbs in their literal and metaphoric form.
In recent years, metaphor detection has been explored with the aid of large language models. Choi et al. (2021) used the contextual embeddings from BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) to classify metaphorical sentences. Aghazadeh et al. (2022) probed and analyzed the metaphorical knowledge gained by large language models by testing them on metaphor datasets across languages.
Previous research on metaphor and hyperbole detection typically treats these figurative language forms separately, despite their common properties. In this work, we propose a multi-task approach that simultaneously detects both hyperboles and metaphors, and demonstrate that this approach outperforms individual detection tasks with experimental results and detailed analysis.

Task Formulation
For a sentence x and a corresponding label y, or labels y_1, ..., y_k (k > 1), we can mathematically formulate the different learning tasks as follows, where E is the encoder, f is the feed-forward neural network (classification head), θ denotes the weights of both E and f, and ρ is the softmax function.

Single-Task Learning (STL):

ŷ = argmax_{y ∈ {0,1}} P(y | x; θ), where P(y | x; θ) = ρ(f(E(x)))

The cross-entropy loss function is:

L = -(1/D) Σ_{i=1}^{D} [y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i)]

where D is the number of training samples, and y_i and ŷ_i are the i-th true and predicted labels.
Multi-Task Learning with shared Encoder (MTL-E):

ŷ_k = argmax_{y_k ∈ {0,1}} P(y_k | x; θ_k)

where k represents the number of tasks, y_k are the labels, f_k are the task-specific feed-forward neural networks, and θ_k are the weights for the k tasks respectively. For k = 2, the loss function can be written as:

L = λ L_1 + (1 - λ) L_2

where L_1 and L_2 are task-specific losses calculated similar to Eq. 3 and λ is the weighting factor.
Multi-Task Learning with Fully shared layers (MTL-F):

(ŷ_1, ŷ_2) = argmax_{y_1, y_2 ∈ {0,1}} P(y_1, y_2 | x; θ)

Here, the loss is a binary cross-entropy:

L = -(1/D) Σ_{i=1}^{D} Σ_{j=1}^{m} [y_ij log σ(l_ij) + (1 - y_ij) log(1 - σ(l_ij))]

where σ is the sigmoid function, m is the number of labels, and l_ij is the logit value for the i-th instance and the j-th label.
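As an illustration, the shared-encoder objective above can be sketched in a few lines of PyTorch. This is a minimal sketch, not the authors' implementation: a toy embedding encoder stands in for the BERT/RoBERTa encoder, and all class and variable names here are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTLSharedEncoder(nn.Module):
    """MTL-E sketch: one shared encoder E, one classification head f_k per task.
    A toy embedding encoder stands in for the transformer encoder."""
    def __init__(self, vocab_size=1000, dim=64, num_tasks=2):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, dim)              # stand-in for E(x)
        self.heads = nn.ModuleList(
            [nn.Linear(dim, 2) for _ in range(num_tasks)])        # heads f_k

    def forward(self, token_ids):
        h = self.encoder(token_ids).mean(dim=1)                   # pooled sentence vector
        return [head(h) for head in self.heads]                   # one logit pair per task

def mtl_loss(logits, labels, lam=0.5):
    """L = lam * L1 + (1 - lam) * L2, each a cross-entropy loss."""
    l1 = F.cross_entropy(logits[0], labels[0])
    l2 = F.cross_entropy(logits[1], labels[1])
    return lam * l1 + (1 - lam) * l2

model = MTLSharedEncoder()
x = torch.randint(0, 1000, (4, 12))      # batch of 4 sentences, 12 token ids each
y_hyp = torch.randint(0, 2, (4,))        # hyperbole labels
y_met = torch.randint(0, 2, (4,))        # metaphor labels
loss = mtl_loss(model(x), [y_hyp, y_met], lam=0.5)
loss.backward()                          # one step updates encoder and both heads
```

The single backward pass through the weighted sum is what lets both labels shape the shared encoder representation.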

Datasets and Annotations
In this section, we delve into the hyperbole and metaphor datasets used and their annotation details.

Hyperbole Datasets
Our experiments used two hyperbole datasets: HYPO (Troiano et al., 2018) and HYPO-L (Zhang and Wan, 2022). The HYPO dataset contains 709 hyperbolic sentences, each with a corresponding paraphrased literal sentence, resulting in 1,418 sentences. The HYPO-L dataset includes 1,007 hyperbolic sentences and 2,219 paraphrased sentences. For each sentence in the HYPO and HYPO-L datasets, we added metaphor labels. Table 1 shows the statistics of the annotated hyperbole datasets.

Annotation Details
We employed four annotators proficient in English, in the age group of 24-30. Three annotators were master's students and one had an M.A. in linguistics. They were provided with detailed annotation instructions along with examples of hyperbole and metaphor. Each instance of the dataset was annotated once, and the annotations were equally divided among the four annotators. We first conducted pilot studies with 100 randomly sampled sentences from each dataset before proceeding to the final annotation. The Inter-Annotator Agreement (IAA) was computed using pairwise Cohen's Kappa (κ) and Fleiss' Kappa (K), as reported in Table 3. The IAA between any two annotators is above 0.60 (0.61 ≤ κ ≤ 0.80; Cohen (1960)), indicating substantial agreement between them. The Fleiss' Kappa score of 0.674 is also considered substantial (0.61 ≤ K ≤ 0.80; Landis and Koch (1977)).
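For reference, pairwise Cohen's kappa can be computed directly from its definition, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The labels below are illustrative, not the paper's annotations.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' labels: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n               # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[c] * cb[c] for c in set(a) | set(b)) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Illustrative binary labels from two hypothetical annotators.
ann1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
ann2 = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
print(round(cohen_kappa(ann1, ann2), 3))  # → 0.6
```

Here the two annotators agree on 8 of 10 items (p_o = 0.8) with chance agreement p_e = 0.5, giving κ = 0.6, i.e. "substantial" on the Landis and Koch scale.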
To ensure the quality of annotations, we randomly sampled 1,100 instances with an equal split of hyperbole and metaphor labels across all datasets. The annotators were asked to mark sentences as hyperbolic if there was any exaggeration and as metaphoric if there was any implicit comparison. In addition to giving binary labels, we also asked the annotators to mark the part of the sentence that influenced their decision. This helped us identify any discrepancies in their understanding and correct them. All four annotators received stipends suitable for the task.

Table 5: Comparison of Transformer models using 10-fold cross-validation over three different runs for the hyperbole and metaphor detection tasks on the HYPO-L dataset. Significance test (t-test) p-value (*) = 0.0438 (< 0.05).

Experiments
We conduct four experiments: 1) Comparing STL and MTL-F on hyperbole and metaphor datasets, 2) Comparing STL, MTL-E, and MTL-F models, 3) Obtaining sentence-level benchmark results on the metaphor dataset, and 4) Comparing with established baselines for the hyperbole dataset.
For our experiments, we used label-balanced metaphor datasets to address the imbalance caused by the smaller number of hyperboles (see Appendix A.2). To ensure a fair comparison, we used the mean of 10-fold cross-validation obtained over three different runs to compare our models. However, we did not compare our results with existing work on metaphor detection, as it performs token-level rather than sentence-level metaphor prediction. Finally, we used simple models to highlight the efficacy of a multi-task framework for a sophisticated task. Troiano et al. (2018) used cognitive features, such as imageability, unexpectedness, polarity, subjectivity, and emotional intensity, for hyperbole detection, referred to as QQ (i.e., Qualitative and Quantitative). We compare our results with their best-performing Logistic Regression and Naive Bayes models, referred to as LR+QQ and NB+QQ in Table 9. Kong et al. (2020) used a combination of the QQ features and a pre-trained BERT, referred to as BERT base +QQ in Table 9. The QQ features were concatenated with BERT's output and passed through a linear classifier to predict hyperbole. Biddle et al. (2021) used literal paraphrases as privileged information and incorporated this information using a triplet loss. We refer to this model as BERT base +PI in Table 9. We show that our multi-task model outperforms all these baselines.

Experimental Setup
We experiment with bert-large-uncased (BERT lg ) (Devlin et al., 2018), albert-xxlarge-v2 (ALBERT xxl2 ) (Lan et al., 2020), and roberta-large (RoBERTa lg ) (Liu et al., 2019) models (h = 16, l = 24). The best-performing models use the following hyperparameters: for the STL model, a learning rate of 1e-4 for 5 epochs with a batch size of 16; for the MTL-E model, a learning rate of 1e-5 for 20 epochs, a batch size of 32, and a loss weighting factor λ of 0.5; for the MTL-F model, a learning rate of 1e-5 for 10 epochs and a batch size of 16. We use Adam (Kingma and Ba, 2015) with eps of 1e-4 to optimize all our models.

Hypothesis Testing
We used the t-test, a statistical test that determines whether there is a significant difference between the means of two groups. The p-value is a statistical measure used to assess the evidence against a null hypothesis; a p-value < 0.05 is typically considered statistically significant. The null hypothesis to reject here is that the samples for the STL and MTL-F models come from the same distribution.
For all our experiments, we obtain a p-value < 0.05 indicating that the samples are indeed coming from different distributions. This shows that the improvement obtained by the MTL-F model over the STL model is statistically significant.
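A sketch of this test with SciPy, using synthetic F1 scores in place of the paper's 30 cross-validation samples (10 folds × 3 runs):

```python
from scipy.stats import ttest_ind

# Illustrative F1 scores over 30 cross-validation runs (10 folds x 3 runs);
# these numbers are synthetic, not the paper's measurements.
stl_f1  = [0.85, 0.86, 0.84, 0.87, 0.85, 0.86, 0.84, 0.85, 0.86, 0.85] * 3
mtlf_f1 = [0.88, 0.89, 0.87, 0.90, 0.88, 0.89, 0.87, 0.88, 0.89, 0.88] * 3

t_stat, p_value = ttest_ind(mtlf_f1, stl_f1)
# Reject the null hypothesis (same distribution) when p < 0.05.
print(f"t = {t_stat:.2f}, p = {p_value:.4g}, significant = {p_value < 0.05}")
```

With a consistent gap between the two score lists, the test rejects the null hypothesis, mirroring the conclusion drawn for the MTL-F vs. STL comparison.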
Results

STL vs. MTL-F models We use identical experimental setups to compare the results obtained from the STL and MTL-F approaches on all four datasets.
1. HYPO results: The comparative analysis results for the HYPO dataset are in Table 4. For all the models we observe that the MTL-F performs better than the corresponding STL. Overall the RoBERTa lg MTL-F model achieves the best recall of 0.884 and F1 of 0.881 (1.96% ↑) for hyperbole detection and a p-value of 0.0322.
2. HYPO-L results: The comparative analysis results for the HYPO-L dataset are in Table 5. For all the models we observe that the MTL-F performs better than the corresponding STL for hyperbole detection. Overall the RoBERTa lg MTL-F model achieves the best precision of 0.706, recall of 0.668, and F1 of 0.687 (2.99% ↑) for hyperbole detection and a p-value of 0.0438.
3. TroFi results: The comparative analysis results for the label-balanced TroFi dataset are in Table 6. For all the models we observe that the MTL-F performs better than the corresponding STL for metaphor detection. Overall the RoBERTa lg MTL-F model achieves the best precision of 0.565, recall of 0.587, and F1 of 0.573 (16.93% ↑) for metaphor detection and a p-value < 0.0001.
4. LCC results: The comparative analysis results for the label-balanced LCC dataset are in Table 7. For all the models we observe that the MTL-F performs better than the corresponding STL for metaphor detection. Overall the RoBERTa lg MTL-F model achieves the best recall of 0.812, and F1 of 0.805 (1.38% ↑) for metaphor detection and a p-value of 0.0221.

Table 8: Comparison of STL, MTL-E and MTL-F models using 10-fold cross-validation over three different runs on the HYPO dataset for hyperbole detection and the label-balanced LCC dataset for metaphor detection. The metaphor column gives the benchmark results (sentence-level) on the label-balanced LCC dataset.

Table 9: HYPO results. Precision (P), recall (R) and F1 score for baseline models compared to our work.
We observe: a) The MTL-F model helps in achieving generalization in the presence of both hyperbole and metaphor labels. b) The p-values (30 samples) suggest that the MTL-F results are statistically significant over the STL results with 95% confidence for all the datasets (Appendix 6.4).

STL vs. MTL-E vs. MTL-F models Table 8 reports the comparison of these three models on the HYPO and LCC datasets for hyperbole and metaphor detection respectively. We observe that, in comparison to the STL model, the MTL-E model performs better in general, whereas the MTL-F model performs significantly better, achieving the best F1 scores of 0.881 and 0.805 on the HYPO and LCC datasets respectively (see Appendix A.3).

Benchmark Results We report the benchmark results for sentence-level detection on the label-balanced LCC dataset in Table 8 (see the Metaphor column). Our RoBERTa lg MTL-F model achieves the best recall of 0.812 and F1 of 0.805.

Baseline Comparison Table 9 reports the comparison of our work with baseline models on the HYPO dataset for hyperbole detection. Our RoBERTa lg MTL-F model achieves the best recall of 0.884 (8.59% ↑) and F1 of 0.881 (12.03% ↑) as compared to the recall of 0.814 and F1 of 0.781 of the state-of-the-art system.

Analysis
We divide our analysis into two subsections: 1) A comparison of the STL and MTL-F models, and 2) Error analysis of the MTL-F model.

Comparative Analysis
Under similar experimental setups, we compare the STL and MTL-F models on example sentences obtained from the different test sets of the cross-validation runs on the HYPO dataset, as shown in Table 10. We consider the following four cases: 1. Hyperbolic and Metaphoric: "They cooked a turkey the size of a cow" is both hyperbolic and metaphorical. Here, the exaggeration is evident, as the size of the turkey is being compared to that of a cow, which allows both the STL and MTL-F models to make correct hyperbole predictions. However, for metaphor prediction, the MTL-F model correctly identifies the implicit meaning of "size being big" under the influence of the correct hyperbole label, while the STL model fails to do so.
Next, for the example sentence, "Your plan is too risky, it's a suicide," the exaggeration and the metaphoricity are very intricate. The words risky and suicide make it difficult for the STL model to detect the labels, but the MTL-F model accurately identifies them. This can be attributed to the MTL-F model's ability to learn from both labels.
2. Non-Hyperbolic and Non-Metaphoric: In some cases, the STL model may incorrectly classify sentences that are non-hyperbolic and non-metaphoric due to ambiguous language. For example, in the sentence "I'm not staying here any longer!" the words staying and longer may give the impression of exaggeration, causing the STL model to incorrectly classify it as hyperbolic.
However, the MTL-F model, by learning both hyperbole and metaphor detection simultaneously, is able to identify such cases as non-hyperbolic. Similarly, in "My ex boyfriend! Treacherous person!" the word treacherous may lead the STL model to incorrectly classify it as hyperbolic and metaphoric, but the MTL-F model classifies it correctly.
3. Hyperbolic and Non-Metaphoric: For this category, we notice that similes can cause confusion. For instance, in the sentence "This kind of anger rages like a sea in a storm," anger is explicitly compared to sea in a storm through the word like. The MTL-F model is able to distinguish this as a simile, whereas the STL model fails to do so.
4. Non-Hyperbolic and Metaphoric: Here we observe that the use of figurative language is subtle. For instance, in "Her strength awoke in poets an abiding love," awoke is used metaphorically, which is correctly identified by both the STL and MTL-F models. However, the STL model incorrectly tags it as hyperbolic, while the MTL-F model learns to identify such sentences as non-hyperbolic.

Analysis of attention weights:
Additionally, we examine the attention weights from the final layer to gain insight into the performance of the MTL-F model compared to the STL model. We use the weights associated with the [CLS] / <s> token ([CLS] for BERT and <s> for RoBERTa), normalized over all the attention heads.
First, we compare the STL and MTL-F models for the task of hyperbole detection. Figure 4 shows the attention-weight comparison for example sentences. For the sentence "Hope deferred makes the heart sick," we observe that the MTL-F model focuses on the words heart and sick that indicate exaggeration, while the STL model focuses on other, irrelevant words. Similarly, for "Books are food for avid readers," the MTL-F model correctly focuses on the words Books, food and readers. This suggests that the MTL-F model is better at paying attention to relevant words in the sentence due to its knowledge of both hyperbole and metaphor detection.
Next, for metaphor detection, the presence of hyperbole labels during training helps the MTL-F to learn to correctly attend to relevant tokens. For example, in "After workout I feel I could lift a sumo wrestler," the MTL-F focuses on the words lift and wrestler to correctly identify it as metaphoric. Similarly, for "Seeing my best friend again would mean the world to me," the MTL-F pays the maximum attention to the words would, mean, and world which is the reason for metaphoricity here.
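The analysis above can be sketched as follows. This is an illustrative snippet: a random tensor stands in for the final layer's attention (in practice it would come from the model's attention outputs, e.g. `output_attentions` in Hugging Face models), and the token list is our own example.

```python
import torch

# Final-layer attention for one sentence: shape (n_heads, seq_len, seq_len),
# each row a softmax distribution over the tokens attended to.
torch.manual_seed(0)
n_heads, seq_len = 16, 8
attn = torch.softmax(torch.randn(n_heads, seq_len, seq_len), dim=-1)

cls_attn = attn[:, 0, :]          # each head's attention from [CLS]/<s> (position 0)
mean_attn = cls_attn.mean(dim=0)  # normalize (average) over all attention heads

tokens = ["<s>", "Books", "are", "food", "for", "avid", "readers", "</s>"]
top = mean_attn.argsort(descending=True)[:3]
print([tokens[int(i)] for i in top])  # the tokens the model attends to most
```

Averaging the [CLS]/<s> row over heads yields one weight per token, which is what the heat-map comparison between the STL and MTL-F models visualizes.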

Error Analysis
We also analyzed the misclassifications for the MTL-F model, some of which have been included in Table 11. We observe that the primary reason for misclassifications in the MTL-F model is the lack of context in identifying the exaggeration or metaphoricity. For instance, "What kind of sorcery is this?" is a commonly used figurative sentence but the absence of any context makes it difficult for the MTL-F model to classify it correctly as both hyperbolic and metaphoric.
Next, we found cases such as "You're grumpy," where the MTL-F model tags them incorrectly as metaphoric. Such mistakes could be attributed to the model learning to identify implicit comparisons but failing to identify that grumpy here is an attribute not a comparison.

Conclusion and Future work
We have presented a novel multi-tasking approach to the detection of hyperboles and metaphors. We augmented the annotations of two hyperbole datasets with metaphor labels and that of two metaphor datasets with hyperbole labels. This allowed multi-task learning of metaphor and hyperbole detection, which outperforms single-task learning on both tasks. We establish a new SOTA for hyperbole detection and a new benchmark for sentence-level metaphor detection. The take-away message is that metaphor and hyperbole detection help each other and should be done together.
We plan to extend our framework of exploiting linguistic relatedness and thereby creating MTL detection systems, to all forms of figurative languages like proverbs, idioms, humour, similes, and so on.

Limitations
The scope of this work is limited to sentence-level detection due to the absence of any span-level annotated datasets for hyperbole detection. Also, we could only partially annotate the metaphor datasets due to resource constraints. Finally, we did not try sophisticated large language models in our work as our goal was to demonstrate the effectiveness of multitasking using a simple model, rather than to test the performance of more sophisticated models.

Ethical Considerations
We perform our experiments on existing hyperbole and metaphor datasets by adding additional labels to them. Some of the examples in these datasets use slurs, abuses, and other derogatory terms to bring out exaggeration or implicit comparison. Our models may also propagate these unintended biases due to the nature of the datasets. We urge the research community to use our models and these datasets with caution and we are fully committed to removing discrepancies in the existing hyperbole and metaphor datasets in the future.

A.2 Label Balanced Metaphor Datasets
As discussed in Section 6, we used label-balanced metaphor datasets to address the imbalance caused by the smaller number of hyperboles.
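A plausible reading of this balancing step is simple undersampling of the majority class; the sketch below is our illustration, not the authors' code, and the example data is synthetic.

```python
import random

def label_balance(examples, seed=0):
    """Undersample the majority class so both labels are equally frequent."""
    random.seed(seed)
    pos = [e for e in examples if e["label"] == 1]
    neg = [e for e in examples if e["label"] == 0]
    n = min(len(pos), len(neg))                      # size of the minority class
    balanced = random.sample(pos, n) + random.sample(neg, n)
    random.shuffle(balanced)
    return balanced

# Synthetic dataset: 25 positive / 75 negative instances.
data = [{"text": f"s{i}", "label": int(i % 4 == 0)} for i in range(100)]
balanced = label_balance(data)
print(len(balanced))  # → 50
```

After balancing, both labels contribute equally to the cross-validation folds, so metrics are not dominated by the majority (non-hyperbolic) class.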

A.3 STL vs. MTL-E vs. MTL-F models
A detailed comparison of the STL, MTL-E and MTL-F models is reported in Table 14 and Table 15.
For hyperbole detection we used the HYPO dataset, whereas for metaphor detection we used the label-balanced LCC dataset. Table 8 in the paper reports the comparison of only the best-performing models for brevity.

B Case study: Hyperbole and Metaphor Detection using LLMs
We tested the zero-shot capabilities of recent LLMs for hyperbole and metaphor detection. We experimented with a varied set of prompts on the open-source LLMs LLaMA (Touvron et al., 2023) and BLOOM (Scao et al., 2022). We found that they do not have a deep understanding of hyperboles and metaphors and performed poorly on the classification task. Our observations with ChatGPT are as follows. ChatGPT had a good understanding of what hyperbole and metaphor are and was able to classify popular examples correctly. However, it struggled to classify sentences that had both hyperbole and metaphor properties, and it also failed to identify many hyperboles. When explicitly told that it had made a mistake and that a particular part of the sentence indicated hyperbole or metaphor, it was able to correct itself and explain the reason. We observed such issues for all combinations of the presence or absence of metaphor and hyperbole. Example instances demonstrating this are shown below.

B.1 Example 1
User: Can you detect if the sentence has hyperbole or metaphor or both in it? Sentence: That sharp bend is a death-trap for motorists.
ChatGPT: Yes, I can detect that the sentence contains a metaphor. The sentence "That sharp bend is a death-trap for motorists" is a metaphor because it is comparing the sharp bend to a death-trap, suggesting that the bend is extremely dangerous and potentially deadly for motorists. The metaphor is used to create a vivid and impactful image in the reader's mind.
User: Does it have a hyperbole?
ChatGPT: No, the sentence does not contain hyperbole. Hyperbole is an exaggerated statement or claim not meant to be taken literally. The sentence "That sharp bend is a death-trap for motorists" is not an exaggeration, but rather a metaphorical comparison between the sharp bend and a death-trap.
User: Isn't "death-trap" a hyperbole?
ChatGPT: You are correct, "death-trap" in this context can be considered a hyperbole...

B.2 Example 2
User: Can you detect if the sentence has hyperbole or metaphor in it? Sentence: Retailers, meanwhile, are trying out a number of ways to target these shoppers.
ChatGPT: The sentence "Retailers, meanwhile, are trying out a number of ways to target these shoppers" does not contain hyperbole or metaphor. It is a straightforward statement conveying information about the actions of retailers in targeting a particular group of shoppers.
User: Isn't the word "target" a metaphor here?
ChatGPT: You are correct that "target" in this sentence can be considered a metaphor. In this context, "target" is being used figuratively to mean that retailers are trying to aim their marketing efforts towards a particular group of shoppers. The word "target" is being used in a non-literal sense to describe the retailers' actions. Therefore, the sentence does contain a metaphor. Thank you for pointing that out.
It can be seen that in both examples, the model initially makes the wrong judgment about whether the sentence is a hyperbole or metaphor. It was able to correct itself only after its attention was drawn to the important word in the sentence. We have shown that the correct words receive more attention through our multi-task approach, which explains its better detection accuracy.

ACL 2023 Responsible NLP Checklist
A For every submission:

A1. Did you describe the limitations of your work?

Section 10
A2. Did you discuss any potential risks of your work?

Section 11
A3. Do the abstract and introduction summarize the paper's main claims?
Abstract and Section 1

A4. Have you used AI writing assistants when working on this paper?
Left blank.

B Did you use or create scientific artifacts?
Sections 5 and 6

B1. Did you cite the creators of artifacts you used?
Section 5 and Section 6.2

B2. Did you discuss the license or terms for use and / or distribution of any artifacts?
Not applicable. Left blank.

B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?
Not applicable. Left blank.

B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it?
Not applicable. Left blank.

B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.?
Section 5

B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created? Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be.
Section 5

C Did you run computational experiments?
Section 6.2

C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used?
Section A.1

The Responsible NLP Checklist used at ACL 2023 is adopted from NAACL 2022, with the addition of a question on AI writing assistance.