SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for Social Media NLP Research

Despite its relevance, the maturity of NLP for social media pales in comparison with general-purpose models, metrics and benchmarks. This fragmented landscape makes it hard for the community to know, for instance, given a task, which is the best performing model and how it compares with others. To alleviate this issue, we introduce a unified benchmark for NLP evaluation in social media, SuperTweetEval, which includes a heterogeneous set of tasks and datasets combined, adapted and constructed from scratch. We benchmarked the performance of a wide range of models on SuperTweetEval and our results suggest that, despite the recent advances in language modelling, social media remains challenging.


Introduction
There is overwhelming evidence that general-purpose NLP systems suffer a significant drop in performance when exposed to tasks in specialised domains. This has been shown in disparate domains such as the legal, medical and financial, to name a few. Specifically, for these domains, specialised language models (LMs) such as Legal-BERT, SciBERT, PubMedBERT, or BloombergGPT (Chalkidis et al., 2020; Beltagy et al., 2019; Gu et al., 2021; Wu et al., 2023) have been shown to exhibit lower perplexity and higher downstream performance across the board. The need for specialised LMs and corresponding evaluation benchmarks is exacerbated in social media, where, instead of (or rather, in addition to) a specialised domain, an NLP system has to deal with idiosyncrasies such as emoji (Miller et al., 2017; Cappallo et al., 2018; Barbieri et al., 2017a), poor capitalisation (Derczynski et al., 2015), vulgarisms and colloquialisms (Camacho-Collados et al., 2020), fast language change (Del Tredici et al., 2019; Loureiro et al., 2022a), and dynamic and platform-specific communicative structures (Bastos et al., 2013).
Therefore, unsurprisingly, a strand of Twitter-specific NLP research has produced what we would now consider de-facto models and datasets. On one hand, specialised LMs, either pre-trained on multilingual Twitter text alone (Nguyen et al., 2020; DeLucia et al., 2022; Barbieri et al., 2022b) or including social engagements (Zhang et al., 2022b); and on the other, joint Twitter-specific models and datasets such as TweetEval (Barbieri et al., 2020). However, one area where social media NLP research appears to be lagging behind is in providing appropriate resources for the current shift towards in-context learning (ICL) enabled by Large LMs (LLMs) (Min et al., 2022). Benchmarks such as TweetEval, while helpful, are constrained to tweet classification tasks, crucially neglecting, for instance, sequence tagging, generation and question answering (QA), which are not only more diverse, but also better test beds for LLMs and for comparing fine-tuning vs. ICL (Liu et al., 2022).
The contributions of this paper can be summarized as follows. First, we introduce the SUPERTWEETEVAL benchmark. SUPERTWEETEVAL fills an important gap in the current NLP landscape by unifying diverse social media tasks and datasets beyond tweet classification (e.g., NER, question answering or tweet similarity) into one single benchmark, which we argue will contribute to faster and more replicable experiments on Twitter. Our second contribution is a suite of experiments that serve two purposes: first, to establish baselines using fine-tuned and ICL approaches, and second, to analyse and produce insights from our experimental results, namely that overall better performance is obtained with fine-tuned masked language models than with equally sized text generation architectures, and that zero- and few-shot approaches generally struggle, as they are not easily adapted to certain social media tasks. In sum, our results show that SUPERTWEETEVAL is still challenging for general-purpose language models, regardless of their size and domain, and that specialised models are instead more competitive.

Related Work
The advent of general-purpose NLP systems (namely LMs) has led to a proliferation of unified benchmarks agglutinating diverse NLP tasks. The General Language Understanding Evaluation benchmark (Wang et al., 2018, GLUE) was one of the first efforts to provide a large-scale benchmark composed of diverse tasks such as natural language inference or textual similarity. GLUE was composed of relatively simple and homogeneous tasks, and saw automatic systems quickly reach human performance in some cases (Yang et al., 2019). Because of this, SuperGLUE (Wang et al., 2019) was developed in the same spirit but included a wider range of tasks and settings. Since then, other general-purpose benchmarks for language models, especially those of the new generation, have emerged, such as MMLU (Hendrycks et al., 2021) and BIG-Bench (Srivastava et al., 2022).
In terms of social media research, there are many tasks that require modelling textual content. TweetEval (Barbieri et al., 2020) was the first unified benchmark that agglutinated different tasks into the same benchmark. However, TweetEval is limited to tweet classification, including emotion recognition (Mohammad et al., 2018), emoji prediction (Barbieri et al., 2018a), irony detection (Van Hee et al., 2018), hate speech detection (Basile et al., 2019a), offensive language identification (Zampieri et al., 2019b), sentiment analysis (Rosenthal et al., 2017), and stance detection (Mohammad et al., 2016). Similar to the evolution of GLUE into SuperGLUE, the aim of this paper and SUPERTWEETEVAL is to construct a benchmark that is robust, large, and especially consisting of diverse tasks and settings for social media NLP research, and in particular, Twitter.

Datasets
SUPERTWEETEVAL includes a variety of Twitter-specific NLP tasks. For each of them, we include a relevant dataset that can be used as a proxy to test the performance of models on that task (more details and justifications for the selection of datasets used in this benchmark are provided in Appendix A.1). In this section, we describe the datasets used for each task, which we have split into three types: (1) existing datasets that have been included in the benchmark as they are; (2) datasets that have been adapted to suit the needs of the benchmark; and (3) datasets that we have constructed from scratch as part of this paper, which did not exist in previous literature.

Existing Datasets
Intimacy Analysis (TWEETINTIMACY) Intimacy is an important social aspect of language communication (Pei and Jurgens, 2020). We use the English subset of the MINT dataset (Pei et al., 2022), which contains 1,983 English tweets annotated with intimacy scores ranging from 1 to 5, with 1 meaning "Not intimate at all" and 5, "Very intimate".
Meaning-Shift Detection (TEMPOWIC) This task focuses on the understanding of language evolution through time, which, while a popular research topic (Luu et al., 2022; Agarwal and Nenkova, 2022), remains a challenging problem, especially on Twitter. In SUPERTWEETEVAL, we utilise TEMPOWIC (Loureiro et al., 2022b), a dataset comprising 3,294 tweets. Here, two tweets from different time periods and a target word are provided, and the goal is to recognise whether the target word's meaning is the same in the two tweets (binary classification).

Sentiment Classification (TWEETSENTIMENT)
Sentiment analysis has been extensively studied both in general (Medhat et al., 2014; Wankhade et al., 2022) and in the social media context (Barbieri et al., 2022a; Marcec and Likic, 2022). In SUPERTWEETEVAL, we utilise the data presented in SemEval 2017 Task 4, subtask C (Rosenthal et al., 2017). The data are formatted as a topic-based sentiment analysis task, where each tweet is given a sentiment label on a 5-point scale ('strongly negative', 'negative', 'negative or neutral', 'positive', 'strongly positive') regarding a specific target. In total, 43,011 tweets and 325 different topics are present.

Emotion Classification (TWEETEMOTION)
Similar to sentiment analysis, emotion classification has been a popular topic of study (Kušen et al., 2017; He et al., 2016) and has been used to better understand users' behaviours and intentions in social media (Son et al., 2022; Corbett and Savarimuthu, 2022). For our use case, we utilise the English subset of the 2018 SemEval Task 1: Affect in Tweets, subtask E-c (Mohammad et al., 2018). A total of 7,168 tweets are present and are labelled with one or more emotions based on their content. The labels are selected from a taxonomy of 11 emotions (plus neutral, indicating the absence of emotion) such as anger, fear, and joy.

Topic Classification (TWEETTOPIC) Topic classification is a method commonly used to perform targeted analysis on social media data. This is a challenging task due to the ever-increasing amount of data produced on social platforms (Weller, 2015; Stieglitz et al., 2018). In SUPERTWEETEVAL we use the multi-label setting of the TWEETTOPIC dataset (Antypas et al., 2022) for topic classification. The dataset consists of 6,837 tweets that have been assigned one or more topics. The taxonomy of topics was selected by a team of social media experts from Snap Inc.
and consists of 19 broad topics tailored for social media content, such as sports or music.

Question Answering (TWEETQA) As a generative task, we consider an abstractive question answering (QA) task on Twitter. To this end, we rely on TWEETQA (Xiong et al., 2019), where a tweet and a question are given as input and the answer to the question is the output. Note that the answer may not be explicitly included in the tweet. The dataset contains 9,489/1,086/1,203 tweets for the training/validation/test splits, respectively.

NER (TWEETNER7) For Named Entity Recognition (NER), we include the TWEETNER7 dataset (Ushio et al., 2022b). This dataset contains 6,837 tweets and seven different labels: person, location, corporation, creative work, group, product and event, while also offering a temporal split (train and test splits stemming from different time periods).

Adapted Datasets

Named Entity Disambiguation (TWEETNERD)
The original TWEETNERD dataset (Mishra et al., 2022) is a collection of tweets, a target phrase within each tweet, and the Wikidata entity ID to which the target phrase refers in the context of the tweet. To make the task more accessible for the evaluation of language models, we convert the dataset to a binary classification task: given the tweet, target phrase and a possible definition of the target phrase, the system's objective is to determine whether the provided definition aligns with the target phrase in the given context (positive) or not (negative).
First, to obtain positive instances with matching definitions, we use the definition provided for the gold Wikidata item ID. Then, we associate a negative instance with each positive instance. For this, a maximum of ten candidates were retrieved from the Wikidata API by searching for the target phrase of each positive instance. Then, candidates with low page views were eliminated to remove noise, and negative instances were chosen randomly from the pool of candidate definitions.
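The construction of positive and negative instances might be sketched as follows (function and variable names and the page-view threshold are assumptions of this sketch; the actual pipeline queries the Wikidata API rather than a pre-built candidate table):

```python
import random

def build_pairs(positives, candidate_defs, min_views=100, seed=42):
    """Sketch of the TWEETNERD positive/negative pair construction.

    `positives` maps (tweet, target phrase) -> gold Wikidata definition;
    `candidate_defs` maps a target phrase -> list of (definition, page_views)
    candidates, standing in for a Wikidata API lookup.
    """
    rng = random.Random(seed)
    pairs = []
    for (tweet, target), gold_def in positives.items():
        # Positive instance: the gold definition matches the target in context.
        pairs.append({"tweet": tweet, "target": target,
                      "definition": gold_def, "label": 1})
        # Candidate pool: top-10 candidates, low-page-view entries removed,
        # and the gold definition itself excluded.
        pool = [d for d, views in candidate_defs.get(target, [])[:10]
                if views >= min_views and d != gold_def]
        if pool:
            # Negative instance: a random non-matching definition.
            pairs.append({"tweet": tweet, "target": target,
                          "definition": rng.choice(pool), "label": 0})
    return pairs
```
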
Hate Speech Detection (TWEETHATE) The presence of hate speech in social media is an ever-increasing problem, with hateful content being spread in various online communities (Udanor and Anyanwu, 2019; Walther and McCoy, 2021). We utilise Measuring Hate Speech (Sachdeva et al., 2022), which consists of 39,565 manually annotated social media comments (YouTube, Reddit, Twitter). The coders were asked to annotate each entry on 10 different attributes, such as the presence of sentiment, respect, insults, and others, and also to indicate the target of the comment (e.g. age, disability). The authors use Rasch measurement theory (Rasch, 1960) to aggregate each annotator's rating for every label into a continuous value, which can then be mapped to a binary value.
For our needs, only entries extracted from Twitter were considered. Each tweet was assigned a label if at least two out of five annotators agreed on it. We opted against a majority rule in order to acquire a dataset that can be used to train models capable of handling real-world, complex data (Mohammad et al., 2018; Antypas et al., 2022). A small number of tweets with more than one label were discarded. The final dataset contains 7,168 tweets and eight different labels.
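The aggregation rule above (a label needs at least two of five annotators; tweets ending up with more or fewer than one agreed label are discarded) can be sketched as:

```python
from collections import Counter

def aggregate_labels(annotations, min_votes=2):
    """Sketch of the TWEETHATE label aggregation. `annotations` maps a
    tweet to the list of labels chosen by its (typically five)
    annotators; names and input format are assumptions of this sketch."""
    kept = {}
    for tweet, labels in annotations.items():
        # Labels that at least `min_votes` annotators agreed on.
        agreed = [lab for lab, n in Counter(labels).items() if n >= min_votes]
        if len(agreed) == 1:  # discard ambiguous or unlabeled tweets
            kept[tweet] = agreed[0]
    return kept
```
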
Question Generation (TWEETQG) We leverage the TWEETQA dataset and re-frame it as a question generation (QG) task: the tweet and the answer are used as the model input, while the question is the output.

New Datasets
In addition to the previous datasets, which were directly integrated into the benchmark with minimal preprocessing or adapted from existing ones, we also constructed from scratch two additional datasets that complement the initial list of tasks and datasets. These are emoji prediction over 100 labels (Section 3.3.1) and tweet similarity (Section 3.3.2).

Emoji Prediction (TWEETEMOJI100)
This task aims to expand on previous emoji classification problems with 20 classes (Barbieri et al., 2017b, 2018b) by introducing a more challenging dataset with 100 different emojis (TWEETEMOJI100). TWEETEMOJI100 consists of a more recent corpus, an important feature in the ever-evolving setting of social media, and takes into account a wider variety of emojis. It considers only tweets with a single emoji appearing at the end of the text; this emoji is removed and used as the label, creating a multi-class classification setting.
For the creation of the dataset, an existing large collection of tweets (37,083,959) (Loureiro et al., 2022a) is used to extract the hundred most frequent emoji. For each selected emoji, we collected 500 tweets per day for the period between 01-01-2020 and 01-01-2023. In total, 7,379,453 new tweets were gathered through the Twitter API, utilising the Twarc library (Summers, 2013).
Following tweet collection, we filtered out all entries that contained more than one emoji and entries where the emoji was not at the end of the tweet. To avoid highly similar tweets that may carry different emojis, we also removed near-duplicate entries. This is accomplished through a normalisation step in which (1) URLs and mentions are removed, and (2) entries considered duplicates based on their lemmatised form are discarded. Finally, colour variations of the heart, circle, and square emojis were ignored. All emojis present in TWEETEMOJI100 and their distribution can be found in Figure 1 of the Appendix.
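The filtering pipeline above can be sketched as follows (a simplified illustration: it assumes single-codepoint emoji passed in as a set, uses lowercasing as a crude stand-in for lemmatisation, and omits the colour-variant filtering):

```python
import re

def filter_corpus(tweets, emoji_set):
    """Sketch of the TWEETEMOJI100 filtering: keep tweets whose only
    emoji is the final character, strip it off as the label, and drop
    near-duplicates after normalisation."""
    seen, examples = set(), []
    for text in tweets:
        emojis = [c for c in text if c in emoji_set]
        # Exactly one emoji, and it must appear at the end of the tweet.
        if len(emojis) != 1 or not text.rstrip().endswith(emojis[0]):
            continue
        label = emojis[0]
        body = text.rstrip()[:-1].strip()
        # Normalise: drop URLs and mentions, then dedup on the result.
        norm = re.sub(r"(https?://\S+|@\w+)", "", body).lower().strip()
        if norm in seen:
            continue
        seen.add(norm)
        examples.append({"text": body, "label": label})
    return examples
```
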

Tweet Similarity (TWEETSIM)
Given the importance of textual similarity datasets in NLP (Cer et al., 2017) and the lack of such datasets in social media, we decided to construct a new dataset focused on tweet similarity. Given two tweets as input, the tweet similarity task consists of assigning them a score from 0 to 5 according to their similarity.
Sampling Similarly to the TEMPOWIC tweet sampling procedure, we followed the approach of Chen et al. (2021) to detect trending hashtags for the period between 2020 and 2021, based on the corpora collected for TimeLM (Loureiro et al., 2022a). Afterwards, we randomly selected a diverse set of hashtags and collected an additional sample of tweets featuring those hashtags (i.e., the most common hashtag appears in only 25 pairs). The resulting dataset features 1,000 tweet pairs, with the inclusion of 20% randomly paired tweets for completeness.
Annotation All the tweet pairs were then rated on a Likert-like scale by three independent annotators. All the annotators were native English speakers and were paid fairly through our institutional student job provider. The final inter-annotator agreement, as measured by pairwise Spearman correlation between annotators, was 0.70.
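The reported agreement can be reproduced in spirit as the average pairwise Spearman correlation between annotators. The sketch below uses the tie-free closed form rho = 1 - 6*sum(d^2)/(n*(n^2-1)) for simplicity; the actual ratings contain ties and would need average-rank handling (e.g. scipy.stats.spearmanr):

```python
from itertools import combinations

def pairwise_agreement(ratings):
    """Sketch of inter-annotator agreement: average Spearman correlation
    over all annotator pairs. `ratings` is a list of per-annotator score
    lists over the same items; assumes no tied scores."""
    def rho(a, b):
        def ranks(xs):
            order = sorted(range(len(xs)), key=lambda i: xs[i])
            r = [0] * len(xs)
            for rank, i in enumerate(order):
                r[i] = rank
            return r
        ra, rb = ranks(a), ranks(b)
        n = len(a)
        d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
        return 1 - 6 * d2 / (n * (n * n - 1))
    pairs = list(combinations(ratings, 2))
    return sum(rho(a, b) for a, b in pairs) / len(pairs)
```
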

SUPERTWEETEVAL: The Unified Benchmark
We convert all datasets presented in the previous section to the same JSON format, unifying them under the same notation, preprocessing and style. Table 1 provides an overview of each task and dataset, along with example entries.

Preprocessing
A common preprocessing pipeline is applied to all the collected datasets aiming to standardise them and provide a uniform and easy-to-use format.
Firstly, all URL links are masked as {URL}, both for privacy reasons and to concentrate the focus of our tasks on the main textual content of each tweet. Furthermore, all mentions of non-verified users are masked as @user to maintain their anonymity.
Finally, an attempt is made to unify feature and label/score naming to increase the datasets' ease of use.
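The masking steps above could be implemented along these lines (a sketch; the regular expressions and the verified-user allow-list are assumptions, not the benchmark's actual code):

```python
import re

def preprocess(text, verified_users=frozenset()):
    """Sketch of the shared preprocessing: mask URLs as {URL} and mask
    mentions of non-verified users as @user. `verified_users` is a
    hypothetical allow-list of verified handles left un-masked."""
    # Replace every URL with the {URL} placeholder.
    text = re.sub(r"https?://\S+", "{URL}", text)

    def mask(m):
        handle = m.group(0)
        # Keep verified handles; anonymise everyone else.
        return handle if handle[1:] in verified_users else "@user"

    return re.sub(r"@\w+", mask, text)
```
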

Evaluation Metrics
To unify evaluation metrics for better understandability, we selected and converted all metrics to a percentage-based 0-100 scale, in which higher scores represent better model predictions.

TWEETSENTIMENT The macro-averaged mean absolute error (MAE^M) is selected as the evaluation metric for the Sentiment Classification task. To better integrate it in our benchmark, we use 1 − MAE^M as our score and cap negative values to 0. In contrast to F1 scores, MAE^M (also used in the original SemEval competition) takes into account the order of the labels and provides a better understanding of the performance of the models.
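The capped 1 − MAE^M score described above can be sketched as follows (a minimal sketch; encoding the five ordinal labels as integers 0-4 and reporting on a 0-1 scale before conversion to percentages are assumptions):

```python
def sentiment_score(gold, pred):
    """Sketch of the TWEETSENTIMENT metric: macro-averaged mean absolute
    error over the ordinal classes, reported as 1 - MAE^M with negative
    values capped at 0. Labels are integers (e.g. 0-4)."""
    classes = sorted(set(gold))
    per_class = []
    for c in classes:
        # Mean absolute error restricted to examples whose gold label is c.
        errs = [abs(p - g) for g, p in zip(gold, pred) if g == c]
        per_class.append(sum(errs) / len(errs))
    mae_m = sum(per_class) / len(per_class)  # macro average over classes
    return max(0.0, 1.0 - mae_m)
```
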
TWEETEMOTION, TWEETTOPIC and TWEETNER7 For the multi-label classification tasks of TWEETEMOTION and TWEETTOPIC, the standard macro-averaged F1 score is used. Metrics like Accuracy and the Jaccard index were initially considered; however, macro-F1 encourages the development of models that are precise and accurate across all classes. Macro-averaged F1 is also used for the NER task (as in the original TWEETNER7 paper).
TWEETEMOJI100 Considering the large number of labels in the Emoji Classification task, and the fact that some emojis can be a close match for more than one entry, Accuracy at top 5 is selected as the official evaluation metric.
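Accuracy at top 5 reduces to checking whether the gold emoji appears among the five highest-ranked predictions; a minimal sketch (the ranked-prediction input format is an assumption):

```python
def accuracy_at_5(gold, ranked_preds):
    """Sketch of accuracy-at-top-5 for TWEETEMOJI100: a prediction counts
    as correct if the gold emoji appears among the model's five
    highest-ranked emoji for that tweet."""
    hits = sum(g in preds[:5] for g, preds in zip(gold, ranked_preds))
    return hits / len(gold)
```
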
TWEETSIM & TWEETINTIMACY For both regression tasks, Spearman's rank correlation is used as the main evaluation metric. This metric focuses on the relationship between predicted and actual ranks rather than on the absolute errors between them (as, e.g., Mean Absolute Error does). Negative values are capped to 0.
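A self-contained sketch of this capped Spearman score follows; it assumes tie-free inputs (ties get their first-seen rank here), whereas the benchmark presumably relies on a standard implementation such as scipy.stats.spearmanr with average-tie handling:

```python
def spearman_capped(gold, pred):
    """Sketch of the regression metric: Spearman's rank correlation
    (Pearson correlation computed on ranks) with negative values capped
    at 0. Assumes non-constant inputs."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rg, rp = ranks(gold), ranks(pred)
    n = len(gold)
    mg, mp = sum(rg) / n, sum(rp) / n
    cov = sum((a - mg) * (b - mp) for a, b in zip(rg, rp))
    var = (sum((a - mg) ** 2 for a in rg)
           * sum((b - mp) ** 2 for b in rp)) ** 0.5
    return max(0.0, cov / var)
```
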
TWEETNERD & TEMPOWIC For both binary classification tasks we use the Accuracy score. The classes in both datasets are relatively balanced (fully balanced in the case of TWEETNERD), and thus Accuracy provides a reliable metric.
TWEETHATE In this task we utilise a combination of micro- and macro-F1 scores as the evaluation metric. Specifically, we first group all entries classified as hate speech into one class and, together with the not-hate class, calculate the micro-F1 score.
Then, the macro-F1 over only the hate speech sub-classes is calculated. Finally, we report the average of these two scores. This "combined F1" score is selected because (1) it heavily weights the most important decision (whether a tweet is hateful or not) and (2) it does not unfairly penalise poor performance on low-frequency hate speech sub-classes.

TWEETQA & TWEETQG For the generative tasks, we employ the answer-span F1 score for TWEETQA, following Rajpurkar et al. (2016), and METEOR (Denkowski and Lavie, 2014) for TWEETQG, which has been shown to be a well-correlated metric for QG (Ushio et al., 2022a).
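The "combined F1" used for TWEETHATE can be sketched as follows (label names and the exact grouping details are assumptions of this sketch; note that micro-F1 over the binarised labels reduces to accuracy):

```python
def combined_f1(gold, pred, not_hate="not_hate"):
    """Sketch of the TWEETHATE 'combined F1': the micro-F1 of the binary
    hate vs. not-hate decision, averaged with the macro-F1 computed over
    the hate-speech sub-classes only."""
    # Micro-F1 on the binarised labels (all hate sub-classes grouped).
    bin_gold = [g == not_hate for g in gold]
    bin_pred = [p == not_hate for p in pred]
    micro = sum(g == p for g, p in zip(bin_gold, bin_pred)) / len(gold)
    # Macro-F1 over the hate sub-classes.
    subclasses = sorted({g for g in gold if g != not_hate})
    f1s = []
    for c in subclasses:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    macro = sum(f1s) / len(f1s)
    return (micro + macro) / 2
```
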

Statistics
Overall, SUPERTWEETEVAL consists of 255,170 tweets across twelve different datasets. For each task, we use the training/validation/test splits as presented in the corresponding resource paper (or in the resource released by its authors). Exceptions are the TWEETHATE, TWEETSIM and TWEETEMOJI100 tasks, for which new data splits were created. Table 2 displays the final distribution of tweets in each split for each task.

Experimental Setting
For the evaluation, we rely on the datasets and splits presented in Section 4.3. In particular, we evaluate all models on the test splits. Each dataset uses a different evaluation metric, as introduced in Section 4.2.

Naive Baselines
To establish a lower performance threshold for each task, naive baselines are also included.

Training The implementations provided by HuggingFace (Wolf et al., 2020) are used to train and evaluate all language models, while we utilise Ray Tune (Liaw et al., 2018) to optimise the number of epochs, learning rate, warmup steps, and batch size hyper-parameters. The hyper-parameter optimisation is done by performing a random search over 10 different runs for each model.

Zero & Few Shot
In addition to the fine-tuning experimental setting, two in-context learning settings are established: zero- and few-shot. Aiming to explore the challenges that arise when testing SUPERTWEETEVAL in such settings, we select the larger versions of the FlanT5 models (FlanT5 XL & FlanT5 XXL), OPT-IML 1.3B, text-ada-001 from OpenAI (a small version of GPT-3 (Brown et al., 2020)), and chat-gpt-3.5-turbo, and evaluate them on each task.
In both settings, we prompt the models three times and report the average result over the runs. Specifically, in the few-shot setting we sample, for each run, different examples extracted from the validation set of each task. The number of examples sampled depends on the given task: for regression and text generation tasks, we provide five random examples in each prompt, while for classification tasks we include one example per class, with a maximum of five examples. The prompts used in our experiments are, in their majority, based on the instructions used in the FlanT5 (Chung et al., 2022) and OPT-IML (Iyer et al., 2023) papers.
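The example-selection logic described above might look as follows (the function name, input format and prompt layout are assumptions of this sketch, not the paper's actual implementation):

```python
import random

def build_fewshot_prompt(instruction, val_examples, task_type,
                         seed=0):
    """Sketch of the few-shot prompt construction: five random validation
    examples for regression/generation tasks, or one example per class
    (capped at five) for classification. `val_examples` is a list of
    (text, answer) pairs."""
    rng = random.Random(seed)
    if task_type == "classification":
        shots, used = [], set()
        for text, answer in rng.sample(val_examples, len(val_examples)):
            if answer not in used:  # one example per class
                shots.append((text, answer))
                used.add(answer)
            if len(shots) == 5:     # capped at five examples
                break
    else:  # regression / generation
        shots = rng.sample(val_examples, min(5, len(val_examples)))
    demo = "\n".join(f"Input: {t}\nOutput: {a}" for t, a in shots)
    return f"{demo}\n{instruction}"
```
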
As a final note, we refrain from reporting zero/few-shot results on TWEETNER7 and TWEETEMOJI100, as our initial experiments were unsuccessful. This is mainly due to: (1) limitations of the models themselves (e.g. FlanT5 models are not trained with emojis); (2) evaluation difficulties (TWEETEMOJI100 is evaluated using Accuracy at top 5, which leads to complications in the few-shot setting, as only one emoji is included in the gold standard); and (3) issues that arose with the prompts tested (see Section C in the Appendix).

Results
Fine-tuning The results from the fine-tuning setting (Table 3) provide an indication of the level of difficulty of each task. Unsurprisingly, most models perform relatively well on simpler datasets such as TWEETHATE (best: 0.8254) and TWEETNERD (best: 0.8530). However, the majority of the tasks still offer an adequate challenge, with the best performing overall model (TimeLM LARGE) achieving an average score of 0.63 across all tasks tested. Specifically, the most difficult tasks appear to be TWEETEMOJI100 and TWEETQG, where all models perform below 0.5.

9 The IML version (Iyer et al., 2023) is selected as it is trained in a similar way to FlanT5.
10 https://openai.com/chatgpt
11 Detailed prompts for each task can be found in Appendix C.
Finally, regarding the model architectures tested, our results indicate that the RoBERTa models (and specifically TimeLM LARGE) display better performance across most tasks than their FlanT5 and OPT counterparts.
Zero & few shot When considering the results of our zero/few-shot experiments (Table 4), a clear trend is revealed: most models tested fail to achieve better, or even similar, results to those that were fine-tuned. An exception is chat-gpt-3.5-turbo, which in some tasks, such as TEMPOWIC and TWEETNERD, achieves similar, but still lower, performance than the fine-tuned models, while also achieving the best score on TWEETQA. However, its performance must be viewed with caution: due to its closed nature, there is a possibility that the GPT models may have already been trained on the datasets collected (including test splits), providing them an unfair advantage.
The difference in performance, particularly in the regression and ordinal classification tasks of TWEETINTIMACY and TWEETSENTIMENT, is significant. The best performing model, FlanT5 XXL, achieves, in a few-shot setting, scores of 29.96 and 25.72 on TWEETINTIMACY and TWEETSENTIMENT respectively, which is more than a 50% drop in performance compared to the scores achieved by the best performing fine-tuned model (68.95 and 54.65).

Analysis

Aiming to acquire a better understanding of the models' capabilities, as well as the challenges that the tasks present, we organise the tasks into smaller sub-benchmarks, or clusters, each aimed at testing a particular feature, and investigate performance on them. The clusters are defined based on the nature of each task as well as the features present in it.
Temporal. For this cluster, all the datasets that feature a temporal aspect are grouped together. In particular, we include those datasets that contain data splits from different time periods (TWEETNER7, TEMPOWIC, TWEETTOPIC, and TWEETNERD).

12 The goal of this paper is not to have the strongest models, but rather to evaluate models out-of-the-box. As such, there are tasks such as TWEETEMOJI100 that are not easily solved by the selected zero/few-shot models.
13 Due to the lack of zero/few-shot results for TWEETNER7, we added the subset temporal* that does not include NER.
Multi-label. In this cluster we include the TWEETTOPIC and TWEETEMOTION datasets, analysing the models' performance in multi-label classification.

Multi-class. Similar to the previous cluster, we consider the TWEETSENTIMENT and TWEETHATE datasets to evaluate the models in single-label, multi-class tweet classification.

Regression. For this cluster, we include the two regression tasks of TWEETSIM and TWEETINTIMACY, and also consider TWEETSENTIMENT (ordinal classification).

Target-based. We group together all datasets that provide information on a target word or entity that is used in the models' predictions (TWEETSENTIMENT, TEMPOWIC and TWEETNERD).

Table 5: Aggregated results over each test cluster for the best zero-shot, few-shot and fine-tuning methods.
Big-label. In this setting we include classification tasks (both multi- and single-label) that contain a high number of labels (TWEETEMOJI100 with 100 labels and TWEETTOPIC with 19 labels).

Generation. TWEETQA and TWEETQG were grouped together to create a setting for evaluating the generation capabilities of the models.

Disambiguation. As a final cluster, we consider the tasks TEMPOWIC and TWEETNERD, which share the goal of understanding and differentiating the meaning of a term across two contexts.
For this analysis, we selected the two best performing models in the zero- and few-shot settings (FlanT5 XXL, OPT-IML 1.3B) along with chat-gpt-3.5-turbo, and the best model of each architecture from the fine-tuning experiments (TimeLM LARGE, FlanT5 BASE, and OPT 350M). Table 5 displays the results for each cluster. Although comparison across clusters is not straightforward given the different evaluation metrics, the most challenging settings for all models tested appear to be the Big-label and Multi-label clusters, where no score greater than 60 is achieved. Finally, the results again highlight that in-context learning models (both zero- and few-shot) generally underperform compared to smaller models fine-tuned on the full training set, with even ChatGPT failing to attain the best score in any of the test clusters. In general, the fine-tuned TimeLM LARGE model achieves the best results across all non-generation clusters, with the fine-tuned FlanT5 model achieving the best results on the generation tasks.

Conclusion
In this paper, we have presented SUPERTWEETEVAL, a new social media NLP benchmark. It goes beyond simple tweet classification, including generative, sequence prediction and regression tasks, in addition to challenging classification tasks. The benchmark is aimed at unifying Twitter-based evaluation protocols and provides a realistic assessment of models in this difficult and important domain. Our evaluation highlighted the challenging nature of the benchmark and of the social media domain. In particular, the results show how recent LLMs struggle with the specialised nature of this domain, with smaller, fine-tuned and more specialised models being more competitive overall. In addition to its evaluation purposes, SUPERTWEETEVAL can also be used as the basis for multi-task learning, as well as for making use of already-trained models.

Limitations
The main issue of SUPERTWEETEVAL is its lack of language variety, as it focuses on English only. By making this first English benchmark publicly available, we hope to pave the way for future research extending the benchmark to other languages.
In terms of data, the benchmark only includes one dataset per task. We made this choice both (1) to make the benchmark simpler to understand, and (2) because for most tasks only a single dataset was available. As such, conclusions for individual tasks can be limited. Also, for some tasks it is hard to know the performance ceiling for models, often reported as human performance. While we could have attempted to provide an upper bound, we believe this is a challenging problem in itself, as human performance estimates are often unreliable as a performance indicator (Tedeschi et al., 2023).
Finally, our evaluation is focused on a simple setting comparing language models in supervised and zero/few-shot configurations, and only on a limited set of LLMs. We did not intend to provide a new model performing well on all tasks, but rather an assessment of current models under similar conditions. Because of this, we may not have provided the models with their optimal settings. We will release all evaluation scripts so that other researchers can easily evaluate their models' performance on SUPERTWEETEVAL.

Ethics Statement
Our work aims to contribute to and extend research on social media, and particularly on Twitter. We propose a unified benchmark that can be utilised to train and evaluate new social media models. The datasets collected in SUPERTWEETEVAL are under public licences and follow the rules of the Twitter API. Moreover, given that the data includes user-generated content, we are committed to respecting the privacy of users. This is achieved by applying the necessary preprocessing to remove user mentions (of non-verified users) and URLs that could be used to identify individuals. We also ensured that none of the dataset splits contain more than 50,000 tweets. Finally, regarding the annotation of TWEETSIM, fair payment was ensured for the annotators that took part in its collection.

A.1 Dataset Selection
All the datasets used in this benchmark were carefully selected based on their difficulty level, the coverage they offer, and their availability (licence). Below we discuss the selection process for the tasks for which a large pool of pre-existing datasets exists.
TWEETSENTIMENT We opted against using datasets that treat sentiment analysis as a plain classification (Rane and Kumar, 2018) or regression (Saad and Yang, 2019) task, and instead use TWEETSENTIMENT (aspect-based, on a five-point scale) as a more challenging setting for recent models and architectures. Finally, as it was part of a SemEval competition, TWEETSENTIMENT is an already well-known and well-tested dataset.
TWEETEMOTION Similarly to the sentiment analysis task, the dataset pool for TWEETEMOTION was rather limited when considering our needs (multi-label classification and Twitter-based data) (Balabantaray et al., 2012; Strapparava and Mihalcea, 2007). Again, the selected data is a well-tested SemEval dataset that fits the needs of our task.
TWEETNER7 Finally, the TWEETNER7 dataset was selected over similar ones (Rijhwani and Preotiuc-Pietro, 2020; Jiang et al., 2022; Derczynski et al., 2017) as it provides a larger taxonomy of entities, a larger dataset with a uniform distribution, and temporal characteristics (a more recent corpus), which are valuable for building the sub-clusters used in our analysis.

A.3 Annotator Guidelines of TWEETSIM
The dataset will be composed of pairs of tweets and a relatedness score. The annotation task will, therefore, consist of scoring how related or similar two tweets are according to the following scale:

(5) Tweets are equivalent, even if some minor details may differ (e.g., commenting about the same situation in different ways, one being more complete than the other, etc.).

(4) Almost equivalent, refers to the same situation/event/person but with possibly relevant differences, such as missing significant details.

(3) Not equivalent, but shares details about a similar situation/event/person. Could be tweets around a similar event but with a different emotion or sentiment towards it.

(2) Categorically related, tweets are on the same topic or category (e.g. sports, politics).

(1) Loosely related, there is something minor in common (e.g. same energy/sentiment, same type, etc.).

(0) No relation, the tweets do not have anything in common.
Please note that only exact numbers should be used (e.g. 2 or 3) without any decimal.

B Models Details C Zero-shot Prompts
We list the prompts used for each task in the zero-shot setting. For the few-shot setting we follow a similar approach, adding 2-5 examples (depending on the task) at the beginning of each prompt.

TEMPOWIC
Tweet 1: "In this bullpen, you should be able to ask why and understand why we do the things we do." @user #pitchstock2020 @user
Tweet 2: Castro needs to be the last bullpen guy to pitch.
Does the word "bullpen" mean the same thing in these two sentences? Options: [ yes, no ]

TWEETSIM

How similar are the following two tweets?
Tweet 1: India is With @republic #ImmortalSushant #FreeAnujNow #CantBlockRepublic #Nation_With_R_Bharat
Tweet 2: Trending On 5 Number Retweet And Comment For 1st #Nation_With_R_Bharat
Give the answer on a scale from 0 - 5, where 0 is "not similar at all" and 5 is "means the same thing".
TWEETNER7

Identify and categorize named entities in the following tweet:
Tweet: New music coming soon via @Columbia_Records . . . . . . # columbia # newmusic # photooftheday # listentothis @ Columbia Records UK URL

TWEETTOPIC

Which topics from the options below are present in the following tweet?
Tweet: Philadelphia clearly didn't take a page out of the @user Game 7 playbook of firing everything on net, make the opposing goalie beat you. There's 6 minutes left and the Flyers have 16 shots
Options: [ arts_&_culture, business_&_entrepreneurs, celebrity_&_pop_culture, diaries_&_daily_life, family, fashion_&_style,
TWEETNERD

Based on the tweet, is the definition of the target correct?
Tweet: No. 1 Eastern leads 3-0 at halftime against Shawnee #njsoccer @user
Target: Shawnee
Definition: city in Pottawatomie County, Oklahoma, United States
Options: [ yes, no ]

TWEETQA

Answer based on context:
Context: 5 years in 5 seconds. Darren Booth (@darbooth) January 25, 2013
Question: what site does the link take you to?
Answer:

TWEETQG

Write a question based on this tweet and context.
Tweet: 5 years in 5 seconds. Darren Booth (@darbooth) January 25, 2013
Context: vine
Question:

Table 2: Number of tweets in the train, validation (Valid.) and test splits for each of the tasks in SUPERTWEETEVAL.

Table 3: SUPERTWEETEVAL individual task results of selected models (FlanT5 SMALL, FlanT5 BASE, OPT 125M, OPT 350M, RoBERTa BASE, RoBERTa LARGE, TimeLM BASE, TimeLM LARGE, and naive baselines) in the fine-tuning setting.

Table 4: SUPERTWEETEVAL zero- & few-shot results. Best results for each task and setting are bolded. ChatGPT results (marked with *) are included for completeness. We refrained from highlighting ChatGPT results due to its closed and irreproducible nature, as well as the possibility that it has been directly trained on some of the test sets.