SaRoCo: Detecting Satire in a Novel Romanian Corpus of News Articles

In this work, we introduce a corpus for satire detection in Romanian news. We gathered 55,608 public news articles from multiple real and satirical news sources, composing one of the largest corpora for satire detection regardless of language and the only one for the Romanian language. We provide an official split of the text samples, such that training news articles belong to different sources than test news articles, thus ensuring that models do not achieve high performance simply due to overfitting. We conduct experiments with two state-of-the-art deep neural models, resulting in a set of strong baselines for our novel corpus. Our results show that the machine-level accuracy for satire detection in Romanian is quite low (under 73% on the test set) compared to the human-level accuracy (87%), leaving enough room for improvement in future research.


Introduction
According to its definition in the Cambridge Dictionary, satire is "a humorous way of criticizing people or ideas". News satire employs this mechanism in the form of seemingly legitimate journalistic reporting, with the intention of ridiculing public figures, politics or contemporary events (McClennen and Maisel, 2014; Peters and Broersma, 2013; Rubin et al., 2016). Although the articles pertaining to this genre contain fictionalized stories, the intent is not to mislead the public into thinking that the discussed subjects are real. On the contrary, satirical news articles are supposed to reveal their nature through the writing style and comedic devices employed, such as irony, parody or exaggeration. Thus, the intention behind the writing differentiates satirical news (Rubin et al., 2016) from fake news (Meel and Vishwakarma, 2019; Pérez-Rosas et al., 2018; Sharma et al., 2019). However, in some rare cases, the real intent might be deeply buried in the complex irony and subtleties of news satire (Barbieri et al., 2015a), which has the effect of fiction being deemed factual (Zhang et al., 2020). Even so, there is a clear distinction between satirical and fake news: in fake news, the intent is to deceive the readers into thinking that the news is real, while presenting fabricated facts to influence the readers' opinion. Since our study is focused on satire detection, we consider research on fake news detection to be out of our scope. At the same time, we acknowledge the growing importance of detecting fake news and the fact that accurately differentiating satirical from legitimate journalistic reports might serve as a starting point in controlling the spread of deceptive news (De Sarkar et al., 2018).
Satire detection is an important task that could be addressed prior to the development of conversational systems and robots that interact with humans. Certainly, the importance of understanding satirical (funny, ridiculous or ironic) text becomes obvious when we consider a scenario in which a robot performs a dangerous action because it takes a satirical comment of the user too literally. Given the relevance of the task for the natural language processing community, satire detection has already been investigated in several well-studied languages such as Arabic (Saadany et al., 2020), English (Burfoot and Baldwin, 2009; De Sarkar et al., 2018; Goldwasser and Zhang, 2016; Yang et al., 2017), French (Ionescu and Chifu, 2021), German (McHardy et al., 2019), Spanish (Barbieri et al., 2015b) and Turkish (Toçoglu and Onan, 2019). By its very definition, the satire detection task is tightly connected to irony and sarcasm detection. Research on these related tasks broadens the language variety with languages such as Arabic (Karoui et al., 2017), Chinese (Jia et al., 2019), Dutch (Liebrecht et al., 2013) and Italian (Giudice, 2018).
In this work, we introduce SaRoCo, the Satire detection Romanian Corpus, which comprises 55,608 news articles collected from various sources. To the best of our knowledge, this is the first and only data set for the study of Romanian satirical news. Furthermore, SaRoCo is also one of the largest data sets for satirical news detection, being surpassed by only two corpora, one for English (Yang et al., 2017) and one for German (McHardy et al., 2019). However, our corpus contains the largest collection of satirical news articles (over 27,000). These facts are confirmed by the comparative statistics presented in Table 1.
Along with the novel data set, we include two strong deep learning methods to be used as baselines in future works. The first method is based on low-level features learned by a character-level convolutional neural network (Zhang et al., 2015), while the second method employs high-level semantic features learned by the Romanian version of BERT (Dumitrescu et al., 2020). The gap between the human-level performance and that of the deep learning baselines indicates that there is enough room for improvement left for future studies. We make our corpus and baselines available.

Corpus
SaRoCo gathers both satirical and non-satirical news from some of the most popular Romanian news websites. The collected news samples were found in the public web domain, i.e., access is provided for free without requiring any subscription to the publication sources. The entire corpus consists of 55,608 samples (27,628 satirical samples and 27,980 non-satirical samples), having more than 28 million tokens in total, as illustrated in Table 2. Each sample is composed of a title (headline), a body and a corresponding label (satirical or non-satirical). As shown in Table 3, an article has around 515.24 tokens on average, with an average of 24.97 tokens for the headline. We underline that the labels are automatically determined, based on the fact that a publication source publishes either regular or satirical news, but not both. We provide an official split for our corpus, such that all future studies will use the same training, validation and test sets, easing the direct comparison with prior results. Following McHardy et al.
(2019), we use disjoint sources for training, validation and test, ensuring that models do not achieve high performance by learning author styles or topic biases particular to certain news websites. While crawling the public news articles, we selected the same topics (culture, economy, politics, social, sports, tech) and the same time frame (between 2011 and 2020) for all news sources to control for potential biases induced by uneven topic or time distributions across the satirical and non-satirical genres. After crawling satirical and non-satirical news samples, our first aim was to prevent discrimination based on named entities. The satirical character of an article should be inferred from the language use rather than specific clues, such as named entities. For example, certain sources of news satire show a preference towards mocking politicians from a specific political party, and an automated system might erroneously label a news article about a member of the respective party as satirical simply based on the presence of the named entity. Furthermore, we noticed that some Romanian politicians have certain mocking nicknames assigned in satirical news. In order to eliminate named entities, we followed a similar approach to the one used for the MOROCO (Butnaru and Ionescu, 2019) data set: all the identified named entities are replaced with the special token $NE$. Besides eliminating named entities, we also substituted all whitespace characters with spaces and replaced multiple consecutive spaces with a single space. A set of processed satirical and regular headlines is shown in Table 4.
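For illustration, the two preprocessing steps (named entity masking and whitespace normalization) can be sketched as follows. The helper below is our own minimal sketch: the paper does not specify which NER tool produced the entity spans, so we assume they are supplied as a list of surface forms.

```python
import re

def anonymize_and_normalize(text, entities):
    # Replace each detected named entity with the special token $NE$.
    # `entities` is a list of entity surface forms, assumed to come from
    # an external NER system (not specified here).
    for ent in entities:
        text = text.replace(ent, "$NE$")
    # Substitute all whitespace characters (tabs, newlines, etc.) with
    # spaces and collapse consecutive spaces into a single space.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `anonymize_and_normalize("Ion Popescu\t a spus   ceva", ["Ion Popescu"])` yields `"$NE$ a spus ceva"` (the name here is a made-up placeholder, not an actual corpus sample).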

Baselines
Fine-tuned Ro-BERT. Our first baseline consists of a fine-tuned Romanian BERT (Dumitrescu et al., 2020), which follows the same transformer-based model architecture as the original BERT (Devlin et al., 2019). According to Dumitrescu et al. (2020), the Romanian BERT (Ro-BERT) attains better results than the multilingual BERT on a range of tasks. We therefore expect Ro-BERT to be a stronger baseline for our Romanian corpus.
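The classification head we attach on top of the pre-trained encoder, namely average pooling over token embeddings followed by a two-unit softmax layer, can be sketched in PyTorch as follows. This is a minimal illustration: the module name is ours, and the encoder output is simulated with random tensors instead of actual Ro-BERT embeddings.

```python
import torch
import torch.nn as nn

class MeanPoolSoftmaxHead(nn.Module):
    """Classification head on top of the Ro-BERT encoder: global average
    pooling over the 768-dim token embeddings (a CBOW-style sequence
    representation), followed by a two-unit softmax output layer."""

    def __init__(self, hidden_size=768, num_classes=2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, hidden_size)
        pooled = token_embeddings.mean(dim=1)  # global average pooling
        return torch.softmax(self.classifier(pooled), dim=-1)

head = MeanPoolSoftmaxHead()
probs = head(torch.randn(4, 128, 768))  # 4 dummy sequences of 128 tokens
preds = probs.argmax(dim=-1)            # final class label per sample
```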
We use the Ro-BERT tokenizer to encode each text sequence into a list of token IDs. The tokens are further processed by the model, obtaining the corresponding 768-dimensional embeddings. At this point, we add a global average pooling layer to obtain a Continuous Bag-of-Words (CBOW) representation for each sequence of text, followed by a Softmax output layer with two neural units, each predicting the probability for one category, either non-satirical or satirical. To obtain the final class label for a text sample, we apply argmax on the two probabilities. We fine-tune the whole model for 10 epochs on mini-batches of 32 samples, using the Adam optimizer with decoupled weight decay (AdamW) (Loshchilov and Hutter, 2019), with a learning rate of 10^-7 and the default value for ε.

Character-level CNN. The second baseline model considered in the experiments is a Convolutional Neural Network (CNN) that operates at the character level (Zhang et al., 2015). We set the input size to 1,000 characters. After the input layer, we add an embedding layer to encode each character into a vector of 128 components. The optimal architecture for the task at hand proved to be composed of three convolutional (conv) blocks, each having a conv layer with 64 filters applied at stride 1, followed by Scaled Exponential Linear Unit (SELU) activation. From the first block to the third block, the convolutional kernel sizes are 5, 3 and 1, respectively. Max-pooling with a filter size of 3 is applied after each conv layer. After each conv block, we insert a Squeeze-and-Excitation block with the reduction ratio set to r = 64, following Butnaru and Ionescu (2019). To prevent overfitting, we use batch normalization and Alpha Dropout (Klambauer et al., 2017) with a dropout rate of 0.5. The final prediction layer is composed of two neural units, one for each class (i.e. legitimate and satirical), with Softmax activation.
We use the Nesterov-accelerated Adaptive Moment Estimation (Nadam) optimizer (Dozat, 2016) with a learning rate of 2·10^-4, training the network for 50 epochs on mini-batches of 128 samples.
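The character-level CNN architecture described above can be sketched in PyTorch as follows. Several details are our assumptions for illustration purposes: the character vocabulary size (256), the exact placement of batch normalization within each block, and the global average pooling before the output layer.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block over channels, reduction ratio r."""
    def __init__(self, channels, r=64):
        super().__init__()
        hidden = max(channels // r, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(),
            nn.Linear(hidden, channels), nn.Sigmoid(),
        )

    def forward(self, x):                   # x: (batch, channels, length)
        w = self.fc(x.mean(dim=2))          # squeeze: average over length
        return x * w.unsqueeze(2)           # excitation: rescale channels

class CharCNN(nn.Module):
    """Sketch of the character-level CNN baseline: 128-dim character
    embeddings, three conv blocks (64 filters, stride 1, kernels 5/3/1,
    SELU, max-pooling of size 3, SE block with r=64, batch norm),
    alpha dropout (0.5) and a two-unit softmax output layer."""
    def __init__(self, vocab_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 128)
        blocks, in_ch = [], 128
        for k in (5, 3, 1):                 # kernel sizes per conv block
            blocks += [
                nn.Conv1d(in_ch, 64, kernel_size=k, stride=1),
                nn.SELU(),
                nn.MaxPool1d(3),
                SEBlock(64, r=64),
                nn.BatchNorm1d(64),
            ]
            in_ch = 64
        self.conv = nn.Sequential(*blocks)
        self.drop = nn.AlphaDropout(0.5)
        self.fc = nn.Linear(64, 2)

    def forward(self, char_ids):            # char_ids: (batch, 1000)
        x = self.embed(char_ids).transpose(1, 2)   # -> (batch, 128, 1000)
        x = self.conv(x).mean(dim=2)                # global average pooling
        return torch.softmax(self.fc(self.drop(x)), dim=-1)

model = CharCNN()
probs = model(torch.randint(0, 256, (2, 1000)))  # 2 dummy inputs
```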

Experiments
Evaluation. We conducted binary classification experiments on SaRoCo, predicting whether a given piece of text is satirical or non-satirical. As evaluation metrics, we employ the precision and recall for each of the two classes. We also combine these scores through the macro F1 and micro F1 (accuracy) measures.

Results. In Table 5, we present the results of the two baselines on the SaRoCo validation and test sets. We observe that both models tend to have higher precision scores in detecting satire than in detecting regular news. The trade-off between precision and recall is skewed towards higher recall for the non-satirical news class. Since both models share the same behavior, we conjecture that the behavior is caused by the particularities of the satire detection task.

Discriminative feature analysis. We analyze the discriminative features learned by the character-level CNN, which is one of the proposed baseline systems for satire detection. We opted for the character-level CNN in favor of the fine-tuned BERT, as the former method allows us to visualize discriminative features using Grad-CAM (Selvaraju et al., 2017), a technique that was initially used to explain decisions of CNNs applied on images. We adapted this technique for the character-level CNN, then extracted and analyzed the most predictive patterns in SaRoCo. The motivation behind this was to validate that the network's decisions are not based on biases that escaped our data collection and cleaning process.

Table 6: Examples of discriminative patterns in satirical news.
Category | Example | Translation
Slang | "cel mai marfă serial din lume" | "the dopest TV show in the world"
Slang | "cocalar" | "douche"
Insult | "odiosul primar" | "the odious mayor"
Insult | "bunicuţ retardat" | "retarded grandpa"
Insult | "dugongul ăla slinos de la sectorul 4" |
Exaggeration | "Ne-am săturat!" | "We're sick of it!"
Exclamation | "Ruşine să le fie!" | "Shame on them!"
Irony | "Chiar nu suntem o naţie de hoţi!" | "We're totally not a nation of thieves!"
Popular saying | "a sărit calul" | "went overboard"
Popular saying | "a făcut-o de oaie" | "messed up"
Popular saying | "minte de găină" | "bird brain"

Table 7: Examples of discriminative patterns in regular news.
Category | Example | Translation
Legal terms | "asasinat" | "assassinated"
Legal terms | "l-au denunţat pe autorul atacului" | "denounced the perpetrator"
Weather | "temperatura în timpul nopţii a scăzut" | "the temperature has dropped during the night"
Political terms | "scrutinul prezidenţial" | "presidential election"
Political terms | "prefectura informează că" | "prefecture informs that"
In Tables 6 and 7, we present a few examples of interesting patterns considered relevant for predicting satire versus regular news, respectively. A broad range of constructions covering a great variety of styles and significant words are highlighted via Grad-CAM in the satirical news samples. The network seems to pick up obvious clues such as slang, insults and popular sayings rather than more subtle indicators of satire, such as irony or exaggeration. At the same time, for the real news in SaRoCo, there are fewer categories of predictive patterns. In general, the CNN deems formal, standard news expressions as relevant for regular news. These patterns vary across topics and domains. The CNN also finds that the presence of numbers and statistical clues is indicative of non-satirical content, which is consistent with the observations of Yang et al. (2017). Our analysis reveals that the discriminative features are appropriate for satire detection, showing that our corpus is indeed suitable for the considered task.
Deep models versus humans. Given 100 satirical and 100 non-satirical news headlines (titles) randomly sampled from the SaRoCo test set, we asked ten Romanian human annotators to label each sample as satirical or non-satirical. We evaluated the deep learning methods on the same subset of 200 samples, reporting the results in Table 8. First, we observe that humans have a similar bias as the deep learning models. Indeed, for both humans and models, the trade-off between precision and recall is skewed towards higher precision for the satirical class and higher recall for the non-satirical class. We believe this is linked to the way people and machines make a decision. Humans look for patterns of satire in order to label a sample as satire. If a satire-specific pattern is not identified, the respective sample is labeled as regular, increasing the recall for the non-satirical class. Although humans and machines seem to share the same way of thinking, there is a considerable performance gap in satire detection between them. Indeed, the average accuracy of our ten human annotators is around 87%, while the state-of-the-art deep learning models do not surpass 68% on the same news headlines. Even on full news articles (see Table 5), the models barely reach an accuracy of 73% on the test set. Hence, we conclude there is a significant performance gap between humans and machines, leaving enough room for exploration in future work on Romanian satire detection.
We would like to emphasize that our human evaluation was performed by casual news readers, and the samples were shown after named entity removal, thus ensuring a fair comparison with the AI models. We underline that named entity removal makes the task more challenging, even for humans.

Conclusion
In this work, we presented SaRoCo, a novel data set containing satirical and non-satirical news samples. To the best of our knowledge, SaRoCo is the only corpus for Romanian satire detection and one of the largest corpora regardless of language. We trained two state-of-the-art neural models as baselines for future research on our novel corpus. We also compared the performance of the neural models with the averaged performance of ten human annotators, showing that the neural models lag far behind the human-level performance. Our discriminative feature analysis confirms the limitations of state-of-the-art neural models in detecting satire. Although we selected a set of strong models from the recent literature as baselines for SaRoCo, significant future research is necessary to close the gap with respect to the human-level satire detection performance. Designing models to pick up irony or exaggerations could pave the way towards closing this gap in future work.