TransCasm: A Bilingual Corpus of Sarcastic Tweets
Desline Simon | Sheila Castilho | Pintu Lohar | Haithem Afli
Proceedings of the LREC 2022 workshop on Natural Language Processing for Political Sciences
Sarcasm is extensively used in User Generated Content (UGC) in order to express one’s discontent, especially through blogs, forums, or social media such as Twitter. Several works have attempted to detect and analyse sarcasm in UGC. However, the lack of freely available corpora in this field makes the task even more difficult. In this work, we present “TransCasm” corpus, a parallel corpus of sarcastic tweets translated from English into French along with their non-sarcastic representations. To build the bilingual corpus of sarcasm, we select the “SIGN” corpus, a monolingual data set of sarcastic tweets and their non-sarcastic interpretations, created by (Peled and Reichart, 2017). We propose to define linguistic guidelines for developing “TransCasm” which is the first ever bilingual corpus of sarcastic tweets. In addition, we utilise “TransCasm” for building a binary sarcasm classifier in order to identify whether a tweet is sarcastic or not. Our experiment reveals that the sarcasm classifier achieves 61% accuracy on detecting sarcasm in tweets. “TransCasm” is now freely available online and is ready to be explored for further research.