IRMA: the 335-million-word Italian coRpus for studying MisinformAtion

Fabio Carrella, Alessandro Miani, Stephan Lewandowsky


Abstract
The dissemination of false information on the internet has received considerable attention over the last decade. Misinformation often spreads faster than mainstream news, thus making manual fact-checking inefficient or, at best, labor-intensive. Therefore, there is an increasing need to develop methods for automatic detection of misinformation. Although resources for creating such methods are available in English, other languages are often under-represented in this effort. With this contribution, we present IRMA, a corpus containing over 600,000 Italian news articles (335+ million tokens) collected from 56 websites classified as ‘untrustworthy’ by professional fact-checkers. The corpus is freely available and comprises a rich set of text- and website-level data, representing a turnkey resource to test hypotheses and develop automatic detection algorithms. It contains texts, titles, and dates (from 2004 to 2022), along with three types of semantic measures (i.e., keywords, topics at three different resolutions, and LIWC lexical features). IRMA also includes domain-specific information such as source type (e.g., political, health, conspiracy), credibility, and higher-level metadata, including several metrics of website incoming traffic that allow researchers to investigate users' online behavior. IRMA constitutes the largest corpus of misinformation available today in Italian, making it a valid tool for advancing quantitative research on untrustworthy news detection and ultimately helping limit the spread of misinformation.
Anthology ID:
2023.eacl-main.171
Volume:
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Andreas Vlachos, Isabelle Augenstein
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
2339–2349
URL:
https://aclanthology.org/2023.eacl-main.171
DOI:
10.18653/v1/2023.eacl-main.171
Cite (ACL):
Fabio Carrella, Alessandro Miani, and Stephan Lewandowsky. 2023. IRMA: the 335-million-word Italian coRpus for studying MisinformAtion. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2339–2349, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
IRMA: the 335-million-word Italian coRpus for studying MisinformAtion (Carrella et al., EACL 2023)
PDF:
https://aclanthology.org/2023.eacl-main.171.pdf
Video:
https://aclanthology.org/2023.eacl-main.171.mp4