TeDDi Sample: Text Data Diversity Sample for Language Comparison and Multilingual NLP

Steven Moran, Christian Bentz, Ximena Gutierrez-Vasques, Olga Pelloni, Tanja Samardzic


Abstract
We present the TeDDi sample, a diversity sample of text data for language comparison and multilingual Natural Language Processing. The TeDDi sample currently features 89 languages based on the typological diversity sample in the World Atlas of Language Structures. It consists of more than 20k texts and is accompanied by open-source corpus processing tools. The aim of TeDDi is to facilitate text-based quantitative analysis of linguistic diversity. We describe in detail the TeDDi sample, how it was created, data availability, and its added value through for NLP and linguistic research.
Anthology ID:
2022.lrec-1.123
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
SIG:
Publisher:
European Language Resources Association
Note:
Pages:
1150–1158
Language:
URL:
https://aclanthology.org/2022.lrec-1.123
DOI:
Bibkey:
Cite (ACL):
Steven Moran, Christian Bentz, Ximena Gutierrez-Vasques, Olga Pelloni, and Tanja Samardzic. 2022. TeDDi Sample: Text Data Diversity Sample for Language Comparison and Multilingual NLP. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1150–1158, Marseille, France. European Language Resources Association.
Cite (Informal):
TeDDi Sample: Text Data Diversity Sample for Language Comparison and Multilingual NLP (Moran et al., LREC 2022)
Copy Citation:
PDF:
https://aclanthology.org/2022.lrec-1.123.pdf
Data
OpenSubtitles