Denny Vrandečić

Also published as: Denny Vrandecic


2020

Wiki-40B: Multilingual Language Model Dataset
Mandy Guo | Zihang Dai | Denny Vrandečić | Rami Al-Rfou
Proceedings of the Twelfth Language Resources and Evaluation Conference

We propose a new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families. We hope this new resource, with around 40 billion characters, will accelerate research on multilingual modeling. We train monolingual causal language models using a state-of-the-art model (Transformer-XL), establishing baselines for many languages. We also introduce the task of multilingual causal language modeling, where we train our model on the combined text of 40+ languages from Wikipedia with different vocabulary sizes and evaluate on the languages individually. We released the cleaned-up text of 40+ Wikipedia language editions, the corresponding trained monolingual language models, and several multilingual language models with different fixed vocabulary sizes.

Introducing Lexical Masks: a New Representation of Lexical Entries for Better Evaluation and Exchange of Lexicons
Bruno Cartoni | Daniel Calvelo Aros | Denny Vrandecic | Saran Lertpradit
Proceedings of the Twelfth Language Resources and Evaluation Conference

The evaluation and exchange of large lexicon databases remains a challenge in many NLP applications. Despite the existence of commonly accepted standards for the format and the features used in a lexicon, there is still a lack of precise and interoperable specification requirements about what lexical entries of a particular language should look like, both in terms of the number of forms and in terms of the features associated with these forms. This paper presents the notion of “lexical masks”, a powerful tool used to evaluate and exchange lexicon databases in many languages.