Rami Al-Rfou’

Also published as: Rami Al-Rfou


2021

mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
Linting Xue | Noah Constant | Adam Roberts | Mihir Kale | Rami Al-Rfou | Aditya Siddhant | Aditya Barua | Colin Raffel
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. We also describe a simple technique to prevent “accidental translation” in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language. All of the code and model checkpoints used in this work are publicly available.

Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training
Oshin Agarwal | Heming Ge | Siamak Shakeri | Rami Al-Rfou
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Prior work on Data-To-Text Generation, the task of converting knowledge graph (KG) triples into natural text, focused on domain-specific benchmark datasets. In this paper, however, we verbalize the entire English Wikidata KG, and discuss the unique challenges associated with a broad, open-domain, large-scale verbalization. We further show that verbalizing a comprehensive, encyclopedic KG like Wikidata can be used to integrate structured KGs and natural language corpora. In contrast to the many architectures that have been developed to integrate these two sources, our approach converts the KG into natural text, allowing it to be seamlessly integrated into existing language models. It carries the further advantages of improved factual accuracy and reduced toxicity in the resulting language model. We evaluate this approach by augmenting the retrieval corpus in a retrieval language model and showing significant improvements on the knowledge intensive tasks of open domain QA and the LAMA knowledge probe.

nmT5 - Is parallel data still relevant for pre-training massively multilingual language models?
Mihir Kale | Aditya Siddhant | Rami Al-Rfou | Linting Xue | Noah Constant | Melvin Johnson
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Recently, mT5 - a massively multilingual version of T5 - leveraged a unified text-to-text format to attain state-of-the-art results on a wide variety of multilingual NLP tasks. In this paper, we investigate the impact of incorporating parallel data into mT5 pre-training. We find that multi-tasking language modeling with objectives such as machine translation during pre-training is a straightforward way to improve performance on downstream multilingual and cross-lingual tasks. However, the gains start to diminish as the model capacity increases, suggesting that parallel data might not be as essential for larger models. At the same time, even at larger model sizes, we find that pre-training with parallel data still provides benefits in the limited labelled data regime.

2020

Wiki-40B: Multilingual Language Model Dataset
Mandy Guo | Zihang Dai | Denny Vrandečić | Rami Al-Rfou
Proceedings of the 12th Language Resources and Evaluation Conference

We propose a new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families. With around 40 billion characters, we hope this new resource will accelerate the research of multilingual modeling. We train monolingual causal language models using a state-of-the-art model (Transformer-XL) establishing baselines for many languages. We also introduce the task of multilingual causal language modeling where we train our model on the combined text of 40+ languages from Wikipedia with different vocabulary sizes and evaluate on the languages individually. We released the cleaned-up text of 40+ Wikipedia language editions, the corresponding trained monolingual language models, and several multilingual language models with different fixed vocabulary sizes.

LAReQA: Language-Agnostic Answer Retrieval from a Multilingual Pool
Uma Roy | Noah Constant | Rami Al-Rfou | Aditya Barua | Aaron Phillips | Yinfei Yang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present LAReQA, a challenging new benchmark for language-agnostic answer retrieval from a multilingual candidate pool. Unlike previous cross-lingual tasks, LAReQA tests for “strong” cross-lingual alignment, requiring semantically related cross-language pairs to be closer in representation space than unrelated same-language pairs. This level of alignment is important for the practical task of cross-lingual information retrieval. Building on multilingual BERT (mBERT), we study different strategies for achieving strong alignment. We find that augmenting training data via machine translation is effective, and improves significantly over using mBERT out-of-the-box. Interestingly, model performance on zero-shot variants of our task that only target “weak” alignment is not predictive of performance on LAReQA. This finding underscores our claim that language-agnostic retrieval is a substantively new kind of cross-lingual evaluation, and suggests that measuring both weak and strong alignment will be important for improving cross-lingual systems going forward. We release our dataset and evaluation code at https://github.com/google-research-datasets/lareqa.

Machine Translation Aided Bilingual Data-to-Text Generation and Semantic Parsing
Oshin Agarwal | Mihir Kale | Heming Ge | Siamak Shakeri | Rami Al-Rfou
Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)

We present a system for bilingual Data-To-Text Generation and Semantic Parsing. We use a text-to-text generator to learn a single model that works for both languages on each of the tasks. The model is aided by machine translation during both pre-training and fine-tuning. We evaluate the system on WebNLG 2020 data, which consists of RDF triples in English and natural language sentences in English and Russian for both the tasks. We achieve considerable gains over monolingual models, especially on unseen relations and Russian.

2013

Polyglot: Distributed Word Representations for Multilingual NLP
Rami Al-Rfou’ | Bryan Perozzi | Steven Skiena
Proceedings of the Seventeenth Conference on Computational Natural Language Learning

2012

SpeedRead: A Fast Named Entity Recognition Pipeline
Rami Al-Rfou’ | Steven Skiena
Proceedings of COLING 2012