Édouard Grave

Also published as: Edouard Grave


2021

pdf bib
Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
Gautier Izacard | Edouard Grave
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Generative models for open domain question answering have proven to be competitive, without resorting to external knowledge. While promising, this approach requires to use models with billions of parameters, which are expensive to train and query. In this paper, we investigate how much these models can benefit from retrieving text passages, potentially containing evidence. We obtain state-of-the-art results on the Natural Questions and TriviaQA open benchmarks. Interestingly, we observe that the performance of this method significantly improves when increasing the number of retrieved passages. This is evidence that sequence-to-sequence models offers a flexible framework to efficiently aggregate and combine evidence from multiple passages.

pdf bib
Self-training Improves Pre-training for Natural Language Understanding
Jingfei Du | Edouard Grave | Beliz Gunel | Vishrav Chaudhary | Onur Celebi | Michael Auli | Veselin Stoyanov | Alexis Conneau
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Unlike previous semi-supervised methods, our approach does not require in-domain unlabeled data and is therefore more generally applicable. Experiments show that self-training is complementary to strong RoBERTa baselines on a variety of tasks. Our augmentation approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks. Finally, we also show strong gains on knowledge-distillation and few-shot learning.

pdf bib
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web
Holger Schwenk | Guillaume Wenzek | Sergey Edunov | Edouard Grave | Armand Joulin | Angela Fan
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We show that margin-based bitext mining in a multilingual sentence space can be successfully scaled to operate on monolingual corpora of billions of sentences. We use 32 snapshots of a curated common crawl corpus (Wenzel et al, 2019) totaling 71 billion unique sentences. Using one unified approach for 90 languages, we were able to mine 10.8 billion parallel sentences, out of which only 2.9 billions are aligned with English. We illustrate the capability of our scalable mining system to create high quality training sets from one language to any other by training hundreds of different machine translation models and evaluating them on the many-to-many TED benchmark. Further, we evaluate on competitive translation benchmarks such as WMT and WAT. Using only mined bitext, we set a new state of the art for a single system on the WMT’19 test set for English-German/Russian/Chinese. In particular, our English/German and English/Russian systems outperform the best single ones by over 4 BLEU points and are on par with best WMT’19 systems, which train on the WMT training data and augment it with backtranslation. We also achieve excellent results for distant languages pairs like Russian/Japanese, outperforming the best submission at the 2020 WAT workshop. All of the mined bitext will be freely available.

2020

pdf bib
Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau | Kartikay Khandelwal | Naman Goyal | Vishrav Chaudhary | Guillaume Wenzek | Francisco Guzmán | Edouard Grave | Myle Ott | Luke Zettlemoyer | Veselin Stoyanov
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code and models publicly available.

pdf bib
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
Guillaume Wenzek | Marie-Anne Lachaux | Alexis Conneau | Vishrav Chaudhary | Francisco Guzmán | Armand Joulin | Edouard Grave
Proceedings of the 12th Language Resources and Evaluation Conference

Pre-training text representations have led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora as long as its quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), that deduplicates documents and identifies their language. We augment this pipeline with a filtering step to select documents that are close to high quality corpora like Wikipedia.

2019

pdf bib
Don’t Forget the Long Tail! A Comprehensive Analysis of Morphological Generalization in Bilingual Lexicon Induction
Paula Czarnowska | Sebastian Ruder | Edouard Grave | Ryan Cotterell | Ann Copestake
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Human translators routinely have to translate rare inflections of words – due to the Zipfian distribution of words in a language. When translating from Spanish, a good translator would have no problem identifying the proper translation of a statistically rare inflection such as habláramos. Note the lexeme itself, hablar, is relatively common. In this work, we investigate whether state-of-the-art bilingual lexicon inducers are capable of learning this kind of generalization. We introduce 40 morphologically complete dictionaries in 10 languages and evaluate three of the best performing models on the task of translation of less frequent morphological forms. We demonstrate that the performance of state-of-the-art models drops considerably when evaluated on infrequent morphological inflections and then show that adding a simple morphological constraint at training time improves the performance, proving that the bilingual lexicon inducers can benefit from better encoding of morphology.

pdf bib
Adaptive Attention Span in Transformers
Sainbayar Sukhbaatar | Edouard Grave | Piotr Bojanowski | Armand Joulin
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to extend significantly the maximum context size used in Transformer, while maintaining control over their memory footprint and computational time. We show the effectiveness of our approach on the task of character level language modeling, where we achieve state-of-the-art performances on text8 and enwiki8 by using a maximum context of 8k characters.

pdf bib
Training Hybrid Language Models by Marginalizing over Segmentations
Edouard Grave | Sainbayar Sukhbaatar | Piotr Bojanowski | Armand Joulin
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this paper, we study the problem of hybrid language modeling, that is using models which can predict both characters and larger units such as character ngrams or words. Using such models, multiple potential segmentations usually exist for a given string, for example one using words and one using characters only. Thus, the probability of a string is the sum of the probabilities of all the possible segmentations. Here, we show how it is possible to marginalize over the segmentations efficiently, in order to compute the true probability of a sequence. We apply our technique on three datasets, comprising seven languages, showing improvements over a strong character level language model.

pdf bib
Can You Tell Me How to Get Past Sesame Street? Sentence-Level Pretraining Beyond Language Modeling
Alex Wang | Jan Hula | Patrick Xia | Raghavendra Pappagari | R. Thomas McCoy | Roma Patel | Najoung Kim | Ian Tenney | Yinghui Huang | Katherin Yu | Shuning Jin | Berlin Chen | Benjamin Van Durme | Edouard Grave | Ellie Pavlick | Samuel R. Bowman
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Natural language understanding has recently seen a surge of progress with the use of sentence encoders like ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019) which are pretrained on variants of language modeling. We conduct the first large-scale systematic study of candidate pretraining tasks, comparing 19 different tasks both as alternatives and complements to language modeling. Our primary results support the use language modeling, especially when combined with pretraining on additional labeled-data tasks. However, our results are mixed across pretraining tasks and show some concerning trends: In ELMo’s pretrain-then-freeze paradigm, random baselines are worryingly strong and results vary strikingly across target tasks. In addition, fine-tuning BERT on an intermediate task often negatively impacts downstream transfer. In a more positive trend, we see modest gains from multitask training, suggesting the development of more sophisticated multitask and transfer learning techniques as an avenue for further research.

pdf bib
Misspelling Oblivious Word Embeddings
Aleksandra Piktus | Necati Bora Edizel | Piotr Bojanowski | Edouard Grave | Rui Ferreira | Fabrizio Silvestri
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

In this paper we present a method to learn word embeddings that are resilient to misspellings. Existing word embeddings have limited applicability to malformed texts, which contain a non-negligible amount of out-of-vocabulary words. We propose a method combining FastText with subwords and a supervised task of learning misspelling patterns. In our method, misspellings of each word are embedded close to their correct variants. We train these embeddings on a new dataset we are releasing publicly. Finally, we experimentally show the advantages of this approach on both intrinsic and extrinsic NLP tasks using public test sets.

2018

pdf bib
Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion
Armand Joulin | Piotr Bojanowski | Tomas Mikolov | Hervé Jégou | Edouard Grave
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Continuous word representations learned separately on distinct languages can be aligned so that their words become comparable in a common space. Existing works typically solve a quadratic problem to learn a orthogonal matrix aligning a bilingual lexicon, and use a retrieval criterion for inference. In this paper, we propose an unified formulation that directly optimizes a retrieval criterion in an end-to-end fashion. Our experiments on standard benchmarks show that our approach outperforms the state of the art on word translation, with the biggest improvements observed for distant language pairs such as English-Chinese.

pdf bib
Advances in Pre-Training Distributed Word Representations
Tomas Mikolov | Edouard Grave | Piotr Bojanowski | Christian Puhrsch | Armand Joulin
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Learning Word Vectors for 157 Languages
Edouard Grave | Piotr Bojanowski | Prakhar Gupta | Armand Joulin | Tomas Mikolov
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Colorless Green Recurrent Networks Dream Hierarchically
Kristina Gulordava | Piotr Bojanowski | Edouard Grave | Tal Linzen | Marco Baroni
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Recurrent neural networks (RNNs) achieved impressive results in a variety of linguistic processing tasks, suggesting that they can induce non-trivial properties of language. We investigate to what extent RNNs learn to track abstract hierarchical syntactic structure. We test whether RNNs trained with a generic language modeling objective in four languages (Italian, English, Hebrew, Russian) can predict long-distance number agreement in various constructions. We include in our evaluation nonsensical sentences where RNNs cannot rely on semantic or lexical cues (“The colorless green ideas I ate with the chair sleep furiously”), and, for Italian, we compare model performance to human intuitions. Our language-model-trained RNNs make reliable predictions about long-distance agreement, and do not lag much behind human performance. We thus bring support to the hypothesis that RNNs are not just shallow-pattern extractors, but they also acquire deeper grammatical competence.

2017

pdf bib
Enriching Word Vectors with Subword Information
Piotr Bojanowski | Edouard Grave | Armand Joulin | Tomas Mikolov
Transactions of the Association for Computational Linguistics, Volume 5

Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram; words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.

pdf bib
Bag of Tricks for Efficient Text Classification
Armand Joulin | Edouard Grave | Piotr Bojanowski | Tomas Mikolov
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.

2015

pdf bib
A convex and feature-rich discriminative approach to dependency grammar induction
Édouard Grave | Noémie Elhadad
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

pdf bib
A convex relaxation for weakly supervised relation extraction
Édouard Grave
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
A Markovian approach to distributional semantics with application to semantic compositionality
Édouard Grave | Guillaume Obozinski | Francis Bach
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

pdf bib
Hidden Markov tree models for semantic class induction
Édouard Grave | Guillaume Obozinski | Francis Bach
Proceedings of the Seventeenth Conference on Computational Natural Language Learning