Piotr Bojanowski


pdf bib
Adaptive Attention Span in Transformers
Sainbayar Sukhbaatar | Edouard Grave | Piotr Bojanowski | Armand Joulin
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to extend significantly the maximum context size used in Transformer, while maintaining control over their memory footprint and computational time. We show the effectiveness of our approach on the task of character level language modeling, where we achieve state-of-the-art performances on text8 and enwiki8 by using a maximum context of 8k characters.

pdf bib
Training Hybrid Language Models by Marginalizing over Segmentations
Edouard Grave | Sainbayar Sukhbaatar | Piotr Bojanowski | Armand Joulin
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this paper, we study the problem of hybrid language modeling, that is using models which can predict both characters and larger units such as character ngrams or words. Using such models, multiple potential segmentations usually exist for a given string, for example one using words and one using characters only. Thus, the probability of a string is the sum of the probabilities of all the possible segmentations. Here, we show how it is possible to marginalize over the segmentations efficiently, in order to compute the true probability of a sequence. We apply our technique on three datasets, comprising seven languages, showing improvements over a strong character level language model.

pdf bib
Misspelling Oblivious Word Embeddings
Aleksandra Piktus | Necati Bora Edizel | Piotr Bojanowski | Edouard Grave | Rui Ferreira | Fabrizio Silvestri
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

In this paper we present a method to learn word embeddings that are resilient to misspellings. Existing word embeddings have limited applicability to malformed texts, which contain a non-negligible amount of out-of-vocabulary words. We propose a method combining FastText with subwords and a supervised task of learning misspelling patterns. In our method, misspellings of each word are embedded close to their correct variants. We train these embeddings on a new dataset we are releasing publicly. Finally, we experimentally show the advantages of this approach on both intrinsic and extrinsic NLP tasks using public test sets.


pdf bib
Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion
Armand Joulin | Piotr Bojanowski | Tomas Mikolov | Hervé Jégou | Edouard Grave
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Continuous word representations learned separately on distinct languages can be aligned so that their words become comparable in a common space. Existing works typically solve a quadratic problem to learn a orthogonal matrix aligning a bilingual lexicon, and use a retrieval criterion for inference. In this paper, we propose an unified formulation that directly optimizes a retrieval criterion in an end-to-end fashion. Our experiments on standard benchmarks show that our approach outperforms the state of the art on word translation, with the biggest improvements observed for distant language pairs such as English-Chinese.

pdf bib
Colorless Green Recurrent Networks Dream Hierarchically
Kristina Gulordava | Piotr Bojanowski | Edouard Grave | Tal Linzen | Marco Baroni
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Recurrent neural networks (RNNs) achieved impressive results in a variety of linguistic processing tasks, suggesting that they can induce non-trivial properties of language. We investigate to what extent RNNs learn to track abstract hierarchical syntactic structure. We test whether RNNs trained with a generic language modeling objective in four languages (Italian, English, Hebrew, Russian) can predict long-distance number agreement in various constructions. We include in our evaluation nonsensical sentences where RNNs cannot rely on semantic or lexical cues (“The colorless green ideas I ate with the chair sleep furiously”), and, for Italian, we compare model performance to human intuitions. Our language-model-trained RNNs make reliable predictions about long-distance agreement, and do not lag much behind human performance. We thus bring support to the hypothesis that RNNs are not just shallow-pattern extractors, but they also acquire deeper grammatical competence.

pdf bib
Advances in Pre-Training Distributed Word Representations
Tomas Mikolov | Edouard Grave | Piotr Bojanowski | Christian Puhrsch | Armand Joulin
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Learning Word Vectors for 157 Languages
Edouard Grave | Piotr Bojanowski | Prakhar Gupta | Armand Joulin | Tomas Mikolov
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


pdf bib
Enriching Word Vectors with Subword Information
Piotr Bojanowski | Edouard Grave | Armand Joulin | Tomas Mikolov
Transactions of the Association for Computational Linguistics, Volume 5

Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram; words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.

pdf bib
Bag of Tricks for Efficient Text Classification
Armand Joulin | Edouard Grave | Piotr Bojanowski | Tomas Mikolov
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.