2024
Memory-Efficient Fine-Tuning of Transformers via Token Selection
Antoine Simoulin | Namyong Park | Xiaoyi Liu | Grey Yang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Fine-tuning provides an effective means to specialize pre-trained models for various downstream tasks. However, fine-tuning often incurs high memory overhead, especially for large transformer-based models such as LLMs. While existing methods may reduce certain parts of the memory required for fine-tuning, they still require caching all intermediate activations computed in the forward pass to update weights during the backward pass. In this work, we develop TokenTune, a method to reduce memory usage, specifically the memory to store intermediate activations, in the fine-tuning of transformer-based models. During the backward pass, TokenTune approximates the gradient computation by backpropagating through just a subset of input tokens. Thus, with TokenTune, only a subset of intermediate activations is cached during the forward pass. TokenTune can also be easily combined with existing methods like LoRA, further reducing the memory cost. We evaluate our approach on pre-trained transformer models with up to billions of parameters, considering the performance on multiple downstream tasks such as text classification and question answering in a few-shot learning setup. Overall, TokenTune achieves performance on par with full fine-tuning or representative memory-efficient fine-tuning methods, while greatly reducing the memory footprint, especially when combined with other methods with complementary memory reduction mechanisms. We hope that our approach will facilitate the fine-tuning of large transformers, whether to specialize them for specific domains or to co-train them with other neural components of a larger system. Our code is available at https://github.com/facebookresearch/tokentune.
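Below is a minimal PyTorch sketch of the core idea, not the authors' implementation (the official code is in the linked repository): gradients are propagated only through a randomly chosen subset of token positions by detaching the others. The function name, the uniform sampling strategy, and the tensor shapes are illustrative assumptions; the actual memory savings additionally require caching only the selected positions' activations, which this sketch does not implement.

```python
import torch

def select_and_detach(hidden_states: torch.Tensor, k: int) -> torch.Tensor:
    """Keep gradients for k randomly chosen token positions per sequence
    and detach the rest (hypothetical helper, shapes: batch x seq_len x dim)."""
    batch, seq_len, _ = hidden_states.shape
    # Randomly pick k positions per sequence (uniform sampling is an assumption).
    idx = torch.rand(batch, seq_len, device=hidden_states.device).argsort(dim=1)[:, :k]
    keep = torch.zeros(batch, seq_len, 1,
                       device=hidden_states.device, dtype=hidden_states.dtype)
    keep.scatter_(1, idx.unsqueeze(-1), 1.0)  # 1.0 marks positions that keep gradients
    # The detached copy contributes to the forward value but not to the backward
    # graph, so gradients flow only through the k selected token positions.
    return keep * hidden_states + (1.0 - keep) * hidden_states.detach()
```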
Les représentations contextuelles stéréotypées dans les modèles de langue français : mieux les identifier pour ne pas les reproduire
Léandre Adam-Cuvillier | Pierre-Jean Larpin | Antoine Simoulin
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position
We present a study to better identify how stereotypes are reflected in French language models. We adapt the StereoSet dataset to French and follow the same experimental protocol as the one used for English. While stereotypes are known to vary across cultural and temporal contexts, our study identifies similarities with the results observed for English, in particular regarding the correlation between the models' linguistic capabilities and the presence of measurable bias. We extend our study by examining similar neural network architectures pre-trained on different corpora. Our results highlight the crucial impact of the pre-training data on the biases observed in French models. Moreover, we observe that using multilingual corpora for pre-training can have a positive effect on mitigating these biases.
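For reference, a small sketch of the StereoSet-style scores this protocol relies on, following the definitions of the original English StereoSet paper: a language modeling score (lms), a stereotype score (ss), and the combined ICAT score. The numbers in the usage example are purely illustrative.

```python
def icat(lms: float, ss: float) -> float:
    """Idealized CAT score: highest when the model is both fluent (high lms)
    and unbiased (ss close to 50, i.e. no preference for stereotypical over
    anti-stereotypical completions). Both inputs are percentages."""
    return lms * min(ss, 100.0 - ss) / 50.0

# Illustrative example: a fluent model (lms = 90) with a mild
# stereotypical preference (ss = 60) gets an ICAT score of 72.
print(icat(90.0, 60.0))  # 72.0
```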
2022
Unifying Parsing and Tree-Structured Models for Generating Sentence Semantic Representations
Antoine Simoulin | Benoit Crabbé
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop
We introduce a novel tree-based model that learns its composition function together with its structure. The architecture produces sentence embeddings by composing words according to an induced syntactic tree. The parsing and composition functions are explicitly connected and therefore learned jointly. As a result, the sentence embedding is computed according to an interpretable linguistic pattern and may be used on any downstream task. We evaluate our encoder on downstream tasks and observe that it outperforms tree-based models relying on external parsers. In some configurations, it is even competitive with the BERT base model. Our model supports multiple parser architectures. We exploit this property to conduct an ablation study comparing different parser initializations. We explore to what extent the trees produced by our model compare with linguistic structures and how this initialization impacts downstream performance. We empirically observe that downstream supervision makes it difficult to produce stable parses and to preserve linguistically relevant structures.
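As a rough illustration, here is a hedged sketch of the bottom-up composition step over a binary tree, assuming the tree is given; in the paper the tree is induced by a parser explicitly connected to, and trained jointly with, the composition function, which this sketch does not reproduce. The class name and the single-layer composition function are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TreeComposer(nn.Module):
    """Hypothetical bottom-up composer: merges word embeddings along a binary tree."""

    def __init__(self, dim: int):
        super().__init__()
        # Composition function: maps two child vectors to one parent vector.
        self.compose = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())

    def forward(self, word_embs: torch.Tensor, tree) -> torch.Tensor:
        # word_embs: (seq_len, dim); tree: nested pairs of token indices,
        # e.g. ((0, 1), (2, 3)) for a four-word sentence.
        def compose_node(node):
            if isinstance(node, int):
                return word_embs[node]
            left, right = node
            return self.compose(torch.cat([compose_node(left), compose_node(right)], dim=-1))
        return compose_node(tree)  # sentence embedding of shape (dim,)
```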
2021
How Many Layers and Why? An Analysis of the Model Depth in Transformers
Antoine Simoulin | Benoit Crabbé
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop
In this study, we investigate the role of the multiple layers in deep transformer models. We design a variant of ALBERT that dynamically adapts the number of layers applied to each token of the input. A key specificity of ALBERT is that its weights are tied across layers; the stack of encoder layers therefore repeatedly applies the same transformation function to the input. We interpret this repeated application as an iterative process in which the contextualized token representations are progressively refined. We analyze this process at the token level during pre-training, fine-tuning, and inference. We show that tokens do not require the same number of iterations and that tokens that are difficult or crucial for the task are subject to more iterations.
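A hedged sketch of applying one weight-tied encoder layer a variable number of times per token: tokens whose representation has stopped changing are frozen while the others keep iterating. The convergence-based halting rule and the hyperparameters below are illustrative assumptions, not the paper's exact mechanism.

```python
import torch
import torch.nn as nn

class AdaptiveDepthEncoder(nn.Module):
    """Hypothetical weight-tied encoder that refines each token for a variable
    number of iterations (dim must be divisible by n_heads)."""

    def __init__(self, dim: int, n_heads: int = 8, max_layers: int = 12,
                 threshold: float = 1e-2):
        super().__init__()
        # A single transformer layer reused at every iteration (weight tying).
        self.layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.max_layers = max_layers
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) token representations.
        for _ in range(self.max_layers):
            new_x = self.layer(x)
            # Per-token stopping test: freeze tokens whose representation
            # barely changes between two consecutive iterations.
            delta = (new_x - x).norm(dim=-1, keepdim=True)  # (batch, seq_len, 1)
            still_active = (delta > self.threshold).to(x.dtype)
            x = still_active * new_x + (1.0 - still_active) * x
            if not still_active.bool().any():
                break
        return x
```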
Contrasting distinct structured views to learn sentence embeddings
Antoine Simoulin | Benoit Crabbé
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
We propose a self-supervised method that builds sentence embeddings from the combination of diverse explicit syntactic structures of a sentence. We assume structure is crucial to building consistent representations, as we expect sentence meaning to be a function of both syntactic and semantic aspects. From this perspective, we hypothesize that some linguistic representations might be better adapted to a given task or sentence. We therefore propose to jointly learn individual representation functions for different syntactic frameworks. By hypothesis, all such functions should encode similar semantic information differently and should consequently be complementary for building better sentence-level semantic embeddings. To assess this hypothesis, we propose an original contrastive multi-view framework that induces an explicit interaction between models during the training phase. We run experiments combining various structures, such as dependency, constituency, and sequential schemes. Our method outperforms comparable approaches on several tasks from standard sentence embedding benchmarks.
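A hedged sketch of a contrastive multi-view objective of this kind, assuming two sentence encoders (for instance a dependency-based one and a sequential one) that embed the same batch of sentences; the symmetric InfoNCE-style loss and the temperature value are assumptions, and the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def multiview_contrastive_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                               temperature: float = 0.05) -> torch.Tensor:
    """z_a, z_b: (batch, dim) embeddings of the same sentences from two
    structural views; other sentences in the batch serve as negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric cross-entropy: view A must retrieve its counterpart in view B
    # and vice versa, inducing an explicit interaction between the two encoders.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```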
Un modèle Transformer Génératif Pré-entrainé pour le français (Generative Pre-trained Transformer in French: we introduce a French adaptation of the well-known GPT model)
Antoine Simoulin | Benoit Crabbé
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale
We propose a French adaptation of the well-known Generative Pre-trained Transformer (GPT) model. GPT belongs to the family of transformer architectures that have significantly reshaped natural language processing methods. These architectures are pre-trained on self-supervised tasks and are therefore specific to a given language. While some are available in French, most are released primarily for English. GPT is particularly effective for text generation tasks and can be applied to a wide range of use cases. Its distinctive generative properties make it usable in original settings such as zero-shot learning, which requires no update of the model weights and no modification of the architecture.
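As an illustration of the zero-shot usage mentioned above (no weight update, no architecture change), here is a minimal prompting sketch using the Hugging Face transformers library; the checkpoint name is an assumption and should be replaced with the identifier of the released GPT-fr model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "asi/gpt-fr-cased-small"  # assumed checkpoint identifier, adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Zero-shot setting: the model is conditioned on a prompt and generates a
# continuation without any fine-tuning or architectural change.
prompt = "Question : Quelle est la capitale de la France ?\nRéponse :"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```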