Pretrained Language Models (PLMs) are the de facto backbone of most state-of-the-art NLP systems. In this paper, we introduce a family of domain-specific pretrained PLMs for French, focusing on three important domains: transcribed speech, medicine, and law. We use a transformer architecture based on efficient methods (LinFormer) to maximise their utility, since these domains often involve processing long documents. We evaluate and compare our models to state-of-the-art models on a diverse set of tasks and datasets, some of which are introduced in this paper. We gather the datasets into a new French-language evaluation benchmark for these three domains. We also compare various training configurations: continued pretraining, pretraining from scratch, as well as single- and multi-domain pretraining. Extensive domain-specific experiments show that it is possible to attain competitive downstream performance even when pre-training with the approximative LinFormer attention mechanism. For full reproducibility, we release the models and pretraining data, as well as contributed datasets.
This paper describes work in progress on Visualisation tools to foster collaborations between translators and computational scientists. We aim to describe how visualisation features can be used to explain translation and NMT outputs. We tested several visualisation functionalities with three NMT models based on Chinese-English, Spanish-English and French-English language pairs. We created three demos containing different visualisation tools and analysed them within the framework of performance-explainability, focusing on the translator’s perspective.
This paper describes MAKE-NMTViz, a project designed to help translators visualize neural machine translation outputs using explainable artificial intelligence visualization tools initially developed for computer vision.
L’apprentissage auto-supervisé (SSL) a fait ses preuves pour le traitement automatique de la parole mais est généralement très consommateur de données, de mémoire et de ressources matérielles. L’approche BEST-RQ (BERT-based Speech pre-Training with Random-projection Quantizer) est une approche SSL performante en reconnaissance automatique de la parole (RAP), plus efficiente que wav2vec 2.0. L’article original de Google qui introduit BEST-RQ manque de détails, comme le nombre d’heures de GPU/TPU utilisées pour le pré-entraînement et il n’existe pas d’implémentation open-source facile à utiliser. De plus, BEST-RQ n’a pas été évalué sur d’autres tâches que la RAP et la traduction de la parole. Dans cet article, nous décrivons notre implémentation open-source de BEST-RQ et réalisons une première étude en le comparant à wav2vec 2.0 sur quatre tâches. Nous montrons que BERT-RQ peut atteindre des performances similaires à celles de wav2vec 2.0 tout en réduisant le temps d’apprentissage d’un facteur supérieur à deux.
Les modèles de langue préentraînés (PLM) constituent aujourd’hui de facto l’épine dorsale de la plupart des systèmes de traitement automatique des langues. Dans cet article, nous présentons Jargon, une famille de PLMs pour des domaines spécialisés du français, en nous focalisant sur trois domaines : la parole transcrite, le domaine clinique / biomédical, et le domaine juridique. Nous utilisons une architecture de transformeur basée sur des méthodes computationnellement efficaces(LinFormer) puisque ces domaines impliquent souvent le traitement de longs documents. Nous évaluons et comparons nos modèles à des modèles de l’état de l’art sur un ensemble varié de tâches et de corpus d’évaluation, dont certains sont introduits dans notre article. Nous rassemblons les jeux de données dans un nouveau référentiel d’évaluation en langue française pour ces trois domaines. Nous comparons également diverses configurations d’entraînement : préentraînement prolongé en apprentissage autosupervisé sur les données spécialisées, préentraînement à partir de zéro, ainsi que préentraînement mono et multi-domaines. Nos expérimentations approfondies dans des domaines spécialisés montrent qu’il est possible d’atteindre des performances compétitives en aval, même lors d’un préentraînement avec le mécanisme d’attention approximatif de LinFormer. Pour une reproductibilité totale, nous publions les modèles et les données de préentraînement, ainsi que les corpus utilisés.
This paper describes the MAKE-NMTVIZ Systems trained for the WMT 2023 Literary task. As a primary submission, we used Train, Valid1, test1 as part of the GuoFeng corpus (Wang et al., 2023) to fine-tune the mBART50 model with Chinese-English data. We followed very similar training parameters to (Lee et al. 2022) when fine-tuning mBART50. We trained for 3 epochs, using gelu as an activation function, with a learning rate of 0.05, dropout of 0.1 and a batch size of 16. We decoded using a beam search of size 5. For our contrastive1 submission, we implemented a fine-tuned concatenation transformer (Lupo et al., 2023). The training was developed in two steps: (i) a sentence-level transformer was implemented for 10 epochs trained using general, test1, and valid1 data (more details in contrastive2 system); (ii) second, we fine-tuned at document-level using 3-sentence concatenation for 4 epochs using train, test2, and valid2 data. During the fine-tuning, we used ReLU as an activation function, with an inverse square root learning rate, dropout of 0.1, and a batch size of 64. We decoded using a beam search of size. Four our contrastive2 and last submission, we implemented a sentence-level transformer model (Vaswani et al., 2017). The model was trained with general data for 10 epochs using general-purpose, test1, and valid 1 data. The training parameters were an inverse square root scheduled learning rate, a dropout of 0.1, and a batch size of 64. We decoded using a beam search of size 4. We then compared the three translation outputs from an interdisciplinary perspective, investigating some of the effects of sentence- vs document-based training. Computer scientists, translators and corpus linguists discussed the linguistic remaining issues for this discourse-level literary translation.
Context-aware translation can be achieved by processing a concatenation of consecutive sentences with the standard Transformer architecture. This paper investigates the intuitive idea of providing the model with explicit information about the position of the sentences contained in the concatenation window. We compare various methods to encode sentence positions into token representations, including novel methods. Our results show that the Transformer benefits from certain sentence position encoding methods on English to Russian translation, if trained with a context-discounted loss. However, the same benefits are not observed on English to German. Further empirical efforts are necessary to define the conditions under which the proposed approach is beneficial.
In this paper we present the final result of a project focused on Tunisian Arabic encoded in Arabizi, the Latin-based writing system for digital conversations. The project led to the realization of two integrated and independent tools: a linguistic corpus and a neural network architecture created to annotate the former with various levels of linguistic information (code-switching classification, transliteration, tokenization, POS-tagging, lemmatization). We discuss the choices made in terms of computational and linguistic methodology and the strategies adopted to improve our results. We report on the experiments performed in order to outline our research path. Finally, we explain the reasons why we believe in the potential of these tools for both computational and linguistic researches.
Les approches de compréhension automatique de la parole ont récemment bénéficié de l’apport de modèles préappris par autosupervision sur de gros corpus de parole. Pour le français, le projet LeBenchmark a rendu disponibles de tels modèles et a permis des évolutions impressionnantes sur plusieurs tâches dont la compréhension automatique de la parole. Ces avancées ont un coût non négligeable en ce qui concerne le temps de calcul et la consommation énergétique. Dans cet article, nous comparons plusieurs stratégies d’apprentissage visant à réduire le coût énergétique tout en conservant des performances compétitives. Les expériences sont effectuées sur le corpus MEDIA, et montrent qu’il est possible de réduire significativement le coût d’apprentissage tout en conservant des performances à l’état de l’art.
Multi-encoder models are a broad family of context-aware neural machine translation systems that aim to improve translation quality by encoding document-level contextual information alongside the current sentence. The context encoding is undertaken by contextual parameters, trained on document-level data. In this work, we discuss the difficulty of training these parameters effectively, due to the sparsity of the words in need of context (i.e., the training signal), and their relevant context. We propose to pre-train the contextual parameters over split sentence pairs, which makes an efficient use of the available data for two reasons. Firstly, it increases the contextual training signal by breaking intra-sentential syntactic relations, and thus pushing the model to search the context for disambiguating clues more frequently. Secondly, it eases the retrieval of relevant context, since context segments become shorter. We propose four different splitting methods, and evaluate our approach with BLEU and contrastive test sets. Results show that it consistently improves learning of contextual parameters, both in low and high resource settings.
A straightforward approach to context-aware neural machine translation consists in feeding the standard encoder-decoder architecture with a window of consecutive sentences, formed by the current sentence and a number of sentences from its context concatenated to it. In this work, we propose an improved concatenation approach that encourages the model to focus on the translation of the current sentence, discounting the loss generated by target context. We also propose an additional improvement that strengthen the notion of sentence boundaries and of relative sentence distance, facilitating model compliance to the context-discounted objective. We evaluate our approach with both average-translation quality metrics and contrastive test sets for the translation of inter-sentential discourse phenomena, proving its superiority to the vanilla concatenation approach and other sophisticated context-aware systems.
TArC : Incrementally and Semi-Automatically Collecting a Tunisian arabish Corpus This article describes the collection process of the first morpho-syntactically annotated Tunisian arabish Corpus (TArC). Arabish is a spontaneous coding of Arabic Dialects (AD) in Latin characters and arithmographs (numbers used as letters). This code-system was developed by Arabic-speaking users of social media in order to facilitate the communication on digital devices. Arabish differs for each Arabic dialect and each arabish code-system is under-resourced. In the last few years, the attention of NLP on AD has considerably increased. TArC will be thus a useful support for different types of analyses, as well as for NLP tools training. In this article we will describe preliminary work on the TArC semi-automatic construction process and some of the first analyses on the corpus. In order to provide a complete overview of the challenges faced during the building process, we will present the main Tunisian dialect characteristics and its encoding in Tunisian arabish.
In this paper we propose a multi-task sequence prediction system, based on recurrent neural networks and used to annotate on multiple levels an Arabizi Tunisian corpus. The annotation performed are text classification, tokenization, PoS tagging and encoding of Tunisian Arabizi into CODA* Arabic orthography. The system is learned to predict all the annotation levels in cascade, starting from Arabizi input. We evaluate the system on the TIGER German corpus, suitably converting data to have a multi-task problem, in order to show the effectiveness of our neural architecture. We show also how we used the system in order to annotate a Tunisian Arabizi corpus, which has been afterwards manually corrected and used to further evaluate sequence models on Tunisian data. Our system is developed for the Fairseq framework, which allows for a fast and easy use for any other sequence prediction problem.
This article describes the constitution process of the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC). Arabish, also known as Arabizi, is a spontaneous coding of Arabic dialects in Latin characters and “arithmographs” (numbers used as letters). This code-system was developed by Arabic-speaking users of social media in order to facilitate the writing in the Computer-Mediated Communication (CMC) and text messaging informal frameworks. Arabish differs for each Arabic dialect and each Arabish code-system is under-resourced, in the same way as most of the Arabic dialects. In the last few years, the attention of NLP studies on Arabic dialects has considerably increased. Taking this into consideration, TArC will be a useful support for different types of analyses, computational and linguistic, as well as for NLP tools training. In this article we will describe preliminary work on the TArC semi-automatic construction process and some of the first analyses we developed on TArC. In addition, in order to provide a complete overview of the challenges faced during the building process, we will present the main Tunisian dialect characteristics and its encoding in Tunisian Arabish.
Nous proposons une architecture neuronale avec les caractéristiques principales des modèles neuronaux de ces dernières années : les réseaux neuronaux récurrents bidirectionnels, les modèles encodeur-décodeur, et le modèle Transformer. Nous évaluons nos modèles sur trois tâches d’étiquetage de séquence, avec des résultats aux environs de l’état de l’art et souvent meilleurs, montrant ainsi l’intérêt de cette architecture hybride pour ce type de tâches.
Depuis quelques années les réseaux neuronaux se montrent très efficaces dans toutes les tâches de Traitement Automatique des Langues (TAL). Récemment, une variante de réseau neuronal particulièrement adapté à l’étiquetage de séquences textuelles a été proposée, utilisant des représentations distributionnelles des étiquettes. Dans cet article, nous reprenons cette variante et nous l’améliorons avec une version profonde. Dans cette version, différentes couches cachées permettent de prendre en compte séparément les différents types d’informations données en entrée au réseau. Nous évaluons notre modèle sur les mêmes tâches que la première version de réseau de laquelle nous nous sommes inspirés. Les résultats montrent que notre variante de réseau neuronal est plus efficace que les autres, mais aussi qu’elle est plus efficace que tous les autres modèles évalués sur ces tâches, obtenant l’état-de-l’art.
Dans cet article, nous proposons un modèle pour détecter dans les textes générés par des utilisateurs (en particulier les tweets), les mots non-standards à corriger. Nous utilisons pour cela des réseaux de neurones convolutifs au niveau des caractères, associés à des “plongements” (embeddings) des mots présents dans le contexte du mot courant. Nous avons utilisé pour l’évaluation trois corpus de référence. Nous avons testé différents modèles qui varient suivant leurs plongements pré-entrainés, leurs configurations et leurs optimisations. Nous avons finalement obtenu une F1-mesure de 0.972 en validation croisée pour la classe des mots non-standards. Cette détection des mots à corriger est l’étape préliminaire pour la normalisation des textes non standards comme les tweets.
Cet article présente trois expériences de détection de mentions dans un corpus de français oral : ANCOR. Ces expériences utilisent des outils préexistants d’analyse syntaxique du français et des méthodes issues de travaux sur la coréférence, les anaphores et la détection d’entités nommées. Bien que ces outils ne soient pas optimisés pour le traitement de l’oral, la qualité de la détection des mentions que nous obtenons est comparable à l’état de l’art des systèmes conçus pour l’écrit dans d’autres langues. Nous concluons en proposant des perspectives pour l’amélioration des résultats que nous obtenons et la construction d’un système end-to-end pour lequel nos expériences peuvent servir de base de travail.
In this paper we explain how we created a labelled corpus in English for a Named Entity Recognition (NER) task from multi-source and multi-domain data, for an industrial partner. We explain the specificities of this corpus with examples and describe some baseline experiments. We present some results of domain adaptation on this corpus using a labelled Twitter corpus (Ritter et al., 2011). We tested a semi-supervised method from (Garcia-Fernandez et al., 2014) combined with a supervised domain adaptation approach proposed in (Raymond and Fayolle, 2010) for machine learning experiments with CRFs (Conditional Random Fields). We use the same technique to improve the NER results on the Twitter corpus (Ritter et al., 2011). Our contributions thus consist in an industrial corpus creation and NER performance improvements.
Dans cet article nous étudions plusieurs types de réseaux neuronaux récurrents (RNN) pour l’étiquetage de séquences. Nous proposons deux nouvelles variantes de RNN et nous les comparons aux variantes plus classiques de type Jordan et Elman. Nous expliquons en détails quels sont les avantages de nos nouvelles variantes par rapport aux autres RNN. Nous évaluons tous les modèles, les nouvelles variantes ainsi que les RNN existants, sur deux tâches de compréhension de la parole : ATIS et MEDIA. Les résultats montrent que nos nouvelles variantes de RNN sont plus efficaces que les autres.
The work presented in this article takes place in the field of opinion mining and aims more particularly at finding the polarity of a text by relying on machine learning methods. In this context, it focuses on studying various strategies for adapting a statistical classifier to a new domain when training data only exist for one or several other domains. This study shows more precisely that a self-training procedure consisting in enlarging the initial training corpus with texts from the target domain that were reliably classified by the classifier is the most successful and stable strategy for the tested domains. Moreover, this strategy gets better results in most cases than (Blitzer et al., 2007)’s method on the same evaluation corpus while it is more simple.
In this paper we deal with named entity detection on data acquired via OCR process on documents dating from 1890. The resulting corpus is very noisy. We perform an analysis to find possible strategies to overcome errors introduced by the OCR process. We propose a preprocessing procedure in three steps to clean data and correct, at least in part, OCR mistakes. The task is made even harder by the complex tree-structure of named entities annotated on data, we solve this problem however by adopting an effective named entity detection system we proposed in previous work. We evaluate our procedure for preprocessing OCR-ized data in two ways: in terms of perplexity and OOV rate of a language model on development and evaluation data, and in terms of the performance of the named entity detection system on the preprocessed data. The preprocessing procedure results to be effective, allowing to improve by a large margin the system we proposed for the official evaluation campaign on Old Press, and allowing to outperform also the best performing system of the evaluation campaign.