Jingshu Liu


2024

pdf bib
Améliorer la traduction au niveau du document grâce au sur-echantillage négatif et au masquage ciblé
Gaëtan Caillaut | Mariam Nakhlé | Jingshu Liu | Raheel Qader
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

Ces travaux visent à améliorer les capacités des systèmes de traduction automatique à tenir compte du contexte dans lequel se trouve la phrase source, et donc, ultimement, à améliorer les performances globales des systèmes de traduction automatique. L’approche que nous proposons repose uniquement sur les données et la manière dont elles sont fournies au modèle durant l’entraînement et est complètement agnostique de l’architecture du modèle. Nous montrons que les performances des modèles de traduction, sur la paire en-fr, peuvent être améliorées simplement en fournissant des données plus pertinentes vis-à-vis de la tâche cible, et ce sans modifier ni complexifier les architectures existantes, en particulier l’architecture Transformer couramment utilisée par les systèmes de TAL modernes. Pour ce faire, nous présentons deux stratégies d’augmentation de données (sur-échantillonnage négatif et masquage ciblé) conçues pour inciter le modèle à s’appuyer sur le contexte. Nous montrons, au travers de métriques appropriées, que ces méthodes permettent d’améliorer les performances des systèmes de traduction sans pour autant modifier ni l’architecture du modèle, ni le processus d’entraînement.

pdf bib
Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task
Gaëtan Caillaut | Mariam Nakhlé | Raheel Qader | Jingshu Liu | Jean-Gabriel Barthélemy
Proceedings of the Ninth Conference on Machine Translation

Recent studies have showcased remarkable capabilities of decoder-only models in many NLP tasks, including translation. Yet, the machine translation field has been largely dominated by encoder-decoder models based on the Transformer architecture. As a consequence, scaling laws of encoder-decoder models for neural machine translation have already been well studied, but decoder-only models have received less attention.This work explores the scaling laws of decoder-only models on the multilingual and multidomain translation task. We trained a collection of six decoder-only models, ranging from 70M to 7B parameters, on a sentence-level, multilingual (8 languages) and multidomain (9 domains) dataset. We conducted a series of experiments showing that the loss of decoder-only models can be estimated using a scaling law similar to the one discovered for large language models, but we also show that this scaling law has difficulties to generalize to too large models or to a different data distribution. We also study different scaling methods and show that scaling the depth and the width of a model lead to similar test loss improvements, but with different impact on the model’s efficiency.

2023

pdf bib
Large Language Model Adaptation for Financial Sentiment Analysis
Pau Rodriguez Inserte | Mariam Nakhlé | Raheel Qader | Gaetan Caillaut | Jingshu Liu
Proceedings of the Sixth Workshop on Financial Technology and Natural Language Processing

Natural language processing (NLP) has recently gained relevance within financial institutions by providing highly valuable insights into companies and markets’ financial documents. However, the landscape of the financial domain presents extra challenges for NLP, due to the complexity of the texts and the use of specific terminology. Generalist language models tend to fall short in tasks specifically tailored for finance, even when using large language models (LLMs) with great natural language understanding and generative capabilities. This paper presents a study on LLM adaptation methods targeted at the financial domain and with high emphasis on financial sentiment analysis. To this purpose, two foundation models with less than 1.5B parameters have been adapted using a wide range of strategies. We show that through careful fine-tuning on both financial documents and instructions, these foundation models can be adapted to the target domain. Moreover, we observe that small LLMs have comparable performance to larger scale models, while being more efficient in terms of parameters and data. In addition to the models, we show how to generate artificial instructions through LLMs to augment the number of samples of the instruction dataset.

pdf bib
Lingua Custodia’s Participation at the WMT 2023 Terminology Shared Task
Jingshu Liu | Mariam Nakhlé | Gaëtan Caillout | Raheel Qadar
Proceedings of the Eighth Conference on Machine Translation

This paper presents Lingua Custodia’s submission to the WMT23 shared task on Terminology shared task. Ensuring precise translation of technical terms plays a pivotal role in gauging the final quality of machine translation results. Our goal is to follow the terminology constraint while applying the machine translation system. Inspired by the recent work of terminology control, we propose to annotate the machine learning training data by leveraging a synthetic dictionary extracted in a fully non supervised way from the give parallel corpora. The model learned with this training data can then be then used to translate text with a given terminology in a flexible manner. In addition, we introduce a careful annotated data re-sampling step in order to guide the model to see different terminology types enough times. In this task we consider all the three language directions: Chinese to English, English to Czech and German to English. Our automatic evaluation metrics with the submitted systems show the effectiveness of the proposed method.

2022

pdf bib
Encouraging Neural Machine Translation to Satisfy Terminology Constraints.
Melissa Ailem | Jingshu Liu | Raheel Qader
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Encouraging Neural Machine Translation to Satisfy Terminology Constraints. We present a new approach to encourage neural machine translation to satisfy lexical constraints. Our method acts at the training step and thereby avoiding the introduction of any extra computational overhead at inference step. The proposed method combines three main ingredients. The first one consists in augmenting the training data to specify the constraints. Intuitively, this encourages the model to learn a copy behavior when it encounters constraint terms. Compared to previous work, we use a simplified augmentation strategy without source factors. The second ingredient is constraint token masking, which makes it even easier for the model to learn the copy behavior and generalize better. The third one, is a modification of the standard cross entropy loss to bias the model towards assigning high probabilities to constraint words. Empirical results show that our method improves upon related baselines in terms of both BLEU score and the percentage of generated constraint terms.

pdf bib
Lingua Custodia’s Participation at the WMT 2022 Word-Level Auto-completion Shared Task
Melissa Ailem | Jingshu Liu | Jean-gabriel Barthelemy | Raheel Qader
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper presents Lingua Custodia’s submission to the WMT22 shared task on Word Level Auto-completion (WLAC). We consider two directions, namely German-English and English-German.The WLAC task in Neural Machine Translation (NMT) consists in predicting a target word given few human typed characters, the source sentence to translate, as well as some translation context. Inspired by recent work in terminology control, we propose to treat the human typed sequence as a constraint to predict the right word starting by the latter. To do so, the source side of the training data is augmented with both the constraints and the translation context. In addition, following new advances in WLAC, we use a joint optimization strategy taking into account several types of translation context. The automatic as well as human accuracy obtained with the submitted systems show the effectiveness of the proposed method.

2021

pdf bib
Encouraging Neural Machine Translation to Satisfy Terminology Constraints
Melissa Ailem | Jingshu Liu | Raheel Qader
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Lingua Custodia’s Participation at the WMT 2021 Machine Translation Using Terminologies Shared Task
Melissa Ailem | Jingshu Liu | Raheel Qader
Proceedings of the Sixth Conference on Machine Translation

This paper describes Lingua Custodia’s submission to the WMT21 shared task on machine translation using terminologies. We consider three directions, namely English to French, Russian, and Chinese. We rely on a Transformer-based architecture as a building block, and we explore a method which introduces two main changes to the standard procedure to handle terminologies. The first one consists in augmenting the training data in such a way as to encourage the model to learn a copy behavior when it encounters terminology constraint terms. The second change is constraint token masking, whose purpose is to ease copy behavior learning and to improve model generalization. Empirical results show that our method satisfies most terminology constraints while maintaining high translation quality.

2020

pdf bib
BERT-XML: Large Scale Automated ICD Coding Using BERT Pretraining
Zachariah Zhang | Jingshu Liu | Narges Razavian
Proceedings of the 3rd Clinical Natural Language Processing Workshop

ICD coding is the task of classifying and cod-ing all diagnoses, symptoms and proceduresassociated with a patient’s visit. The process isoften manual, extremely time-consuming andexpensive for hospitals as clinical interactionsare usually recorded in free text medical notes. In this paper, we propose a machine learningmodel, BERT-XML, for large scale automatedICD coding of EHR notes, utilizing recentlydeveloped unsupervised pretraining that haveachieved state of the art performance on a va-riety of NLP tasks. We train a BERT modelfrom scratch on EHR notes, learning with vo-cabulary better suited for EHR tasks and thusoutperform off-the-shelf models. We furtheradapt the BERT architecture for ICD codingwith multi-label attention. We demonstratethe effectiveness of BERT-based models on thelarge scale ICD code classification task usingmillions of EHR notes to predict thousands ofunique codes.

2018

pdf bib
Alignement de termes de longueur variable en corpus comparables spécialisés (Alignment of variable length terms in specialized comparable corpora)
Jingshu Liu | Emmanuel Morin | Sebastián Peña Saldarriaga
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

Nous proposons dans cet article une adaptation de l’approche compositionnelle étendue capable d’aligner des termes de longueurs variables à partir de corpus comparables, en modifiant la représentation des termes complexes. Nous proposons également de nouveaux modes de pondération pour l’approche standard qui améliorent les résultats des approches état de l’art pour les termes simples et complexes en domaine de spécialité.

pdf bib
Towards a unified framework for bilingual terminology extraction of single-word and multi-word terms
Jingshu Liu | Emmanuel Morin | Peña Saldarriaga
Proceedings of the 27th International Conference on Computational Linguistics

Extracting a bilingual terminology for multi-word terms from comparable corpora has not been widely researched. In this work we propose a unified framework for aligning bilingual terms independently of the term lengths. We also introduce some enhancements to the context-based and the neural network based approaches. Our experiments show the effectiveness of our enhancements of previous works and the system can be adapted in specialized domains.