Hervé Déjean

Also published as: H. Dejean, Herve Dejean


2024

Retrieval-augmented generation in multilingual settings
Nadezhda Chirkova | David Rau | Hervé Déjean | Thibault Formal | Stéphane Clinchant | Vassilina Nikoulina
Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024)

Retrieval-augmented generation (RAG) has recently emerged as a promising solution for incorporating up-to-date or domain-specific knowledge into large language models (LLMs) and improving LLM factuality, but it is predominantly studied in English-only settings. In this work, we consider RAG in the multilingual setting (mRAG), i.e., with user queries and the datastore in 13 languages, and investigate which components, and which adjustments to them, are needed to build a well-performing mRAG pipeline that can be used as a strong baseline in future work. Our findings highlight that, despite the availability of high-quality off-the-shelf multilingual retrievers and generators, task-specific prompt engineering is needed to enable generation in user languages. Moreover, current evaluation metrics need adjustments for the multilingual setting to account for variations in the spelling of named entities. The main limitations to be addressed in future work include frequent code-switching in languages with non-Latin alphabets, occasional fluency errors, misreading of the provided documents, and irrelevant retrieval. We release the code for the resulting mRAG baseline pipeline at https://github.com/naver/bergen; documentation is available at https://github.com/naver/bergen/blob/main/documentations/multilingual.md.

BERGEN: A Benchmarking Library for Retrieval-Augmented Generation
David Rau | Hervé Déjean | Nadezhda Chirkova | Thibault Formal | Shuai Wang | Stéphane Clinchant | Vassilina Nikoulina
Findings of the Association for Computational Linguistics: EMNLP 2024

Retrieval-Augmented Generation (RAG) makes it possible to enhance Large Language Models (LLMs) with external knowledge. In response to the recent popularity of generative LLMs, many RAG approaches have been proposed, which involve a wide range of configurations, such as evaluation datasets, collections, metrics, retrievers, and LLMs. Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline. In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research that standardizes RAG experiments. In an extensive study focusing on QA, we benchmark different state-of-the-art retrievers, rerankers, and LLMs. Additionally, we analyze existing RAG metrics and datasets.

2020

Vital Records: Uncover the past from historical handwritten records
Herve Dejean | Jean-Luc Meunier
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

We present Vital Records, a demonstrator based on deep-learning approaches to handwritten-text recognition, table processing and information extraction, which enables data from century-old documents to be parsed and analysed, making it possible to explore death records in space and time. This demonstrator provides a user interface for browsing and visualising data extracted from 80,000 handwritten pages of tabular data.

2004

A Geometric View on Bilingual Lexicon Extraction from Comparable Corpora
Eric Gaussier | J.M. Renders | I. Matveeva | C. Goutte | H. Dejean
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

2003

Reducing Parameter Space for Word Alignment
Herve Dejean | Eric Gaussier | Cyril Goutte | Kenji Yamada
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

2002

Combining Labelled and Unlabelled Data: A Case Study on Fisher Kernels and Transductive Inference for Biological Entity Recognition
Cyril Goutte | Hervé Déjean | Eric Gaussier | Nicola Cancedda | Jean-Michel Renders
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

An Approach Based on Multilingual Thesauri and Model Combination for Bilingual Lexicon Extraction
Hervé Déjean | Éric Gaussier | Fatiha Sadat
COLING 2002: The 19th International Conference on Computational Linguistics

2001

Introduction to the CoNLL-2001 shared task: clause identification
Erik F. Tjong Kim Sang | Hervé Déjean
Proceedings of the ACL 2001 Workshop on Computational Natural Language Learning (ConLL)

Learning Computational Grammars
John Nerbonne | Anja Belz | Nicola Cancedda | Hervé Déjean | James Hammerton | Rob Koeling | Stasinos Konstantopoulos | Miles Osborne | Franck Thollard | Erik F. Tjong Kim Sang
Proceedings of the ACL 2001 Workshop on Computational Natural Language Learning (ConLL)

Using ALLiS for clausing
Hervé Déjean
Proceedings of the ACL 2001 Workshop on Computational Natural Language Learning (ConLL)

2000

ALLiS: a Symbolic Learning System for Natural Language Learning
Hervé Déjean
Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop

Learning Syntactic Structures with XML
Hervé Déjean
Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop

How To Evaluate and Compare Tagsets? A Proposal
Hervé Déjean
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Theory Refinement and Natural Language Learning
Herve Dejean
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

Applying System Combination to Base Noun Phrase Identification
Erik F. Tjong Kim Sang | Walter Daelemans | Herve Dejean | Rob Koeling | Yuval Krymolowski | Vasin Punyakanok | Dan Roth
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

1998

Morphemes as Necessary Concept for Structures Discovery from Untagged Corpora
Herve Dejean
New Methods in Language Processing and Computational Natural Language Learning