2024
Efficient Citer: Tuning Large Language Models for Enhanced Answer Quality and Verification
Marzieh Tahaei | Aref Jafari | Ahmad Rashid | David Alfonso-Hermelo | Khalil Bibi | Yimeng Wu | Ali Ghodsi | Boxing Chen | Mehdi Rezagholizadeh
Findings of the Association for Computational Linguistics: NAACL 2024
In recent years, there has been growing interest in using external knowledge to reduce hallucinations in large language models (LLMs) and provide them with updated information. Despite this improvement, a major challenge lies in the lack of explicit citations, which hampers verification of the information these models generate. This paper focuses on providing models with citation capabilities efficiently. By constructing a dataset of citations, we train two model architectures: an FID-style FLAN-T5 model for efficient answer composition and a 13B model known for its success in instruction following after tuning. Fluency, correctness, and citation quality are evaluated through human assessment and the newly introduced Automatic LLMs’ Citation Evaluation (ALCE) benchmark. Results demonstrate significant improvements in answer quality and efficiency, surpassing the popular ChatGPT on some of the metrics. The models exhibit exceptional out-of-domain generalization in both human and automatic evaluation. Notably, the FID-style FLAN-T5 model with only 3B parameters performs impressively compared to the 13B model.
CHARP: Conversation History AwaReness Probing for Knowledge-grounded Dialogue Systems
Abbas Ghaddar | David Alfonso-Hermelo | Philippe Langlais | Mehdi Rezagholizadeh | Boxing Chen | Prasanna Parthasarathi
Findings of the Association for Computational Linguistics: ACL 2024
In this work, we dive deep into FaithDial, a popular knowledge-grounded dialogue benchmark focused on faithfulness. We show that a significant portion of the FaithDial data contains annotation artifacts, which may bias models towards completely ignoring the conversation history. We therefore introduce CHARP, a testbed designed for evaluating supposedly non-hallucinatory models trained on the FaithDial dataset. Our extensive analysis reveals that models perform poorly on CHARP primarily because they fail to effectively attend to and reason over the conversation history. Furthermore, the evaluation methods of FaithDial fail to capture these shortcomings, neglecting the conversational history. Our findings indicate that there is substantial room for contribution in both dataset creation and hallucination evaluation for knowledge-grounded dialogue, and that CHARP can serve as a tool for monitoring progress in this particular research area. Data, models, and source code will be publicly available upon acceptance.
“Knowing When You Don’t Know”: A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation
Nandan Thakur | Luiz Bonifacio | Crystina Zhang | Odunayo Ogundepo | Ehsan Kamalloo | David Alfonso-Hermelo | Xiaoguang Li | Qun Liu | Boxing Chen | Mehdi Rezagholizadeh | Jimmy Lin
Findings of the Association for Computational Linguistics: EMNLP 2024
Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations. However, prior work lacks a comprehensive evaluation across different language families, making it challenging to evaluate LLM robustness against errors in externally retrieved knowledge. To overcome this, we establish NoMIRACL, a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages. NoMIRACL includes both a non-relevant and a relevant subset. Queries in the non-relevant subset contain passages judged as non-relevant, whereas queries in the relevant subset include at least one judged relevant passage. We measure relevance assessment using: (i) hallucination rate, measuring the model's tendency to hallucinate when the answer is not present in the passages of the non-relevant subset, and (ii) error rate, measuring the model's inability to recognize relevant passages in the relevant subset. We observe that most models struggle to balance the two capacities. Models such as LLAMA-2 and Orca-2 achieve over 88% hallucination rate on the non-relevant subset. Mistral and LLAMA-3 hallucinate less but can reach up to a 74.9% error rate on the relevant subset. Overall, GPT-4 provides the best tradeoff on both subsets, highlighting the future work necessary to improve LLM robustness. The NoMIRACL dataset and evaluation code are available at: https://github.com/project-miracl/nomiracl.
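The two relevance-assessment metrics are simple ratios over the two subsets. The minimal sketch below illustrates them, assuming each model prediction has been reduced to either an answer string or an explicit abstention; the field names are illustrative assumptions rather than the official schema, which lives in the linked repository.

```python
# Minimal sketch of the NoMIRACL-style metrics described above.
# Field names ("prediction", "no_answer") are illustrative assumptions;
# see https://github.com/project-miracl/nomiracl for the actual evaluation code.

def hallucination_rate(non_relevant_examples):
    """Share of non-relevant-subset queries where the model answers anyway
    instead of abstaining (e.g., saying it does not know)."""
    hallucinated = sum(1 for ex in non_relevant_examples
                       if ex["prediction"] != "no_answer")
    return hallucinated / len(non_relevant_examples)

def error_rate(relevant_examples):
    """Share of relevant-subset queries where the model abstains despite at
    least one passage being judged relevant."""
    missed = sum(1 for ex in relevant_examples
                 if ex["prediction"] == "no_answer")
    return missed / len(relevant_examples)

if __name__ == "__main__":
    non_relevant = [{"prediction": "no_answer"}, {"prediction": "Paris"}]
    relevant = [{"prediction": "no_answer"}, {"prediction": "Paris"}]
    print(hallucination_rate(non_relevant))  # 0.5
    print(error_rate(relevant))              # 0.5
```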
EUROPA: A Legal Multilingual Keyphrase Generation Dataset
Olivier Salaün | Frédéric Piedboeuf | Guillaume Le Berre | David Alfonso-Hermelo | Philippe Langlais
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Keyphrase generation has primarily been explored within the context of academic research articles, with a particular focus on scientific domains and the English language. In this work, we present EUROPA, a novel dataset for multilingual keyphrase generation in the legal domain. It is derived from legal judgments from the Court of Justice of the European Union (EU), and contains instances in all 24 EU official languages. We run multilingual models on our corpus and analyze the results, showing room for improvement on a domain-specific multilingual corpus such as the one we present.
EWEK-QA: Enhanced Web and Efficient Knowledge Graph Retrieval for Citation-based Question Answering Systems
Mohammad Dehghan | Mohammad Alomrani | Sunyam Bagga | David Alfonso-Hermelo | Khalil Bibi | Abbas Ghaddar | Yingxue Zhang | Xiaoguang Li | Jianye Hao | Qun Liu | Jimmy Lin | Boxing Chen | Prasanna Parthasarathi | Mahdi Biparva | Mehdi Rezagholizadeh
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Citation-based QA systems are gaining attention, especially in generative AI search applications. The knowledge provided to these systems is vital for both accuracy (completeness of information) and efficiency (extracting the information in a timely manner). In this regard, citation-based QA systems suffer from two shortcomings. First, they usually rely only on the web as a source of extracted knowledge, and adding other external knowledge sources can hamper the efficiency of the system. Second, web-retrieved content is usually obtained by simple heuristics such as fixed-length chunks or breakpoints, which can split information into pieces. To mitigate these issues, we propose an enhanced web and efficient knowledge graph (KG) retrieval solution (EWEK-QA) to enrich the content of the extracted knowledge fed to the system. This is done by designing an adaptive web retriever and incorporating KG triples in an efficient manner. We demonstrate the effectiveness of EWEK-QA over open-source state-of-the-art (SoTA) web-based and KG baseline models using a comprehensive set of quantitative and human evaluation experiments. Our model improves the web-retriever baseline in terms of extracting more relevant passages (>20%), answer-span coverage (>25%), and self-containment (>35%); it also obtains and integrates KG triples into its pipeline very efficiently (avoiding any LLM calls), significantly outperforming the web-only and KG-only SoTA baselines on 7 quantitative QA tasks and in our human evaluation.
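As a rough illustration of the kind of pipeline the abstract describes, the sketch below merges retrieved web passages with KG triples obtained by a plain lookup (no LLM calls) before building a citation-based QA context. The retriever and lookup functions are hypothetical placeholders, not the paper's actual components.

```python
# Illustrative sketch only: assemble web passages and KG triples into a
# citation-based QA context. `retrieve_web_passages` and `lookup_kg_triples`
# are hypothetical stand-ins for an adaptive web retriever and a direct,
# LLM-free KG lookup; they are not the paper's implementation.
from typing import Callable, List, Tuple

def build_qa_context(question: str,
                     retrieve_web_passages: Callable[[str], List[str]],
                     lookup_kg_triples: Callable[[str], List[Tuple[str, str, str]]],
                     max_passages: int = 5) -> str:
    passages = retrieve_web_passages(question)[:max_passages]
    triples = lookup_kg_triples(question)

    lines = []
    for i, passage in enumerate(passages, start=1):
        lines.append(f"[{i}] {passage}")          # numbered so the answer can cite it
    for s, p, o in triples:
        lines.append(f"[KG] {s} {p} {o}")          # verbalized triple, no LLM call needed
    lines.append(f"Question: {question}")
    lines.append("Answer with citations to the numbered sources.")
    return "\n".join(lines)
```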
2023
Evaluating Embedding APIs for Information Retrieval
Ehsan Kamalloo | Xinyu Zhang | Odunayo Ogundepo | Nandan Thakur | David Alfonso-Hermelo | Mehdi Rezagholizadeh | Jimmy Lin
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)
The ever-increasing size of language models curtails their widespread access to the community, thereby galvanizing many companies and startups into offering access to large language models through APIs. One particular API, suitable for dense retrieval, is the semantic embedding API that builds vector representations of a given text. With a growing number of APIs at our disposal, our goal in this paper is to analyze semantic embedding APIs in realistic retrieval scenarios in order to assist practitioners and researchers in finding suitable services according to their needs. Specifically, we investigate the capabilities of existing APIs on domain generalization and multilingual retrieval. For this purpose, we evaluate the embedding APIs on two standard benchmarks, BEIR and MIRACL. We find that re-ranking BM25 results using the APIs is a budget-friendly approach and is most effective on English, in contrast to the standard practice of employing them as first-stage retrievers. For non-English retrieval, re-ranking still improves the results, but a hybrid model with BM25 works best, albeit at a higher cost. We hope our work lays the groundwork for thoroughly evaluating APIs that are critical in search and, more broadly, in information retrieval.
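The budget-friendly setup described above, re-ranking BM25 candidates with an embedding API rather than using the API as a first-stage retriever, can be sketched as follows. The generic embed callable stands in for whichever provider's API returns one vector per input text; the function name and candidate depth are illustrative assumptions.

```python
# Sketch: re-rank BM25 candidates with a semantic embedding API instead of
# using the API as a first-stage retriever. `embed` is an assumed wrapper
# around a provider-specific call that returns one vector per input text.
from typing import Callable, List, Tuple
import math

def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rerank_bm25(query: str,
                bm25_hits: List[Tuple[str, str]],  # (doc_id, text) from a BM25 run
                embed: Callable[[List[str]], List[List[float]]],
                top_k: int = 10) -> List[Tuple[str, float]]:
    """Embed the query and the BM25 candidates once, then reorder the
    candidates by cosine similarity to the query embedding."""
    texts = [query] + [text for _, text in bm25_hits]
    vectors = embed(texts)
    q_vec, doc_vecs = vectors[0], vectors[1:]
    scored = [(doc_id, cosine(q_vec, d_vec))
              for (doc_id, _), d_vec in zip(bm25_hits, doc_vecs)]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]
```

Because only the top BM25 candidates are sent to the API, the number of billed embedding calls stays small, which is what makes this setup cheaper than embedding an entire corpus for first-stage dense retrieval.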
MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages
Xinyu Zhang | Nandan Thakur | Odunayo Ogundepo | Ehsan Kamalloo | David Alfonso-Hermelo | Xiaoguang Li | Qun Liu | Mehdi Rezagholizadeh | Jimmy Lin
Transactions of the Association for Computational Linguistics, Volume 11
MIRACL is a multilingual dataset for ad hoc retrieval across 18 languages that collectively encompass over three billion native speakers around the world. This resource is designed to support monolingual retrieval tasks, where the queries and the corpora are in the same language. In total, we have gathered over 726k high-quality relevance judgments for 78k queries over Wikipedia in these languages, where all annotations have been performed by native speakers hired by our team. MIRACL covers languages that are both typologically close and distant, drawn from 10 language families and 13 sub-families, with varying amounts of publicly available resources. Extensive automatic heuristic verification and manual assessments were performed during the annotation process to control data quality. In total, MIRACL represents an investment of around five person-years of human annotator effort. Our goal is to spur research on improving retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have traditionally been underserved. MIRACL is available at http://miracl.ai/.
2022
Refining an Almost Clean Translation Memory Helps Machine Translation
Shivendra Bhardwa | David Alfonso-Hermelo | Philippe Langlais | Gabriel Bernier-Colborne | Cyril Goutte | Michel Simard
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
While recent studies have been dedicated to cleaning very noisy parallel corpora to improve Machine Translation training, we focus in this work on filtering a large and mostly clean Translation Memory. This problem of practical interest has not received much consideration from the community, in contrast with, for example, filtering large web-mined parallel corpora. We experiment with an extensive, multi-domain proprietary Translation Memory and compare five approaches involving deep-, feature-, and heuristic-based solutions. We propose two ways of evaluating this task, manual annotation and resulting Machine Translation quality. We report significant gains over a state-of-the-art, off-the-shelf cleaning system, using two MT engines.
2020
Human or Neural Translation?
Shivendra Bhardwaj | David Alfonso-Hermelo | Philippe Langlais | Gabriel Bernier-Colborne | Cyril Goutte | Michel Simard
Proceedings of the 28th International Conference on Computational Linguistics
Deep neural models have tremendously improved machine translation. In this context, we investigate whether distinguishing machine from human translations is still feasible. We trained and applied 18 classifiers under two settings: a monolingual task, in which the classifier only looks at the translation; and a bilingual task, in which the source text is also taken into consideration. We report on extensive experiments involving 4 neural MT systems (Google Translate, DeepL, as well as two systems we trained) and varying the domain of texts. We show that the bilingual task is the easiest one and that transfer-based deep-learning classifiers perform best, with mean accuracies around 85% in-domain and 75% out-of-domain.