Shubhi Tyagi

2025

HALLUCANA: Fixing LLM Hallucination with A Canary Lookahead
Tianyi Li | Erenay Dayanik | Shubhi Tyagi | Andrea Pierleoni
Findings of the Association for Computational Linguistics: NAACL 2025

In this paper, we present HALLUCANA, a canary lookahead to detect and correct factual hallucinations of Large Language Models (LLMs) in long-form generation. HALLUCANA detects and intervenes as soon as traces of hallucination emerge, during and even before generation. To support timely detection, we exploit the internal factuality representation in the LLM hidden space, where we investigate various proxies to the LLMs’ factuality self-assessment, and discuss its relation to the models’ context familiarity from their pre-training. On biography generation, our method improves generation quality by up to 2.5x, while consuming over 6 times less compute.

2024

pdf bib abs

REXEL: An End-to-end Model for Document-Level Relation Extraction and Entity Linking
Nacime Bouziani | Shubhi Tyagi | Joseph Fisher | Jens Lehmann | Andrea Pierleoni
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)

Extracting structured information from unstructured text is critical for many downstream NLP applications and is traditionally achieved by closed information extraction (cIE). However, existing approaches for cIE suffer from two limitations: (i) they are often pipelines which makes them prone to error propagation, and/or (ii) they are restricted to sentence level which prevents them from capturing long-range dependencies and results in expensive inference time. We address these limitations by proposing REXEL, a highly efficient and accurate model for the joint task of document level cIE (DocIE). REXEL performs mention detection, entity typing, entity disambiguation, coreference resolution and document-level relation classification in a single forward pass to yield facts fully linked to a reference knowledge graph. It is on average 11 times faster than competitive existing approaches in a similar setting and performs competitively both when optimised for any of the individual sub-task and a variety of combinations of different joint tasks, surpassing the baselines by an average of more than 6 F1 points. The combination of speed and accuracy makes REXEL an accurate cost-efficient system for extracting structured information at web-scale. We also release an extension of the DocRED dataset to enable benchmarking of future work on DocIE, which will be available at https://github.com/amazon-science/e2e-docie.

2022

pdf bib abs

ReFinED: An Efficient Zero-shot-capable Approach to End-to-End Entity Linking
Tom Ayoola | Shubhi Tyagi | Joseph Fisher | Christos Christodoulopoulos | Andrea Pierleoni
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track

We introduce ReFinED, an efficient end-to-end entity linking model which uses fine-grained entity types and entity descriptions to perform linking. The model performs mention detection, fine-grained entity typing, and entity disambiguation for all mentions within a document in a single forward pass, making it more than 60 times faster than competitive existing approaches. ReFinED also surpasses state-of-the-art performance on standard entity linking datasets by an average of 3.7 F1. The model is capable of generalising to large-scale knowledge bases such as Wikidata (which has 15 times more entities than Wikipedia) and of zero-shot entity linking. The combination of speed, accuracy and scale makes ReFinED an effective and cost-efficient system for extracting entities from web-scale datasets, for which the model has been successfully deployed.

2021

pdf bib abs

Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems
Shubhi Tyagi | Antonio Bonafonte | Jaime Lorenzo-Trueba | Javier Latorre
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers

Developing Text Normalization (TN) systems for Text-to-Speech (TTS) on new languages is hard. We propose a novel architecture to facilitate it for multiple languages while using data less than 3% of the size of the data used by the state of the art results on English. We treat TN as a sequence classification problem and propose a granular tokenization mechanism that enables the system to learn majority of the classes and their normalizations from the training data itself. This is further combined with minimal precoded linguistic knowledge for other classes. We publish the first results on TN for TTS in Spanish and Tamil and also demonstrate that the performance of the approach is comparable with the previous work done on English. All annotated datasets used for experimentation will be released.

Co-authors

Christos Christodoulopoulos 1

Jaime Lorenzo-Trueba 1

Venues

NAACL3
Findings1

Fix author