Shehzaad Dhuliawala


2024

Towards Aligning Language Models with Textual Feedback
Saüc Abadal Lloret | Shehzaad Dhuliawala | Keerthiram Murugesan | Mrinmaya Sachan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

We present ALT (ALignment with Textual feedback), an approach that aligns language models with user preferences expressed in text. We argue that text offers greater expressiveness, enabling users to provide richer feedback than simple comparative preferences, and that this richer feedback can lead to more efficient and effective alignment. ALT aligns the model by conditioning its generation on the textual feedback. Our method relies solely on language-modeling techniques and requires minimal hyper-parameter tuning, yet it retains the main benefits of RL-based algorithms and can effectively learn from textual feedback. We explore the efficacy and efficiency of textual feedback across tasks such as toxicity reduction, summarization, and dialogue response generation. We find that ALT outperforms PPO on toxicity reduction and matches its performance on summarization with only 20% of the samples. We also explore how ALT can be used with feedback provided by an existing LLM.
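A minimal sketch of the conditioning idea, under an assumed prompt template and a hypothetical lm.generate interface (the paper's exact formatting and training recipe may differ):

# Sketch of ALT-style conditioning on textual feedback.
# Template and interface names below are assumptions, not the paper's code.

def format_example(feedback: str, prompt: str, response: str) -> str:
    # Training: fit p(response | feedback, prompt) with a standard
    # language-modeling loss over this concatenation.
    return f"feedback: {feedback}\nprompt: {prompt}\nresponse: {response}"

def generate_aligned(lm, prompt: str, target_feedback: str) -> str:
    # Inference: condition on the feedback we *want* the output to earn,
    # e.g. "The response is polite and non-toxic."
    context = f"feedback: {target_feedback}\nprompt: {prompt}\nresponse:"
    return lm.generate(context)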

Chain-of-Verification Reduces Hallucination in Large Language Models
Shehzaad Dhuliawala | Mojtaba Komeili | Jing Xu | Roberta Raileanu | Xian Li | Asli Celikyilmaz | Jason Weston
Findings of the Association for Computational Linguistics: ACL 2024

The generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method, whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently, so the answers are not biased by other responses; and (iv) generates its final verified response. In experiments, we show that CoVe decreases hallucinations across a variety of tasks, from list-based questions from Wikidata to closed-book MultiSpanQA and long-form text generation.
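The four-step loop lends itself to a direct pipeline sketch. Here ask() stands in for any LLM call, and the prompts are illustrative rather than the paper's exact templates:

# Sketch of the Chain-of-Verification loop described above.

def chain_of_verification(ask, question: str) -> str:
    # (i) draft an initial response
    draft = ask(f"Answer the question: {question}")
    # (ii) plan verification questions to fact-check the draft
    plan = ask(f"List verification questions that fact-check this answer:\n{draft}")
    checks = [q for q in plan.splitlines() if q.strip()]
    # (iii) answer each verification question independently, without
    # showing the draft, so answers are not biased by it
    answers = [ask(f"Answer concisely: {q}") for q in checks]
    # (iv) generate the final, verified response
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(checks, answers))
    return ask(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Verification Q&A:\n{evidence}\n"
        "Write a corrected final answer consistent with the verification."
    )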

Implicit Personalization in Language Models: A Systematic Study
Zhijing Jin | Nils Heil | Jiarui Liu | Shehzaad Dhuliawala | Yahang Qi | Bernhard Schölkopf | Rada Mihalcea | Mrinmaya Sachan
Findings of the Association for Computational Linguistics: EMNLP 2024

Implicit Personalization (IP) is the phenomenon of a language model inferring a user’s background from implicit cues in the input prompt and tailoring its response based on this inference. While previous work has touched upon various instances of this problem, a unified framework for studying the behavior has been lacking. This work systematically studies IP through a rigorous mathematical formulation, a multi-perspective moral reasoning framework, and a set of case studies. Our theoretical foundation for IP relies on a structural causal model and introduces a novel method, indirect intervention, to estimate the causal effect of a mediator variable that cannot be directly intervened upon. Beyond the technical approach, we also introduce a set of moral reasoning principles based on three schools of moral philosophy to study when IP may or may not be ethically appropriate. Equipped with both mathematical and ethical insights, we present three diverse case studies illustrating the varied nature of the IP problem and offer recommendations for future research.
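A sketch of the indirect-intervention intuition in standard causal notation; the notation and estimand below are assumptions for illustration, not the paper's exact formulation:

% Assumed notation: X = input prompt, M = the user background the model
% infers (the mediator, which cannot be directly intervened upon),
% Y = the model response. Editing only the implicit cues in X that
% determine M induces an intervention on M, so the mediator's effect
% can be estimated as
\[
\Delta_M \;=\; \mathbb{E}\big[\,Y \mid \mathrm{do}(X = x_{m'})\,\big]
          \;-\; \mathbb{E}\big[\,Y \mid \mathrm{do}(X = x_{m})\,\big],
\]
% where x_m and x_{m'} are prompts identical except for the cues that
% set M = m versus M = m'.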

2023

A Diachronic Perspective on User Trust in AI under Uncertainty
Shehzaad Dhuliawala | Vilém Zouhar | Mennatallah El-Assady | Mrinmaya Sachan
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In human-AI collaboration, users typically form a mental model of the AI system, which captures their beliefs about when the system performs well and when it does not. The construction of this mental model is guided both by the system’s veracity and by the system output presented to the user, e.g., the system’s confidence and an explanation for the prediction. However, modern NLP systems are seldom calibrated and are often confidently incorrect in their predictions, which violates users’ mental models and erodes their trust. In this work, we design a study in which users bet on the correctness of an NLP system, and use it to examine how user trust evolves in response to these trust-eroding events and how it is rebuilt over time afterwards. We find that even a few highly inaccurate confidence estimates are enough to damage users’ trust in the system and the collaboration’s performance, and that this trust does not easily recover over time. We further find that users are more forgiving of the system when it is unconfidently correct than when it is confidently incorrect, even though, from a game-theoretic perspective, their payoff is equivalent. Finally, we find that each user can entertain multiple mental models of the system depending on the type of question. These results highlight the importance of confidence calibration in developing user-centered NLP applications, to avoid damaging user trust and compromising collaboration performance.

Poor Man’s Quality Estimation: Predicting Reference-Based MT Metrics Without the Reference
Vilém Zouhar | Shehzaad Dhuliawala | Wangchunshu Zhou | Nico Daheim | Tom Kocmi | Yuchen Eleanor Jiang | Mrinmaya Sachan
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Machine translation quality estimation (QE) predicts human judgements of a translation hypothesis without seeing the reference. State-of-the-art QE systems based on pretrained language models achieve remarkable correlations with human judgements, yet they are computationally heavy and require human annotations, which are slow and expensive to create. To address these limitations, we define the problem of metric estimation (ME), in which one predicts automated metric scores, likewise without the reference. We show that, even without access to the reference, our model can estimate automated metrics at the sentence level (ρ = 60% for BLEU, ρ = 51% for other metrics). Because automated metrics correlate with human judgements, we can leverage the ME task to pre-train a QE model. For the QE task, we find that pre-training on TER is better (ρ = 23%) than training from scratch (ρ = 20%).
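A minimal sketch of the ME setup, assuming sentence embeddings from some pretrained encoder; the architecture and names are illustrative, not the paper's code:

# Regress an automated metric score (e.g., sentence-level BLEU) from the
# source and hypothesis alone, with no reference.

import torch
import torch.nn as nn

class MetricEstimator(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.head = nn.Linear(2 * dim, 1)  # regression head

    def forward(self, src_emb: torch.Tensor, hyp_emb: torch.Tensor):
        # src_emb / hyp_emb: embeddings from any pretrained sentence encoder.
        return self.head(torch.cat([src_emb, hyp_emb], dim=-1)).squeeze(-1)

# Pre-train on (source, hypothesis, metric score) triples with MSE loss;
# the same model can then be fine-tuned on human judgements for QE.
loss_fn = nn.MSELoss()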

Extracting Victim Counts from Text
Mian Zhong | Shehzaad Dhuliawala | Niklas Stoehr
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Decision-makers in the humanitarian sector rely on timely and exact information during crisis events. Knowing how many civilians were injured during an earthquake is vital to allocating aid properly. Information about such victim counts is, however, often only available within full-text event descriptions from newspapers and other reports. Extracting numbers from text is challenging: numbers come in different formats and may require numeric reasoning, which renders purely tagging-based approaches insufficient. As a consequence, fine-grained counts of injured, displaced, or abused victims beyond fatalities often go unextracted and remain unseen. We cast victim count extraction as a question answering (QA) task with a regression or classification objective. We compare regex, dependency parsing, and semantic role labeling approaches with advanced text-to-text models. Beyond model accuracy, we analyze extraction reliability and robustness, which are key for this sensitive task. In particular, we discuss model calibration and investigate out-of-distribution and few-shot performance. Ultimately, we make a comprehensive recommendation on which model to select for different desiderata and data domains. Our work is among the first to apply numeracy-focused large language models to a real-world use case with a positive impact.
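A sketch of the regression variant of the QA formulation, with illustrative names; the log-count transform is an assumption, not necessarily the paper's choice:

import torch
import torch.nn as nn

class CountRegressor(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, pooled: torch.Tensor):
        # `pooled`: encoder representation of a QA-style input such as
        # "How many people were injured? [SEP] <event description>"
        return self.head(pooled).squeeze(-1)

def count_loss(pred_log_count, true_count):
    # Regressing log(1 + count) tames the long tail of very large counts.
    return nn.functional.mse_loss(pred_log_count, torch.log1p(true_count.float()))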

2022

Calibration of Machine Reading Systems at Scale
Shehzaad Dhuliawala | Leonard Adolphs | Rajarshi Das | Mrinmaya Sachan
Findings of the Association for Computational Linguistics: ACL 2022

In typical machine learning systems, an estimate of the probability of the prediction is used to assess the system’s confidence in that prediction. This confidence measure is usually uncalibrated; i.e., the system’s confidence in the prediction does not match the true probability of the predicted output. In this paper, we present an investigation into calibrating open-setting machine reading systems, such as open-domain question answering and claim verification systems. We show that calibrating such complex systems, which contain discrete retrieval and deep reading components, is challenging, and that current calibration techniques fail to scale to these settings. We propose simple extensions to existing calibration approaches that allow us to adapt them to these settings. Our experimental results reveal that the approach works well and can be used to selectively predict answers when question answering systems are posed with unanswerable or out-of-training-distribution questions.
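One simple extension in this spirit, sketched with illustrative interfaces: fit a calibrator on features from both the retriever and the reader, then answer selectively. This is a sketch of the idea, not the paper's exact calibrator:

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_calibrator(retrieval_scores, reader_scores, was_correct):
    # Features from both pipeline components, labels = answer correctness.
    X = np.column_stack([retrieval_scores, reader_scores])
    return LogisticRegression().fit(X, was_correct)

def answer_selectively(calibrator, retrieval_score, reader_score,
                       answer, threshold=0.5):
    p = calibrator.predict_proba([[retrieval_score, reader_score]])[0, 1]
    # Abstain on likely-unanswerable or out-of-distribution questions.
    return answer if p >= threshold else None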

TopiOCQA: Open-domain Conversational Question Answering with Topic Switching
Vaibhav Adlakha | Shehzaad Dhuliawala | Kaheer Suleman | Harm de Vries | Siva Reddy
Transactions of the Association for Computational Linguistics, Volume 10

In a conversational question answering scenario, a questioner seeks to extract information about a topic through a series of interdependent questions and answers. As the conversation progresses, they may switch to related topics, a phenomenon commonly observed in information-seeking search sessions. However, current datasets for conversational question answering are limited in two ways: 1) they do not contain topic switches; and 2) they assume the reference text for the conversation is given, i.e., the setting is not open-domain. We introduce TopiOCQA (pronounced Tapioca), an open-domain conversational dataset with topic switches based on Wikipedia. TopiOCQA contains 3,920 conversations with information-seeking questions and free-form answers. On average, a conversation in our dataset spans 13 question-answer turns and involves four topics (documents). TopiOCQA poses a challenging test-bed for models, where efficient retrieval is required over multiple turns of the same conversation, in conjunction with constructing valid responses from the conversational history. We evaluate several baselines by combining state-of-the-art document retrieval methods with neural reader models. Our best model achieves an F1 of 55.8, falling short of human performance by 14.2 points, indicating the difficulty of our dataset. Our dataset and code are available at https://mcgill-nlp.github.io/topiocqa.
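A sketch of the generic retrieve-then-read baseline recipe for this setting, with hypothetical retriever and reader interfaces (the released baseline code may differ):

def retrieve_then_read(retriever, reader, history, question, k=20):
    # History-aware query: earlier turns disambiguate pronouns and
    # topic switches (e.g., "What about its capital?").
    query = " [SEP] ".join([*history, question])
    passages = retriever.search(query, k=k)   # dense or sparse retrieval
    return reader.answer(question, passages)  # free-form answer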

2019

Chains-of-Reasoning at TextGraphs 2019 Shared Task: Reasoning over Chains of Facts for Explainable Multi-hop Inference
Rajarshi Das | Ameya Godbole | Manzil Zaheer | Shehzaad Dhuliawala | Andrew McCallum
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)

This paper describes our submission to the shared task on “Multi-hop Inference Explanation Regeneration” at the TextGraphs workshop at EMNLP 2019 (Jansen and Ustalov, 2019). Our system identifies chains of facts relevant to explaining the answer to an elementary science examination question. To counter the problem of ‘spurious chains’ leading to ‘semantic drift’, we train a ranker that uses contextualized representations of facts to score their relevance for explaining the answer to a question. Our system was ranked first on the mean average precision (MAP) metric, outperforming the second-best system by 14.95 points.
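One way such a contextualized ranker could be applied, sketched with a hypothetical scorer(context, fact) model; the greedy chain construction here is an assumption for illustration, not necessarily the submitted system's inference procedure:

def rank_facts(scorer, question, answer, candidates):
    chain, remaining = [], list(candidates)
    while remaining:
        context = f"{question} {answer} " + " ".join(chain)
        # Contextualized relevance: a fact's score depends on the chain
        # built so far, not on the question alone, which counters drift.
        best = max(remaining, key=lambda fact: scorer(context, fact))
        chain.append(best)
        remaining.remove(best)
    return chain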

2016

SlangNet: A WordNet like resource for English Slang
Shehzaad Dhuliawala | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a WordNet-like structured resource for slang words and neologisms on the internet. The dynamism of language often means that language technology tools trained on today’s data may be unable to process the language of the future. Our resource could be used (1) to augment WordNet, and (2) in several Natural Language Processing (NLP) applications that make use of noisy data from the internet, such as Information Retrieval and Web Mining. Such a resource can also be used to distinguish slang word senses from conventional word senses. To stimulate similar innovations widely in the NLP community, we test the efficacy of our resource for detecting slang usage with standard bag-of-words Word Sense Disambiguation (WSD) algorithms (Lesk and Extended Lesk) on English data from the internet.
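For reference, a minimal sketch of the simplified Lesk algorithm used in such evaluations, where a resource like SlangNet would supply the glosses for slang senses:

def lesk(context_words, senses):
    """senses: list of (sense_id, gloss) pairs; returns the best sense_id."""
    context = set(w.lower() for w in context_words)
    def overlap(gloss):
        # Bag-of-words overlap between the context and the sense gloss.
        return len(context & set(gloss.lower().split()))
    return max(senses, key=lambda s: overlap(s[1]))[0]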

A picture is worth a thousand words: Using OpenClipArt library for enriching IndoWordNet
Diptesh Kanojia | Shehzaad Dhuliawala | Pushpak Bhattacharyya
Proceedings of the 8th Global WordNet Conference (GWC)

WordNet has proved immensely useful for Word Sense Disambiguation, and thence for Machine Translation, Information Retrieval, and Question Answering. It can also be used as a dictionary for educational purposes. The semantic nature of concepts in a WordNet motivates one to express their meaning in a more visual way. In this paper, we describe our work on enriching IndoWordNet with images acquired from the OpenClipArt library. We describe an approach used to enrich WordNets for eighteen Indian languages. Our contribution is threefold: (1) we develop a system which, given a synset in English, finds an appropriate image for it; the system retrieves images from the OpenClipArt library (OCAL) and ranks them. (2) After retrieving the images, we map the results along the linkages between the Princeton WordNet and the Hindi WordNet to link several synsets to corresponding images, choosing and sorting the top three images per synset based on our ranking heuristic. (3) We develop a tool that allows a lexicographer to manually evaluate these images: the tool shows the top images for the task of choosing the best image representation, and the lexicographer also selects the number of relevant images. Using our system, we obtain an Average Precision (P@3) score of 0.30.

2015

TransChat: Cross-Lingual Instant Messaging for Indian Languages
Diptesh Kanojia | Shehzaad Dhuliawala | Abhijit Mishra | Naman Gupta | Pushpak Bhattacharyya
Proceedings of the 12th International Conference on Natural Language Processing

Judge a Book by its Cover: Conservative Focused Crawling under Resource Constraints
Shehzaad Dhuliawala | Arjun Atreya V | Ravi Kumar Yadav | Pushpak Bhattacharyya
Proceedings of the 12th International Conference on Natural Language Processing