Yashar Mehdad - ACL Anthology

Yashar Mehdad

2025

Agent-in-the-Loop: A Data Flywheel for Continuous Improvement in LLM-based Customer Support
Cen Zhao | Tiantian Zhang | Hanchen Su | Yufeng Zhang | Shaowei Su | Mingzhi Xu | Yu Liu | Wei Han | Jeremy Werner | Claire Na Cheng | Yashar Mehdad
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

We introduce an Agent-in-the-Loop (AITL) framework that implements a continuous data flywheel for iteratively improving an LLM-based customer support system. Unlike standard offline approaches that rely on batch annotations, AITL integrates four key types of annotations directly into live customer operations: (1) pairwise response preferences, (2) agent adoption and rationales, (3) knowledge relevance checks, and (4) identification of missing knowledge. These feedback signals seamlessly feed back into models’ updates, reducing retraining cycles from months to weeks. Our production pilot involving US-based customer support agents demonstrated significant improvements in retrieval accuracy (+11.7% recall@75, +14.8% precision@8), generation quality (+8.4% helpfulness) and agent adoption rates (+4.5%). These results underscore the effectiveness of embedding human feedback loops directly into operational workflows to continuously refine LLM-based customer support system.

LLM-Friendly Knowledge Representation for Customer Support
Hanchen Su | Wei Luo | Yashar Mehdad | Wei Han | Elaine Liu | Wayne Zhang | Mia Zhao | Joy Zhang
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

We propose a practical approach by integrating Large Language Models (LLMs) with a framework designed to navigate the complexities of Airbnb customer support operations. In this paper, our methodology employs a novel reformatting technique, the Intent, Context, and Action (ICA) format, which transforms policies and workflows into a structure more comprehensible to LLMs. Additionally, we develop a synthetic data generation strategy to create training data with minimal human intervention, enabling cost-effective fine-tuning of our model. Our internal experiments (not applied to Airbnb products) demonstrate that our approach of restructuring workflows and fine-tuning LLMs with synthetic data significantly enhances their performance, setting a new benchmark for their application in customer support. Our solution is not only cost-effective but also improves customer support, as evidenced by both accuracy and manual processing time evaluation metrics.

Incremental Summarization for Customer Support via Progressive Note-Taking and Agent Feedback
Yisha Wu | Cen Zhao | Yuanpei Cao | Xiaoqing Xu | Yashar Mehdad | Mindy Ji | Claire Na Cheng
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

We introduce an incremental summarization system for customer support agents that intelligently determines when to generate concise bullet notes during conversations, reducing agents’ cognitive load and redundant review. Our approach combines a fine-tuned Mixtral-8×7B model for continuous note generation with a DeBERTa-based classifier to filter trivial content. Agent edits refine the online notes generation and regularly inform offline model retraining, closing the agent edits feedback loop. Deployed in production, our system achieved a 3% reduction in case handling time compared to bulk summarization (with reductions of up to 9% in highly complex cases), alongside high agent satisfaction ratings from surveys. These results demonstrate that incremental summarization with continuous feedback effectively enhances summary quality and agent productivity at scale.

2024

We present an effective recipe to train strong long-context LLMs that are capable of utilizing massive context windows of up to 32,000 tokens. Our models are built through continual pretraining from Llama 2 checkpoints with longer text sequences and on a dataset where long texts are upsampled. We perform extensive evaluation using language modeling, synthetic context probing tasks, and a wide range of downstream benchmarks. Across all evaluations, our models achieve consistent improvements on most regular-context tasks and significant improvements on long-context tasks over Llama 2. Moreover, with a cost-effective instruction tuning procedure that is free of expensive annotation, the presented models can already surpass gpt-3.5-turbo-16k‘s overall performance on long-context benchmarks. Alongside these results, we provide an in-depth analysis on each individual component of our method. We delve into Llama’s position encodings and discuss its key limitation in modeling long data. We examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths – ablation results suggest that having abundant long texts in the pretrain dataset is not the key to achieving strong performance, and we empirically verify that long context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences.

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
Zechun Liu | Barlas Oguz | Changsheng Zhao | Ernie Chang | Pierre Stock | Yashar Mehdad | Yangyang Shi | Raghuraman Krishnamoorthi | Vikas Chandra
Findings of the Association for Computational Linguistics: ACL 2024

Several post-training quantization methods have been applied to large language models (LLMs), and have been shown to perform well down to 8-bits. We find that these methods break down at lower bit precision, and investigate quantization-aware training for LLMs (LLM-QAT) to push quantization levels even further. We propose a data-free distillation method that leverages generations produced by the pre-trained model, which better preserves the original output distribution and allows quantizing any generative model independent of its training data, similar to post-training quantization methods. In addition to quantizing weights and activations, we also quantize the KV cache, which is critical for increasing throughput and supporting long sequence dependencies at current model sizes. We experiment with LLaMA models of sizes 7B, 13B, and 30B, at quantization levels down to 4-bits. We observe large improvements over training-free methods, especially in the low-bit settings.

2023

How to Train Your Dragon: Diverse Augmentation Towards Generalizable Dense Retrieval
Sheng-Chieh Lin | Akari Asai | Minghan Li | Barlas Oguz | Jimmy Lin | Yashar Mehdad | Wen-tau Yih | Xilun Chen
Findings of the Association for Computational Linguistics: EMNLP 2023

Various techniques have been developed in recent years to improve dense retrieval (DR), such as unsupervised contrastive learning and pseudo-query generation. Existing DRs, however, often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval, which some argue was due to the limited model capacity. We contradict this hypothesis and show that a generalizable DR can be trained to achieve high accuracy in both supervised and zero-shot retrieval without increasing model size. In particular, we systematically examine the contrastive learning of DRs, under the framework of Data Augmentation (DA). Our study shows that common DA practices such as query augmentation with generative models and pseudo-relevance label creation using a cross-encoder, are often inefficient and sub-optimal. We hence propose a new DA approach with diverse queries and sources of supervision to progressively train a generalizable DR. As a result, DRAGON, our Dense Retriever trained with diverse AuGmentatiON, is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations and even competes with models using more complex late interaction.

Adapting Pretrained Text-to-Text Models for Long Text Sequences
Wenhan Xiong | Anchit Gupta | Shubham Toshniwal | Yashar Mehdad | Scott Yih
Findings of the Association for Computational Linguistics: EMNLP 2023

We present an empirical study of adapting an existing pretrained text-to-text model for long-sequence inputs. Through a comprehensive study along three axes of the pretraining pipeline – model architecture, optimization objective, and pretraining corpus, we propose an effective recipe to build long-context models from existing short-context models. Specifically, we replace the full attention in transformers with pooling-augmented blockwise attention, and pretrain the model with a masked-span prediction task with spans of varying lengths. In terms of the pretraining corpus, we find that using randomly concatenated short-documents from a large open-domain corpus results in better performance than using existing long document corpora, which are typically limited in their domain coverage. With these findings, we build a long-context model that achieves competitive performance on long-text QA tasks and establishes the new state of the art on five long-text summarization datasets, often outperforming previous methods with larger model sizes.

CITADEL: Conditional Token Interaction via Dynamic Lexical Routing for Efficient and Effective Multi-Vector Retrieval
Minghan Li | Sheng-Chieh Lin | Barlas Oguz | Asish Ghoshal | Jimmy Lin | Yashar Mehdad | Wen-tau Yih | Xilun Chen
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multi-vector retrieval methods combine the merits of sparse (e.g. BM25) and dense (e.g. DPR) retrievers and have achieved state-of-the-art performance on various retrieval tasks. These methods, however, are orders of magnitude slower and need much more space to store their indices compared to their single-vector counterparts. In this paper, we unify different multi-vector retrieval models from a token routing viewpoint and propose conditional token interaction via dynamic lexical routing, namely CITADEL, for efficient and effective multi-vector retrieval.CITADEL learns to route different token vectors to the predicted lexical keys such that a query token vector only interacts with document token vectors routed to the same key. This design significantly reduces the computation cost while maintaining high accuracy. Notably, CITADEL achieves the same or slightly better performance than the previous state of the art, ColBERT-v2, on both in-domain (MS MARCO) and out-of-domain (BEIR) evaluations, while being nearly 40 times faster. Source code and data are available at https://github.com/facebookresearch/dpr-scale/tree/citadel.

2022

Domain-matched Pre-training Tasks for Dense Retrieval
Barlas Oguz | Kushal Lakhotia | Anchit Gupta | Patrick Lewis | Vladimir Karpukhin | Aleksandra Piktus | Xilun Chen | Sebastian Riedel | Scott Yih | Sonal Gupta | Yashar Mehdad
Findings of the Association for Computational Linguistics: NAACL 2022

Pre-training on larger datasets with ever increasing model size isnow a proven recipe for increased performance across almost all NLP tasks.A notable exception is information retrieval, where additional pre-traininghas so far failed to produce convincing results. We show that, with theright pre-training setup, this barrier can be overcome. We demonstrate thisby pre-training large bi-encoder models on 1) a recently released set of 65 millionsynthetically generated questions, and 2) 200 million post-comment pairs from a preexisting dataset of Reddit conversations made available by pushshift.io. We evaluate on a set of information retrieval and dialogue retrieval benchmarks, showing substantial improvements over supervised baselines.

UniK-QA: Unified Representations of Structured and Unstructured Knowledge for Open-Domain Question Answering
Barlas Oguz | Xilun Chen | Vladimir Karpukhin | Stan Peshterliev | Dmytro Okhonko | Michael Schlichtkrull | Sonal Gupta | Yashar Mehdad | Scott Yih
Findings of the Association for Computational Linguistics: NAACL 2022

We study open-domain question answering with structured, unstructured and semi-structured knowledge sources, including text, tables, lists and knowledge bases. Departing from prior work, we propose a unifying approach that homogenizes all sources by reducing them to text and applies the retriever-reader model which has so far been limited to text sources only. Our approach greatly improves the results on knowledge-base QA tasks by 11 points, compared to latest graph-based methods. More importantly, we demonstrate that our unified knowledge (UniK-QA) model is a simple and yet effective way to combine heterogeneous sources of knowledge, advancing the state-of-the-art results on two popular question answering benchmarks, NaturalQuestions and WebQuestions, by 3.5 and 2.6 points, respectively. The code of UniK-QA is available at: https://github.com/facebookresearch/UniK-QA.

CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning
Xiangru Tang | Arjun Nair | Borui Wang | Bingyao Wang | Jai Desai | Aaron Wade | Haoran Li | Asli Celikyilmaz | Yashar Mehdad | Dragomir Radev
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Factual inconsistencies in generated summaries severely limit the practical applications of abstractive dialogue summarization. Although significant progress has been achieved by using pre-trained neural language models, substantial amounts of hallucinated content are found during the human evaluation. In this work, we first devised a typology of factual errors to better understand the types of hallucinations generated by current models and conducted human evaluation on popular dialog summarization dataset. We further propose a training strategy that improves the factual consistency and overall quality of summaries via a novel contrastive fine-tuning, called CONFIT. To tackle top factual errors from our annotation, we introduce additional contrastive loss with carefully designed hard negative samples and self-supervised dialogue-specific loss to capture the key information between speakers. We show that our model significantly reduces all kinds of factual errors on both SAMSum dialogue summarization and AMI meeting summarization. On both datasets, we achieve significant improvements over state-of-the-art baselines using both automatic metrics, ROUGE and BARTScore, and human evaluation.

Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One?
Xilun Chen | Kushal Lakhotia | Barlas Oguz | Anchit Gupta | Patrick Lewis | Stan Peshterliev | Yashar Mehdad | Sonal Gupta | Wen-tau Yih
Findings of the Association for Computational Linguistics: EMNLP 2022

Despite their recent popularity and well-known advantages, dense retrievers still lag behind sparse methods such as BM25 in their ability to reliably match salient phrases and rare entities in the query and to generalize to out-of-domain data. It has been argued that this is an inherent limitation of dense models. We rebut this claim by introducing the Salient Phrase Aware Retriever (SPAR), a dense retriever with the lexical matching capacity of a sparse model. We show that a dense Lexical Model Λ can be trained to imitate a sparse one, and SPAR is built by augmenting a standard dense retriever with Λ. Empirically, SPAR shows superior performance on a range of tasks including five question answering datasets, MS MARCO passage retrieval, as well as the EntityQuestions and BEIR benchmarks for out-of-domain evaluation, exceeding the performance of state-of-the-art dense and sparse retrievers. The code and models of SPAR are available at: https://github.com/facebookresearch/dpr-scale/tree/main/spar

Simple Local Attentions Remain Competitive for Long-Context Tasks
Wenhan Xiong | Barlas Oguz | Anchit Gupta | Xilun Chen | Diana Liskovich | Omer Levy | Scott Yih | Yashar Mehdad
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Many NLP tasks require processing long contexts beyond the length limit of pretrained models. In order to scale these models to longer text sequences, many efficient long-range attention variants have been proposed. Despite the abundance of research along this direction, it is still difficult to gauge the relative effectiveness of these models in practical use cases, e.g., if we apply these models following the pretrain-and-finetune paradigm. In this work, we aim to conduct a thorough analysis of these emerging models with large-scale and controlled experiments. For each attention variant, we pretrain large-size models using the same long-doc corpus and then finetune these models for real-world long-context tasks. Our findings reveal pitfalls of an existing widely-used long-range benchmark and show none of the tested efficient attentions can beat a simple local window attention under standard pretraining paradigms. Further analysis on local attention variants suggests that even the commonly used attention-window overlap is not necessary to achieve good downstream results — using disjoint local attentions, we are able to build a simpler and more efficient long-doc QA model that matches the performance of Longformer with half of its pretraining compute.

Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries
Xiangru Tang | Alexander Fabbri | Haoran Li | Ziming Mao | Griffin Adams | Borui Wang | Asli Celikyilmaz | Yashar Mehdad | Dragomir Radev
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Current pre-trained models applied for summarization are prone to factual inconsistencies that misrepresent the source text. Evaluating the factual consistency of summaries is thus necessary to develop better models. However, the human evaluation setup for evaluating factual consistency has not been standardized. To determine the factors that affect the reliability of the human evaluation, we crowdsource evaluations for factual consistency across state-of-the-art models on two news summarization datasets using the rating-based Likert Scale and ranking-based Best-Worst Scaling. Our analysis reveals that the ranking-based Best-Worst Scaling offers a more reliable measure of summary quality across datasets and that the reliability of Likert ratings highly depends on the target dataset and the evaluation design. To improve crowdsourcing reliability, we extend the scale of the Likert rating and present a scoring algorithm for Best-Worst Scaling that we call value learning. Our crowdsourcing guidelines will be publicly available to facilitate future work on factual consistency in summarization.

STRUDEL: Structured Dialogue Summarization for Dialogue Comprehension
Borui Wang | Chengcheng Feng | Arjun Nair | Madelyn Mao | Jai Desai | Asli Celikyilmaz | Haoran Li | Yashar Mehdad | Dragomir Radev
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Abstractive dialogue summarization has long been viewed as an important standalone task in natural language processing, but no previous work has explored the possibility of whether abstractive dialogue summarization can also be used as a means to boost an NLP system’s performance on other important dialogue comprehension tasks. In this paper, we propose a novel type of dialogue summarization task - STRUctured DiaLoguE Summarization (STRUDEL) - that can help pre-trained language models to better understand dialogues and improve their performance on important dialogue comprehension tasks. In contrast to the holistic approach taken by the traditional free-form abstractive summarization task for dialogues, STRUDEL aims to decompose and imitate the hierarchical, systematic and structured mental process that we human beings usually go through when understanding and analyzing dialogues, and thus has the advantage of being more focused, specific and instructive for dialogue comprehension models to learn from. We further introduce a new STRUDEL dialogue comprehension modeling framework that integrates STRUDEL into a dialogue reasoning module over transformer encoder language models to improve their dialogue comprehension ability. In our empirical experiments on two important downstream dialogue comprehension tasks - dialogue question answering and dialogue response prediction - we demonstrate that our STRUDEL dialogue comprehension models can significantly improve the dialogue comprehension performance of transformer encoder language models.

Bridging the Training-Inference Gap for Dense Phrase Retrieval
Gyuwan Kim | Jinhyuk Lee | Barlas Oguz | Wenhan Xiong | Yizhe Zhang | Yashar Mehdad | William Yang Wang
Findings of the Association for Computational Linguistics: EMNLP 2022

Building dense retrievers requires a series of standard procedures, including training and validating neural models and creating indexes for efficient search. However, these procedures are often misaligned in that training objectives do not exactly reflect the retrieval scenario at inference time. In this paper, we explore how the gap between training and inference in dense retrieval can be reduced, focusing on dense phrase retrieval (Lee et al., 2021) where billions of representations are indexed at inference. Since validating every dense retriever with a large-scale index is practically infeasible, we propose an efficient way of validating dense retrievers using a small subset of the entire corpus. This allows us to validate various training strategies including unifying contrastive loss terms and using hard negatives for phrase retrieval, which largely reduces the training-inference discrepancy. As a result, we improve top-1 phrase retrieval accuracy by 2 3 points and top-20 passage retrieval accuracy by 2 4 points for open-domain question answering. Our work urges modeling dense retrievers with careful consideration of training and inference via efficient validation while advancing phrase retrieval as a general solution for dense retrieval.

2021

Do Explanations Help Users Detect Errors in Open-Domain QA? An Evaluation of Spoken vs. Visual Explanations
Ana Valeria González | Gagan Bansal | Angela Fan | Yashar Mehdad | Robin Jia | Srinivasan Iyer
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

EASE: Extractive-Abstractive Summarization End-to-End using the Information Bottleneck Principle
Haoran Li | Arash Einolghozati | Srinivasan Iyer | Bhargavi Paranjape | Yashar Mehdad | Sonal Gupta | Marjan Ghazvininejad
Proceedings of the Third Workshop on New Frontiers in Summarization

Current abstractive summarization systems outperform their extractive counterparts, but their widespread adoption is inhibited by the inherent lack of interpretability. Extractive summarization systems, though interpretable, suffer from redundancy and possible lack of coherence. To achieve the best of both worlds, we propose EASE, an extractive-abstractive framework that generates concise abstractive summaries that can be traced back to an extractive summary. Our framework can be applied to any evidence-based text generation problem and can accommodate various pretrained models in its simple architecture. We use the Information Bottleneck principle to jointly train the extraction and abstraction in an end-to-end fashion. Inspired by previous research that humans use a two-stage framework to summarize long documents (Jing and McKeown, 2000), our framework first extracts a pre-defined amount of evidence spans and then generates a summary using only the evidence. Using automatic and human evaluations, we show that the generated summaries are better than strong extractive and extractive-abstractive baselines.

ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining
Alexander Fabbri | Faiaz Rahman | Imad Rizvi | Borui Wang | Haoran Li | Yashar Mehdad | Dragomir Radev
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

While online conversations can cover a vast amount of information in many different formats, abstractive text summarization has primarily focused on modeling solely news articles. This research gap is due, in part, to the lack of standardized datasets for summarizing online discussions. To address this gap, we design annotation protocols motivated by an issues–viewpoints–assertions framework to crowdsource four new datasets on diverse online conversation forms of news comments, discussion forums, community question answering forums, and email threads. We benchmark state-of-the-art models on our datasets and analyze characteristics associated with the data. To create a comprehensive benchmark, we also evaluate these models on widely-used conversation summarization datasets to establish strong baselines in this domain. Furthermore, we incorporate argument mining through graph construction to directly model the issues, viewpoints, and assertions present in a conversation and filter noisy input, showing comparable or improved results according to automatic and human evaluations.

Syntax-augmented Multilingual BERT for Cross-lingual Transfer
Wasi Ahmad | Haoran Li | Kai-Wei Chang | Yashar Mehdad
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In recent years, we have seen a colossal effort in pre-training multilingual text encoders using large-scale corpora in many languages to facilitate cross-lingual transfer learning. However, due to typological differences across languages, the cross-lingual transfer is challenging. Nevertheless, language syntax, e.g., syntactic dependencies, can bridge the typological gap. Previous works have shown that pre-trained multilingual encoders, such as mBERT (CITATION), capture language syntax, helping cross-lingual transfer. This work shows that explicitly providing language syntax and training mBERT using an auxiliary objective to encode the universal dependency tree structure helps cross-lingual transfer. We perform rigorous experiments on four NLP tasks, including text classification, question answering, named entity recognition, and task-oriented semantic parsing. The experiment results show that syntax-augmented mBERT improves cross-lingual transfer on popular benchmarks, such as PAWS-X and MLQA, by 1.4 and 1.6 points on average across all languages. In the generalized transfer setting, the performance boosted significantly, with 3.9 and 3.1 points on average in PAWS-X and MLQA.

Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation
Alexander Fabbri | Simeng Han | Haoyuan Li | Haoran Li | Marjan Ghazvininejad | Shafiq Joty | Dragomir Radev | Yashar Mehdad
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Models pretrained with self-supervised objectives on large text corpora achieve state-of-the-art performance on English text summarization tasks. However, these models are typically fine-tuned on hundreds of thousands of data points, an infeasible requirement when applying summarization to new, niche domains. In this work, we introduce a novel and generalizable method, called WikiTransfer, for fine-tuning pretrained models for summarization in an unsupervised, dataset-specific manner. WikiTransfer fine-tunes pretrained models on pseudo-summaries, produced from generic Wikipedia data, which contain characteristics of the target dataset, such as the length and level of abstraction of the desired summaries. WikiTransfer models achieve state-of-the-art, zero-shot abstractive summarization performance on the CNN-DailyMail dataset and demonstrate the effectiveness of our approach on three additional diverse datasets. These models are more robust to noisy data and also achieve better or comparable few-shot performance using 10 and 100 training examples when compared to few-shot transfer from other summarization datasets. To further boost performance, we employ data augmentation via round-trip translation as well as introduce a regularization term for improved few-shot transfer. To understand the role of dataset aspects in transfer performance and the quality of the resulting output summaries, we further study the effect of the components of our unsupervised fine-tuning data and analyze few-shot performance using both automatic and human evaluation.

FiD-Ex: Improving Sequence-to-Sequence Models for Extractive Rationale Generation
Kushal Lakhotia | Bhargavi Paranjape | Asish Ghoshal | Scott Yih | Yashar Mehdad | Srini Iyer
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Natural language (NL) explanations of model predictions are gaining popularity as a means to understand and verify decisions made by large black-box pre-trained models, for tasks such as Question Answering (QA) and Fact Verification. Recently, pre-trained sequence to sequence (seq2seq) models have proven to be very effective in jointly making predictions, as well as generating NL explanations. However, these models have many shortcomings; they can fabricate explanations even for incorrect predictions, they are difficult to adapt to long input documents, and their training requires a large amount of labeled data. In this paper, we develop FiD-Ex, which addresses these shortcomings for seq2seq models by: 1) introducing sentence markers to eliminate explanation fabrication by encouraging extractive generation, 2) using the fusion-in-decoder architecture to handle long input contexts, and 3) intermediate fine-tuning on re-structured open domain QA datasets to improve few-shot performance. FiD-Ex significantly improves over prior work in terms of explanation metrics and task accuracy on five tasks from the ERASER explainability benchmark in both fully supervised and few-shot settings.

MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark
Haoran Li | Abhinav Arora | Shuohui Chen | Anchit Gupta | Sonal Gupta | Yashar Mehdad
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Scaling semantic parsing models for task-oriented dialog systems to new languages is often expensive and time-consuming due to the lack of available datasets. Available datasets suffer from several shortcomings: a) they contain few languages b) they contain small amounts of labeled examples per language c) they are based on the simple intent and slot detection paradigm for non-compositional queries. In this paper, we present a new multilingual dataset, called MTOP, comprising of 100k annotated utterances in 6 languages across 11 domains. We use this dataset and other publicly available datasets to conduct a comprehensive benchmarking study on using various state-of-the-art multilingual pre-trained models for task-oriented semantic parsing. We achieve an average improvement of +6.3 points on Slot F1 for the two existing multilingual datasets, over best results reported in their experiments. Furthermore, we demonstrate strong zero-shot performance using pre-trained models combined with automatic translation and alignment, and a proposed distant supervision method to reduce the noise in slot label projection.

RECONSIDER: Improved Re-Ranking using Span-Focused Cross-Attention for Open Domain Question Answering
Srinivasan Iyer | Sewon Min | Yashar Mehdad | Wen-tau Yih
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

State-of-the-art Machine Reading Comprehension (MRC) models for Open-domain Question Answering (QA) are typically trained for span selection using distantly supervised positive examples and heuristically retrieved negative examples. This training scheme possibly explains empirical observations that these models achieve a high recall amongst their top few predictions, but a low overall accuracy, motivating the need for answer re-ranking. We develop a successful re-ranking approach (RECONSIDER) for span-extraction tasks that improves upon the performance of MRC models, even beyond large-scale pre-training. RECONSIDER is trained on positive and negative examples extracted from high confidence MRC model predictions, and uses in-passage span annotations to perform span-focused re-ranking over a smaller candidate set. As a result, RECONSIDER learns to eliminate close false positives, achieving a new extractive state of the art on four QA tasks, with 45.5% Exact Match accuracy on Natural Questions with real user questions, and 61.7% on TriviaQA. We will release all related data, models, and code.

2020

Low-Resource Domain Adaptation for Compositional Task-Oriented Semantic Parsing
Xilun Chen | Asish Ghoshal | Yashar Mehdad | Luke Zettlemoyer | Sonal Gupta
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Task-oriented semantic parsing is a critical component of virtual assistants, which is responsible for understanding the user’s intents (set reminder, play music, etc.). Recent advances in deep learning have enabled several approaches to successfully parse more complex queries (Gupta et al., 2018; Rongali et al.,2020), but these models require a large amount of annotated training data to parse queries on new domains (e.g. reminder, music). In this paper, we focus on adapting task-oriented semantic parsers to low-resource domains, and propose a novel method that outperforms a supervised neural model at a 10-fold data reduction. In particular, we identify two fundamental factors for low-resource domain adaptation: better representation learning and better training techniques. Our representation learning uses BART (Lewis et al., 2019) to initialize our model which outperforms encoder-only pre-trained representations used in previous work. Furthermore, we train with optimization-based meta-learning (Finn et al., 2017) to improve generalization to low-resource domains. This approach significantly outperforms all baseline methods in the experiments on a newly collected multi-domain task-oriented semantic parsing dataset (TOPv2), which we release to the public.

Efficient One-Pass End-to-End Entity Linking for Questions
Belinda Z. Li | Sewon Min | Srinivasan Iyer | Yashar Mehdad | Wen-tau Yih
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present ELQ, a fast end-to-end entity linking model for questions, which uses a biencoder to jointly perform mention detection and linking in one pass. Evaluated on WebQSP and GraphQuestions with extended annotations that cover multiple entities per question, ELQ outperforms the previous state of the art by a large margin of +12.7% and +19.6% F1, respectively. With a very fast inference time (1.57 examples/s on a single CPU), ELQ can be useful for downstream question answering systems. In a proof-of-concept experiment, we demonstrate that using ELQ significantly improves the downstream QA performance of GraphRetriever.

Conversational Semantic Parsing
Armen Aghajanyan | Jean Maillard | Akshat Shrivastava | Keith Diedrick | Michael Haeger | Haoran Li | Yashar Mehdad | Veselin Stoyanov | Anuj Kumar | Mike Lewis | Sonal Gupta
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The structured representation for semantic parsing in task-oriented assistant systems is geared towards simple understanding of one-turn queries. Due to the limitations of the representation, the session-based properties such as co-reference resolution and context carryover are processed downstream in a pipelined system. In this paper, we propose a semantic representation for such task-oriented conversational systems that can represent concepts such as co-reference and context carryover, enabling comprehensive understanding of queries in a session. We release a new session-based, compositional task-oriented parsing dataset of 20k sessions consisting of 60k utterances. Unlike Dialog State Tracking Challenges, the queries in the dataset have compositional forms. We propose a new family of Seq2Seq models for the session-based parsing above, which also set state-of-the-art in ATIS, SNIPS, TOP and DSTC2. Notably, we improve the best known results on DSTC2 by up to 5 points for slot-carryover.

2017

DocTag2Vec: An Embedding Based Multi-label Learning Approach for Document Tagging
Sheng Chen | Akshay Soni | Aasish Pappu | Yashar Mehdad
Proceedings of the 2nd Workshop on Representation Learning for NLP

Tagging news articles or blog posts with relevant tags from a collection of predefined ones is coined as document tagging in this work. Accurate tagging of articles can benefit several downstream applications such as recommendation and search. In this work, we propose a novel yet simple approach called DocTag2Vec to accomplish this task. We substantially extend Word2Vec and Doc2Vec – two popular models for learning distributed representation of words and documents. In DocTag2Vec, we simultaneously learn the representation of words, documents, and tags in a joint vector space during training, and employ the simple k-nearest neighbor search to predict tags for unseen documents. In contrast to previous multi-label learning methods, DocTag2Vec directly deals with raw text instead of provided feature vector, and in addition, enjoys advantages like the learning of tag representation, and the ability of handling newly created tags. To demonstrate the effectiveness of our approach, we conduct experiments on several datasets and show promising results against state-of-the-art methods.

2016

Do Characters Abuse More Than Words?
Yashar Mehdad | Joel Tetreault
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Extractive Summarization under Strict Length Constraints
Yashar Mehdad | Amanda Stent | Kapil Thadani | Dragomir Radev | Youssef Billawala | Karolina Buchner
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we report a comparison of various techniques for single-document extractive summarization under strict length budgets, which is a common commercial use case (e.g. summarization of news articles by news aggregators). We show that, evaluated using ROUGE, numerous algorithms from the literature fail to beat a simple lead-based baseline for this task. However, a supervised approach with lightweight and efficient features improves over the lead-based baseline. Additional human evaluation demonstrates that the supervised approach also performs competitively with a commercial system that uses more sophisticated features.

A Low-Rank Approximation Approach to Learning Joint Embeddings of News Stories and Images for Timeline Summarization
William Yang Wang | Yashar Mehdad | Dragomir R. Radev | Amanda Stent
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

Abstractive Summarization of Spoken and Written Conversations Based on Phrasal Queries
Yashar Mehdad | Giuseppe Carenini | Raymond T. Ng
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Abstractive Summarization of Product Reviews Using Discourse Structure
Shima Gerani | Yashar Mehdad | Giuseppe Carenini | Raymond T. Ng | Bita Nejat
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

A Template-based Abstractive Meeting Summarization: Leveraging Summary and Source Text Relationships
Tatsuro Oya | Yashar Mehdad | Giuseppe Carenini | Raymond Ng
Proceedings of the 8th International Natural Language Generation Conference (INLG)

2013

Dialogue Act Recognition in Synchronous and Asynchronous Conversations
Maryam Tavafi | Yashar Mehdad | Shafiq Joty | Giuseppe Carenini | Raymond Ng
Proceedings of the SIGDIAL 2013 Conference

Semeval-2013 Task 8: Cross-lingual Textual Entailment for Content Synchronization
Matteo Negri | Alessandro Marchetti | Yashar Mehdad | Luisa Bentivogli | Danilo Giampiccolo
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

Combining Intra- and Multi-sentential Rhetorical Parsing for Document-level Discourse Analysis
Shafiq Joty | Giuseppe Carenini | Raymond Ng | Yashar Mehdad
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Abstractive Meeting Summarization with Entailment and Fusion
Yashar Mehdad | Giuseppe Carenini | Frank Tompa | Raymond T. Ng
Proceedings of the 14th European Workshop on Natural Language Generation

Towards Topic Labeling with Phrase Entailment and Aggregation
Yashar Mehdad | Giuseppe Carenini | Raymond T. Ng | Shafiq Joty
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

FBK: Machine Translation Evaluation and Word Similarity metrics for Semantic Textual Similarity
José Guilherme Camargo de Souza | Matteo Negri | Yashar Mehdad
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

FBK: Cross-Lingual Textual Entailment Without Translation
Yashar Mehdad | Matteo Negri | José Guilherme C. de Souza
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents
Yashar Mehdad | Matteo Negri | Marcello Federico
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Semeval-2012 Task 8: Cross-lingual Textual Entailment for Content Synchronization
Matteo Negri | Alessandro Marchetti | Yashar Mehdad | Luisa Bentivogli | Danilo Giampiccolo
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Match without a Referee: Evaluating MT Adequacy without Reference Translations
Yashar Mehdad | Matteo Negri | Marcello Federico
Proceedings of the Seventh Workshop on Statistical Machine Translation

Chinese Whispers: Cooperative Paraphrase Acquisition
Matteo Negri | Yashar Mehdad | Alessandro Marchetti | Danilo Giampiccolo | Luisa Bentivogli
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a framework for the acquisition of sentential paraphrases based on crowdsourcing. The proposed method maximizes the lexical divergence between an original sentence s and its valid paraphrases by running a sequence of paraphrasing jobs carried out by a crowd of non-expert workers. Instead of collecting direct paraphrases of s, at each step of the sequence workers manipulate semantically equivalent reformulations produced in the previous round. We applied this method to paraphrase English sentences extracted from Wikipedia. Our results show that, keeping at each round n the most promising paraphrases (i.e. the more lexically dissimilar from those acquired at round n-1), the monotonic increase of divergence allows to collect good-quality paraphrases in a cost-effective manner.

2011

Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora
Matteo Negri | Luisa Bentivogli | Yashar Mehdad | Danilo Giampiccolo | Alessandro Marchetti
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Is it Worth Submitting this Run? Assess your RTE System with a Good Sparring Partner
Milen Kouylekov | Yashar Mehdad | Matteo Negri
Proceedings of the TextInfer 2011 Workshop on Textual Entailment

Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
Yashar Mehdad | Matteo Negri | Marcello Federico
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

Towards Cross-Lingual Textual Entailment
Yashar Mehdad | Matteo Negri | Marcello Federico
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Creating a Bi-lingual Entailment Corpus through Translations with Mechanical Turk: $100 for a 10-day Rush
Matteo Negri | Yashar Mehdad
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

Mining Wikipedia for Large-scale Repositories of Context-Sensitive Entailment Rules
Milen Kouylekov | Yashar Mehdad | Matteo Negri
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper focuses on the central role played by lexical information in the task of Recognizing Textual Entailment. In particular, the usefulness of lexical knowledge extracted from several widely used static resources, represented in the form of entailment rules, is compared with a method to extract lexical information from Wikipedia as a dynamic knowledge resource. The proposed acquisition method aims at maximizing two key features of the resulting entailment rules: coverage (i.e. the proportion of rules successfully applied over a dataset of TE pairs), and context sensitivity (i.e. the proportion of rules applied in appropriate contexts). Evaluation results show that Wikipedia can be effectively used as a source of lexical entailment rules, featuring both higher coverage and context sensitivity with respect to other resources.

Syntactic/Semantic Structures for Textual Entailment Recognition
Yashar Mehdad | Alessandro Moschitti | Fabio Massimo Zanzotto
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2009

Automatic Cost Estimation for Tree Edit Distance Using Particle Swarm Optimization
Yashar Mehdad
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Optimizing Textual Entailment Recognition Using Particle Swarm Optimization
Yashar Mehdad | Bernardo Magnini
Proceedings of the 2009 Workshop on Applied Textual Inference (TextInfer)

Co-authors

Dragomir Radev 7

Luisa Bentivogli 4

Marcello Federico 4

Danilo Giampiccolo 4

Srinivasan Iyer 4

Alessandro Marchetti 4

Asli Celikyilmaz 3

Asish Ghoshal 3

Kushal Lakhotia 3

Alexander Richard Fabbri 3

José G. C. de Souza 2

Claire Na Cheng 2

Marjan Ghazvininejad 2

Vladimir Karpukhin 2

Milen Kouylekov 2

Patrick Lewis 2

Sheng-Chieh Lin 2

Bhargavi Paranjape 2

Stan Peshterliev 2

William Yang Wang 2

Griffin Adams 1

Armen Aghajanyan 1

Abhinav Arora 1

Prajjwal Bhargava 1

Shruti Bhosale 1

Youssef Billawala 1

Karolina Buchner 1

Vikas Chandra 1

Kai-Wei Chang 1

Keith Diedrick 1

Sergey Edunov 1

Arash Einolghozati 1

Chengcheng Feng 1

Ana Valeria González 1

Michael Haeger 1

Madian Khabsa 1

Raghuraman Krishnamoorthi 1

Belinda Z. Li 1

Diana Liskovich 1

Bernardo Magnini 1

Jean Maillard 1

Kshitiz Malik 1

Alessandro Moschitti 1

Sharan Narang 1

Dmytro Okhonko 1

Aleksandra Piktus 1

Sebastian Riedel 1

Karthik Abinav Sankararaman 1

Michael Schlichtkrull 1

Akshat Shrivastava 1

Veselin Stoyanov 1

Maryam Tavafi 1

Joel Tetreault 1

Kapil Thadani 1

Shubham Toshniwal 1

Jeremy Werner 1

Fabio Massimo Zanzotto 1

Luke Zettlemoyer 1

Tiantian Zhang 1

Changsheng Zhao 1

Venues